Update README.md
README.md
CHANGED
@@ -8,7 +8,6 @@ license: mit
 <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shit Zhu reading a book" width="400"/>
 </p>

-
 A text scorer which scores text based on the amount of useful, textbook-like information in it.
 It outputs a score generally between 0 and 1 but can exceed both of these bounds as it is a regressor.

@@ -18,13 +17,14 @@ This scorer can be used to filter useful information from large text corpora in

 # How to install

-```bash
-pip install git+https://github.com/lightblue-tech/shitsu.git
-```

 # How to use

-With our scorer package
+### With our scorer package
+
+```bash
+pip install git+https://github.com/lightblue-tech/shitsu.git
+```

 ```python
 from shitsu import ShitsuScorer
@@ -41,9 +41,12 @@ scores
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```

-Without our scorer package (i.e. without pip install)
+### Without our scorer package (i.e. without pip install)

-
+<details>
+<summary>Show full code</summary>
+
+```python

 from safetensors.torch import load_model
 import fasttext
@@ -91,6 +94,11 @@ scores
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```

+</details>
+
+
+
+
 # How we made the training data

 We provided a sample of tens of thousands [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
@@ -109,4 +117,4 @@ This resulted in the dataset found at [lightblue/text_ratings](https://huggingfa

 We then trained a small neural network on top of fasttext's embeddings to predict these scores.

-We chose the languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.
+We chose the 44 languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.
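A quick sanity check on the "44 languages" figure introduced in the last changed line: if each of the two source lists holds 30 distinct languages and their union holds 44, then by inclusion-exclusion 16 languages should appear in both lists. The sketch below only performs that arithmetic. The helper name `overlap_size` is made up for illustration and is not part of the shitsu package, and the assumption that both lists contain exactly 30 distinct entries comes from reading the README sentence literally.

```python
def overlap_size(size_a: int, size_b: int, union_size: int) -> int:
    """Return |A ∩ B| given |A|, |B| and |A ∪ B| (inclusion-exclusion)."""
    overlap = size_a + size_b - union_size
    # The overlap can be neither negative nor larger than the smaller set.
    if overlap < 0 or overlap > min(size_a, size_b):
        raise ValueError("set sizes are inconsistent")
    return overlap


# Sizes taken from the README text: two top-30 lists, 44 languages in the union.
shared = overlap_size(30, 30, 44)
print(shared)  # 16
```

The same helper raises a `ValueError` for impossible combinations (for example a union larger than 60), which makes it a cheap guard if the language count in the README is edited again.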