ptrdvn committed on
Commit
51c069c
1 Parent(s): ab8da27

Update README.md

Files changed (1)
  1. README.md +16 -8
README.md CHANGED
@@ -8,7 +8,6 @@ license: mit
8
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shit Zhu reading a book" width="400"/>
9
  </p>
10
 
11
-
12
  A text scorer which scores text based on the amount of useful, textbook-like information in it.
13
  It outputs a score generally between 0 and 1 but can exceed both of these bounds as it is a regressor.
14
 
@@ -18,13 +17,14 @@ This scorer can be used to filter useful information from large text corpora in
18
 
19
  # How to install
20
 
21
- ```bash
22
- pip install git+https://github.com/lightblue-tech/shitsu.git
23
- ```
24
 
25
  # How to use
26
 
27
- With our scorer package
 
 
 
 
28
 
29
  ```python
30
  from shitsu import ShitsuScorer
@@ -41,9 +41,12 @@ scores
41
  # array([ 0.9897383 , -0.08109612], dtype=float32)
42
  ```
43
 
44
- Without our scorer package (i.e. without pip install)
45
 
46
- ```python
 
 
 
47
 
48
  from safetensors.torch import load_model
49
  import fasttext
@@ -91,6 +94,11 @@ scores
91
  # array([ 0.9897383 , -0.08109612], dtype=float32)
92
  ```
93
 
 
 
 
 
 
94
  # How we made the training data
95
 
96
  We provided a sample of tens of thousands [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
@@ -109,4 +117,4 @@ This resulted in the dataset found at [lightblue/text_ratings](https://huggingfa
109
 
110
  We then trained a small neural network on top of fasttext's embeddings to predict these scores.
111
 
112
- We chose the languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.
 
8
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shit Zhu reading a book" width="400"/>
9
  </p>
10
 
 
11
  A text scorer which scores text based on the amount of useful, textbook-like information in it.
12
  It outputs a score generally between 0 and 1 but can exceed both of these bounds as it is a regressor.
13
 
 
17
 
18
  # How to install
19
 
 
 
 
20
 
21
  # How to use
22
 
23
+ ### With our scorer package
24
+
25
+ ```bash
26
+ pip install git+https://github.com/lightblue-tech/shitsu.git
27
+ ```
28
 
29
  ```python
30
  from shitsu import ShitsuScorer
 
41
  # array([ 0.9897383 , -0.08109612], dtype=float32)
42
  ```
43
 
44
+ ### Without our scorer package (i.e. without pip install)
45
 
46
+ <details>
47
+ <summary>Show full code</summary>
48
+
49
+ ```python
50
 
51
  from safetensors.torch import load_model
52
  import fasttext
 
94
  # array([ 0.9897383 , -0.08109612], dtype=float32)
95
  ```
96
 
97
+ </details>
98
+
99
+
100
+
101
+
102
  # How we made the training data
103
 
104
  We provided a sample of tens of thousands [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
 
117
 
118
  We then trained a small neural network on top of fasttext's embeddings to predict these scores.
119
 
120
+ We chose the 44 languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.
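
The language selection described in the added line is a set union, and a minimal sketch of it might look like the following. The two lists here are short, hypothetical stand-ins (the README does not list the actual top-30 entries), and `union_preserving_order` is an illustrative helper, not part of the `shitsu` package. If both source lists held exactly 30 distinct languages, a union of 44 would imply that 16 languages appear in both (30 + 30 - 44).

```python
# Hypothetical, truncated language lists for illustration only;
# the real selection used the top 30 entries from each source.
ethnologue_top = ["en", "zh", "hi", "es", "fr"]
madlad_top = ["en", "ru", "es", "de", "fr"]


def union_preserving_order(*lists):
    """Merge several lists into one, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for codes in lists:
        for code in codes:
            if code not in seen:
                seen.add(code)
                merged.append(code)
    return merged


languages = union_preserving_order(ethnologue_top, madlad_top)
overlap = set(ethnologue_top) & set(madlad_top)

print(languages)  # ['en', 'zh', 'hi', 'es', 'fr', 'ru', 'de']
# Inclusion-exclusion check: |A ∪ B| = |A| + |B| - |A ∩ B|
print(len(languages) == len(ethnologue_top) + len(madlad_top) - len(overlap))  # True
```

Keeping first-seen order rather than using a bare `set` makes the resulting language list reproducible across runs, which is convenient when it drives dataset sampling.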