lightblue/shitsu_text_scorer

ptrdvn committed on Aug 28

Commit 51c069c • 1 Parent(s): ab8da27

Update README.md

Files changed (1)

README.md +16 -8

README.md CHANGED

@@ -8,7 +8,6 @@ license: mit
     <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shit Zhu reading a book" width="400"/>
 </p>
 A text scorer which scores text based on the amount of useful, textbook-like information in it.
 It outputs a score generally between 0 and 1 but can exceed both of these bounds as it is a regressor.
@@ -18,13 +17,14 @@ This scorer can be used to filter useful information from large text corpora in
 # How to install
-```bash
-pip install git+https://github.com/lightblue-tech/shitsu.git
-```
 # How to use
-With our scorer package
 ```python
 from shitsu import ShitsuScorer
@@ -41,9 +41,12 @@ scores
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```
-Without our scorer package (i.e. without pip install)
-```python
 from safetensors.torch import load_model
 import fasttext
@@ -91,6 +94,11 @@ scores
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```
 # How we made the training data
 We provided a sample of tens of thousands [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
@@ -109,4 +117,4 @@ This resulted in the dataset found at [lightblue/text_ratings](https://huggingfa
 We then trained a small neural network on top of fasttext's embeddings to predict these scores.
-We chose the languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.

     <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shit Zhu reading a book" width="400"/>
 </p>
 A text scorer which scores text based on the amount of useful, textbook-like information in it.
 It outputs a score generally between 0 and 1 but can exceed both of these bounds as it is a regressor.
 # How to install
 # How to use
+### With our scorer package
+```bash
+pip install git+https://github.com/lightblue-tech/shitsu.git
+```
 ```python
 from shitsu import ShitsuScorer
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```
+### Without our scorer package (i.e. without pip install)
+<details>
+  <summary>Show full code</summary>
+  ```python
 from safetensors.torch import load_model
 import fasttext
 # array([ 0.9897383 , -0.08109612], dtype=float32)
 ```
+</details>
 # How we made the training data
 We provided a sample of tens of thousands [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
 We then trained a small neural network on top of fasttext's embeddings to predict these scores.
+We chose the 44 languages in this dataset by making a union set of the 30 most popular languages on earth as according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.
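
The language selection described in the last added line is a plain set union. Two top-30 lists that yield 44 distinct languages must share 30 + 30 - 44 = 16 entries. A minimal sketch of that arithmetic follows, assuming hypothetical placeholder codes (`lang_00` and so on) rather than the real Ethnologue or MADLAD-400 rankings, which this diff does not list:

```python
# Hypothetical placeholder codes; NOT the real Ethnologue or MADLAD-400 rankings.
# The two lists are built to share 16 entries so that the union has 44 members,
# matching the count stated in the README.
top_ethnologue = [f"lang_{i:02d}" for i in range(30)]      # lang_00 .. lang_29
top_madlad = [f"lang_{i:02d}" for i in range(14, 44)]      # lang_14 .. lang_43

# Union: every language that appears in at least one of the two lists.
selected = sorted(set(top_ethnologue) | set(top_madlad))

# Intersection: languages that appear in both lists.
overlap = set(top_ethnologue) & set(top_madlad)

print(len(selected))  # 44
print(len(overlap))   # 16
```

The names `top_ethnologue`, `top_madlad`, and `selected` are illustrative only; the README does not say how the authors actually computed the union.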