Update README.md
README.md
@@ -1,3 +1,61 @@
---
license: bigscience-bloom-rail-1.0
language:
- ar
- en
---

### **Noon - a 7-billion parameter Arabic Large Language Model**

We present the 7-billion-parameter variant of **Noon**, an Arabic Large Language Model based on **BLOOM**, a foundation model released by the [BigScience](https://huggingface.co/bigscience) workshop.

Noon was trained primarily to respond to various types of instructions and questions (text generation, code generation, mathematical problems, closed/open-book questions, etc.).

We trained the model using the ColossalAI framework, which fully supports HuggingFace library models and implements different optimization and quantization techniques for billion-scale LLMs.

The training data is a combination of Arabic datasets covering multiple tasks; more details are provided in the Dataset section.

### **Usage**

Using our model requires only the Transformers library. It can be loaded as follows:

```python
from transformers import BloomTokenizerFast, BloomForCausalLM, pipeline

# Arabic instruction: "Write an article of several lines about artificial intelligence and its developments"
text = "اكتب مقالا من عدة أسطر عن الذكاء الصناعي وتطوراته"
prompt = f'Instruction:\n{text}\n\nResponse:'

model = BloomForCausalLM.from_pretrained('Naseej/noon-7b')

# We use the pretrained BLOOM tokenizer as it provides full support for the Arabic vocabulary
tokenizer = BloomTokenizerFast.from_pretrained('bigscience/bloomz-7b1-mt')

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# We recommend the provided hyperparameters for generation
# But encourage you to try different values
# Note: top_p only takes effect when sampling is enabled (do_sample=True)
response = generation_pipeline(
    prompt,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=4,
    max_length=500,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
)[0]['generated_text']

print(response)
```

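The prompt layout used above (`Instruction:`, the user text, a blank line, then `Response:`) can be wrapped in small helpers so that every request is formatted the same way. The following is a minimal sketch, not part of the released code: the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_response` are hypothetical, and the sketch assumes the pipeline returns the prompt followed by the completion, which is the default behaviour of the Transformers `text-generation` pipeline.

```python
# Hypothetical helpers (not part of the Noon release) for the prompt
# layout shown in the usage example above.
PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:"


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Instruction/Response layout."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(generated_text: str, prompt: str) -> str:
    """Return only the completion, dropping the echoed prompt if present.

    Assumption: the generated text starts with the prompt, as it does when
    the pipeline is called with its default return_full_text=True.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


if __name__ == "__main__":
    demo_prompt = build_prompt("  Summarize the text in one sentence.  ")
    # Stand-in for a real model output, used here only to show the flow.
    fake_output = demo_prompt + " A short summary."
    print(extract_response(fake_output, demo_prompt))
```

With these helpers, `prompt = build_prompt(text)` would replace the f-string in the usage example, and `extract_response(response, prompt)` would return only the text the model produced after `Response:`.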
### **Training's computational requirements**

### **Dataset**

### **Disclaimer**

The generated responses from this AI model are purely algorithmic and should be interpreted with caution. The model's outputs may occasionally exhibit bias, offensive language, or potentially harmful content. It is important to note that these responses do not reflect the personal preferences or viewpoints of the authors or of Naseej as an organization.

While every effort is made to mitigate the harmfulness of the model's outputs, it is impossible to guarantee the complete elimination of biases or offensive content. The model learns from vast amounts of data and may inadvertently replicate or amplify existing societal biases present in the training data.

Users are advised to critically evaluate and verify the information provided by the model. Exercise discretion when using the model's responses, particularly on sensitive or controversial topics.

We are committed to ongoing research and development to improve the model's performance, minimize biases, and reduce harmful outputs. Your feedback and insights are valuable in helping us achieve these goals.