---
language:
- no # Generic Norwegian
- nb # Norwegian Bokmål
- nn # Norwegian Nynorsk
- en # English
- sv # Swedish
- da # Danish
tags:
- norwegian
- bokmål
- nynorsk
- swedish
- danish
- multilingual
- text-generation
pipeline_tag: text-generation
license: llama3.1
---

## Model Card: NB-Llama-3.1-8B-Instruct

---

### Model Overview
**NB-Llama-3.1-8B-Instruct** is part of the **NB-Llama-3.1** series of models, trained on top of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). This multilingual generative model was fine-tuned specifically to support Norwegian Bokmål, Norwegian Nynorsk, and English, with partial support for Swedish and Danish.

The instruct model was trained with Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The SFT stage is based on synthetic datasets: the English [Magpie](https://huggingface.co/Magpie-Align) dataset and a translated and filtered version of it. The DPO stage is based on [Anthropic's Helpful and Harmless](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. The training is intentionally fairly basic, giving the models a solid understanding of the chat template.
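
The exact training configuration is not published here; as background, the standard DPO objective (the loss typically used with preference pairs such as those in hh-rlhf) can be sketched as follows. The function name and the `beta` value are illustrative, not taken from the training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy being trained or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): the loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that widens the preference margin gets a lower loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```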

The basic idea with this model series was to explore how current state-of-the-art models could be improved for Norwegian by training only on publicly available data. While these models are trained by the National Library of Norway, they do not include data that is only available through legal deposit. They do, however, contain public data such as governmental reports that are both publicly available and legally deposited.

---

### Key Features

- **Base Model**: Built on Llama-3.1-8B-Instruct.
- **Languages**:
  - Full support: Norwegian Bokmål (nb), Norwegian Nynorsk (nn), English (en).
  - Partial support: Swedish (sv), Danish (da).
- **Purpose**: Supports Norwegian-specific tasks such as question answering, summarization, and language modeling, while remaining capable of multilingual generation and translation. Efforts have been made to preserve the English capabilities of the underlying Meta Llama model.
- **Training Data**: Combines publicly available multilingual datasets with synthetic data generation, focusing on Norwegian, English, Swedish, and Danish sources. Additional details are provided below.
- **Architecture**: The model uses the Llama 3.1 architecture. It is an auto-regressive language model with an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for alignment.

---

### Model Details

- **Developer**: National Library of Norway (NB-AiLab).
- **Parameters**: 8 billion.
- **Knowledge Cutoff**: May 2024.
- **License**: [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3.1/LICENSE).

---

### Motivation

The primary goal of **NB-Llama-3.1-8B-Instruct** is to advance support for Norwegian language technologies and strengthen support for Norwegian Bokmål and Norwegian Nynorsk. Since much knowledge and culture are also expressed in English, Swedish, and Danish, open sources in these languages are included in the training datasets where possible.

---

### Intended Use

#### Use Cases

- Dialogue systems.
- General multilingual text generation and language modeling.
- Norwegian-specific tasks such as:
  - Summarization of texts in Bokmål or Nynorsk.
  - Question answering tailored to Norwegian cultural and linguistic contexts.

#### Out-of-Scope

- Use in violation of applicable laws or regulations.
- Tasks outside the supported languages without additional fine-tuning.
- High-risk domains without appropriate safety measures.

---

### How to Use

Please note that this is still a research project, and the purpose of releasing the models is to investigate the potential of adapting these models for the Norwegian language. The intended use case is experimental. For end users, we strongly recommend using the instruction-tuned models. We provide quantized models with close to the same accuracy that will run much faster on most platforms. When fine-tuning the instruction-tuned models, the best results are obtained by applying the appropriate Llama 3.1 chat templates.

#### Using `transformers`

```python
import torch
from transformers import pipeline

model_id = "north/nb-llama-3.1-8B-Instruct"

# Load the instruction-tuned model; device_map="auto" places it on a GPU if available
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Hvem er du?"},  # "Who are you?"
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
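
The pipeline above applies the chat template automatically; when building prompts manually, prefer `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`. As an illustration of what that template expands to, here is a rough sketch of the Llama 3.1 prompt format. The special-token layout follows the publicly documented Llama 3.1 format and should be verified against the model's tokenizer config; the helper function itself is hypothetical.

```python
# Sketch of the Llama 3.1 chat template. In practice, use
# tokenizer.apply_chat_template(...); this only illustrates the layout.
def format_llama31_prompt(messages):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header followed by the content and <|eot_id|>
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(f"{msg['content']}<|eot_id|>")
    # A trailing assistant header cues the model to generate its reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama31_prompt([
    {"role": "system", "content": "Du er en hjelpsom assistent."},
    {"role": "user", "content": "Hvem er du?"},
])
print(prompt)
```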
---

### Training Data

**Overview:**

The training data is based entirely on publicly available datasets and synthetically generated data. A key aspect of the training process was leveraging high-quality knowledge sources in Norwegian, English, Swedish, and Danish.

For SFT and DPO, the following data were used:
- [Magpie](https://huggingface.co/Magpie-Align)
- [Anthropic's Helpful and Harmless](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- Various synthetic and translated datasets
- See [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) for more details on the pretraining data and data selection

---

### Licensing

The model is released under the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3.1/LICENSE), allowing research and commercial use within defined limitations. Refer to the [Acceptable Use Policy](https://llama.meta.com/llama3.1/use-policy) for specific restrictions.

---

### Citing & Authors

The model was trained, and this documentation written, by Per Egil Kummervold.