	---
	language:
	- da
	license: llama2
	library_name: transformers
	base_model:
	- NLPnorth/snakmodel-7b-base
	pipeline_tag: text-generation
	---
	![SnakModel Instruct Logo](snakmodel.png)

	## Model Details

	SnakModel is a 7B-parameter model specifically designed for the Danish language. This is the instruction-tuned variant: `SnakModel-7B (instruct)`. Our models build upon [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf), which we continuously pre-train on a diverse collection of Danish corpora comprising 350M documents and 13.6B words, before tuning it on 3.7M Danish instruction-answer pairs.

	Model Developers

	[NLPnorth research unit](https://nlpnorth.github.io) at the [IT University of Copenhagen](https://itu.dk), Denmark.

	Variations

	SnakModel comes in two versions: an instruction-tuned version and a base version. In addition, each model includes intermediate checkpoints (under model revisions).
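
As a rough sketch of how the intermediate checkpoints might be accessed: the helpers below list the branches and tags of the model repository and load one of them through the `revision` argument of `from_pretrained`. The function names `list_checkpoint_revisions` and `load_revision` are hypothetical, and the sketch assumes the checkpoints are published as branches or tags of the same repository. No revision names are given here, so none are hard-coded.

```python
REPO_ID = "NLPnorth/snakmodel-7b-instruct"


def list_checkpoint_revisions(repo_id: str = REPO_ID) -> list[str]:
    """Return branch and tag names of a Hub repository (needs network access)."""
    # Imported lazily so the helper can be defined without the dependency loaded.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return [ref.name for ref in list(refs.branches) + list(refs.tags)]


def load_revision(revision: str, repo_id: str = REPO_ID):
    """Load the model weights stored under a given revision (branch, tag, or commit)."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        repo_id,
        revision=revision,   # assumption: checkpoints are exposed as Hub revisions
        torch_dtype="auto",
        device_map="auto",
    )
```

Calling `list_checkpoint_revisions()` first and passing one of the returned names to `load_revision` avoids guessing a revision name.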

	Input

	Text only, with instructions following the `[INST] {instruction} [/INST]` template.

	Quickstart:

	The following snippet loads the tokenizer and model, formats a conversation with `apply_chat_template`, and generates a response.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "NLPnorth/snakmodel-7b-instruct"

	# Load the model and tokenizer; dtype and device placement are chosen automatically.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	# Danish prompt: "Where is the IT University located?"
	prompt = "Hvor ligger IT Universitet?"
	messages = [
	    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
	    {"role": "user", "content": prompt}
	]
	# Render the conversation into the model's instruction template as plain text.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True
	)

	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=20
	)
	# Strip the prompt tokens so that only the newly generated tokens remain.
	generated_ids = [
	    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```

	Output

	Text only.

	Model Architecture

	SnakModel is an auto-regressive, transformer-based language model. The `instruct` version uses supervised fine-tuning (SFT) to enable instruction following in Danish.

	Model Dates

	SnakModel was trained between January 2024 and September 2024.

	License

	This model follows the original [Llama 2 license agreement](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/LICENSE.txt).

	Research Paper

	[Released in Q1 2025]

	## Intended Use & Limitations

	Intended Use Cases

	SnakModel is intended for use in Danish. The instruction-tuned variant is intended for assistant-like chat.

	The `instruct` variant follows the Llama 2 (chat) instruction template, in which instructions are encapsulated in special tokens, i.e., `[INST] {instruction} [/INST]`.
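
For cases where the prompt is built by hand rather than through the tokenizer's chat template, the template above can be sketched as a small helper. The function name `format_instruction` is hypothetical, and the sketch covers only the single-turn form stated above, without a system prompt or multi-turn history.

```python
def format_instruction(instruction: str) -> str:
    """Wrap a single instruction in the [INST] ... [/INST] template."""
    # Surrounding whitespace is stripped so the markers are separated
    # from the instruction by exactly one space on each side.
    return f"[INST] {instruction.strip()} [/INST]"


# Danish example: "Where is the IT University located?"
prompt = format_instruction("Hvor ligger IT-Universitetet?")
print(prompt)
```

The resulting string can be passed to the tokenizer directly in place of the output of `apply_chat_template`.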

	Limitations

	SnakModel variants are fine-tuned on Danish data. As such, use in other languages is out of scope. While we found SnakModel to be more proficient in Danish than other Llama 2-based models, it still frequently generates factually incorrect output. Make sure to carefully evaluate and weigh these factors before deploying the model. In addition, make sure to adhere to the original [Llama 2 license agreement](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/LICENSE.txt).

	## Hardware and Software

	Training Factors

	SnakModel was trained on private infrastructure consisting of a single node with four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128 Core Processor and 1TB of RAM.

	Carbon Footprint

	Total training time amounted to 8,928 GPU hours, with an average carbon efficiency of 0.122kg CO2eq / kWh. This is equivalent to 272.3kg CO2eq emitted, based on the [Machine Learning Impact calculator](https://mlco2.github.io/impact).
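
The emission figure can be reproduced with a short calculation. This is a sketch under one assumption that is not stated above: each GPU draws 250 W, the rated power of an A100-PCIe 40GB card. With that value, the GPU hours and carbon efficiency given above yield the reported total.

```python
GPU_HOURS = 8_928                      # total GPU hours reported above
CARBON_INTENSITY_KG_PER_KWH = 0.122    # average carbon efficiency reported above
ASSUMED_GPU_POWER_KW = 0.250           # assumption: 250 W per A100-PCIe 40GB GPU


def estimate_emissions_kg(gpu_hours: float, gpu_power_kw: float, intensity: float) -> float:
    """Estimate emissions as energy used (kWh) times carbon intensity (kg CO2eq / kWh)."""
    energy_kwh = gpu_hours * gpu_power_kw
    return energy_kwh * intensity


emissions = estimate_emissions_kg(GPU_HOURS, ASSUMED_GPU_POWER_KW, CARBON_INTENSITY_KG_PER_KWH)
print(f"{emissions:.1f} kg CO2eq")
```

The estimate covers GPU power draw only; it leaves out CPU, memory, and cooling overhead.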

	## Training Data

	Overview

	SnakModel was continuously pre-trained on a diverse collection of Danish corpora comprising 350M documents and 13.6B words. The `instruct` version is further tuned on 3.7M Danish instruction-answer pairs.

	[Details to follow in Q1 2025]

	Data Freshness

	The pre-training data has a cutoff of January 2024.

	## Evaluation Results

	[Released in Q1 2025]

	## Citation

	[Released in Q1 2025]