	---
	library_name: transformers
	license: apache-2.0
	tags:
	- jamba
	- mamba
	- moe
	---

	# Please refrain from using this model yet. It does not contain usable weights.

	# Expert weights of [Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1)

	Weights required for follow-up research.

	The original model is [AI21 Labs' Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1), which requires more than 80GB of VRAM. Unfortunately, this much memory is almost never available via Google Colab or cloud computing services. We therefore attempted to split the MoE (Mixture of Experts) layers, using the following resources as a basis:
	- Original Model: [Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1)
	- MoE Layer Separation: Consult [this script](https://github.com/TechxGenus/Jamba-utils/blob/main/dense_downcycling.py) written by [@TechxGenus](https://github.com/TechxGenus), and use [TechxGenus/Jamba-v0.1-9B](https://huggingface.co/TechxGenus/Jamba-v0.1-9B).


	<br><br><br><br><br><br>


	# Original Model Card from [AI21 Labs' Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1)


	## Usage

	The code below is based on the model card of [AI21 Labs' Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1).

	### Prerequisites

	To use Jamba, `transformers` version 4.40.0 or higher is recommended (version 4.39.0 is the minimum required):
	```bash
	pip install "transformers>=4.40.0"
	```

	For optimized Mamba implementations, install `mamba-ssm` and `causal-conv1d`:
	```bash
	pip install mamba-ssm "causal-conv1d>=1.2.0"
	```
	The optimized kernels require the model to be on a CUDA device.

	You can run the model without the optimized Mamba kernels, but this is not recommended because it results in significantly higher latency. To do so, specify `use_mamba_kernels=False` when loading the model.
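The fallback described above can be selected automatically. The following is a minimal sketch, assuming the two optional packages are importable as `mamba_ssm` and `causal_conv1d`; the helper names `mamba_kernels_available` and `jamba_load_kwargs` are hypothetical and not part of any library.

```python
import importlib.util


def mamba_kernels_available() -> bool:
    """Return True if both optional kernel packages can be found."""
    # Assumed import names for the `mamba-ssm` and `causal-conv1d` packages.
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("mamba_ssm", "causal_conv1d")
    )


def jamba_load_kwargs() -> dict:
    """Build keyword arguments for `from_pretrained` (hypothetical helper)."""
    # Fall back to the slower pure-PyTorch path when the kernels are missing.
    return {"use_mamba_kernels": mamba_kernels_available()}


print(jamba_load_kwargs())
```

The resulting dictionary could then be passed along when loading, for example `AutoModelForCausalLM.from_pretrained("danielpark/asp-9b-inst-base", **jamba_load_kwargs())`. The check only detects whether the packages are installed; it does not verify that a CUDA device is present.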

	### Run the model

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("danielpark/asp-9b-inst-base")
	tokenizer = AutoTokenizer.from_pretrained("danielpark/asp-9b-inst-base")

	input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]

	outputs = model.generate(input_ids, max_new_tokens=216)

	print(tokenizer.batch_decode(outputs))
	# ["In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"]
	```

	When using `transformers<4.40.0`, pass `trust_remote_code=True` when loading the model so that the new Jamba architecture can run.
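The version rule above can be expressed as a small check. This is only a sketch: `needs_trust_remote_code` is a hypothetical helper, and it compares just the major and minor components, so pre-release builds are treated like the corresponding release.

```python
def needs_trust_remote_code(transformers_version: str) -> bool:
    """Return True for transformers releases older than 4.40."""
    # Only "major.minor" is compared; this is a simplification that avoids
    # extra dependencies for version parsing.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    return (major, minor) < (4, 40)


for version in ("4.39.3", "4.40.0", "4.41.2"):
    print(version, needs_trust_remote_code(version))
```

In practice the result could be forwarded as `trust_remote_code=needs_trust_remote_code(transformers.__version__)` in the `from_pretrained` call.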

	<details>
	<summary><strong>Loading the model in half precision</strong></summary>

	The published checkpoint is saved in BF16. To load it into RAM in BF16/FP16, specify `torch_dtype`:

	```python
	from transformers import AutoModelForCausalLM
	import torch
	model = AutoModelForCausalLM.from_pretrained(
	    "danielpark/asp-9b-inst-base",
	    torch_dtype=torch.bfloat16,  # you can also use torch_dtype=torch.float16
	)
	```

	When using half precision, you can enable the [FlashAttention2](https://github.com/Dao-AILab/flash-attention) implementation of the Attention blocks. To use it, the model must be on a CUDA device. Since the original Jamba-v0.1 is too big to fit on a single 80GB GPU in this precision, parallelize it using [accelerate](https://huggingface.co/docs/accelerate/index):
	```python
	from transformers import AutoModelForCausalLM
	import torch
	model = AutoModelForCausalLM.from_pretrained(
	    "danielpark/asp-9b-inst-base",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",
	    device_map="auto",
	)
	```
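A rough back-of-the-envelope calculation shows why the original model does not fit on one 80GB GPU in half precision. The sketch below uses the 52B parameter count quoted for Jamba-v0.1 in this card and counts weights only; activations, caches, and framework overhead are ignored, so real usage is higher. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


JAMBA_V01_PARAMS = 52e9  # total parameter count quoted for Jamba-v0.1

for name, size in (("float32", 4), ("bfloat16/float16", 2)):
    print(f"{name}: {weight_memory_gb(JAMBA_V01_PARAMS, size):.0f} GB")
# float32: 208 GB
# bfloat16/float16: 104 GB
```

At 2 bytes per parameter the weights alone come to about 104 GB, which exceeds 80 GB, hence the need to shard the model across devices.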

	</details>
	<details><summary><strong>Load the model in 8-bit</strong></summary>

	Using 8-bit precision, sequence lengths of up to 140K can fit on a single 80GB GPU. Quantize the model to 8-bit using [bitsandbytes](https://huggingface.co/docs/bitsandbytes/index). To avoid degrading model quality, exclude the Mamba blocks from quantization:

	```python
	import torch
	from transformers import AutoModelForCausalLM, BitsAndBytesConfig

	# Skip the Mamba blocks so they stay in their original precision.
	quantization_config = BitsAndBytesConfig(
	    load_in_8bit=True,
	    llm_int8_skip_modules=["mamba"],
	)
	model = AutoModelForCausalLM.from_pretrained(
	    "ai21labs/Jamba-v0.1",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",
	    quantization_config=quantization_config,
	)
	```
	</details>

	### Fine-tuning example

	Jamba is a base model that can be fine-tuned for custom solutions (including for chat/instruct versions). Fine-tune it using any technique of your choice. Here's an example of fine-tuning with the [PEFT](https://huggingface.co/docs/peft/index) library:

	```python
	from datasets import load_dataset
	from trl import SFTTrainer
	from peft import LoraConfig
	from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments

	tokenizer = AutoTokenizer.from_pretrained("danielpark/asp-9b-inst-base")
	model = AutoModelForCausalLM.from_pretrained("danielpark/asp-9b-inst-base", device_map='auto')

	dataset = load_dataset("Abirate/english_quotes", split="train")
	training_args = TrainingArguments(
	    output_dir="./results",
	    num_train_epochs=3,
	    per_device_train_batch_size=4,
	    logging_dir='./logs',
	    logging_steps=10,
	    learning_rate=2e-3
	)
	lora_config = LoraConfig(
	    r=8,
	    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],
	    task_type="CAUSAL_LM",
	    bias="none"
	)
	trainer = SFTTrainer(
	    model=model,
	    tokenizer=tokenizer,
	    args=training_args,
	    peft_config=lora_config,
	    train_dataset=dataset,
	    dataset_text_field="quote",
	)

	trainer.train()
	```
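To give a sense of why LoRA with `r=8` is cheap to train, the sketch below counts the parameters a rank-`r` adapter adds to a single weight matrix. The layer shape is hypothetical and chosen only for illustration; it is not taken from the Jamba configuration, and `lora_added_params` is not a PEFT function.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # LoRA learns A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)


# Hypothetical layer shape, for illustration only.
d_in, d_out, r = 4096, 8192, 8
full = d_in * d_out
added = lora_added_params(d_in, d_out, r)
print(f"adapter params: {added}, frozen params: {full}")
```

The adapter grows linearly with the layer dimensions while the frozen matrix grows with their product, so the trainable fraction stays small for large layers.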


	## Further
	See [ai21labs/Jamba-tiny-random](https://huggingface.co/ai21labs/Jamba-tiny-random), which has 128M parameters (instead of 52B). It is initialized with random weights and did not undergo any training.