README.md · danielpark/ko-llama-2-jindo-13b-instruct at main


	---
	library_name: peft

	datasets:
	- korean-jindo-dataset.json
	# - sciq
	# - metaeval/ScienceQA_text_only
	# - GAIR/lima
	# - Open-Orca/OpenOrca
	# - openbookqa
	language:
	- en
	- ko
	tags:
	- dsdanielpark
	- llama2
	- instruct
	- instruction
	- jindo
	- korean
	- translation
	- 13b
	pipeline_tag: text-generation
	---

	# This model is still under development; I recommend not using it until it reaches development stage 5.

	Development Status :: 2 - Pre-Alpha <br>
	Developed by MinWoo Park, 2023, Seoul, South Korea. [Contact: parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com).
	[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fhuggingface.co%2Fdanielpark%2Fko-llama-2-jindo-13b-instruct&count_bg=%23000000&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=views&edge_flat=false)](https://hits.seeyoufarm.com)

	# danielpark/llama2-jindo-13b-instruct model card
	## `Jindo` is an sLLM for constructing datasets for the LLM `KOLANI`.
	> Warning: Training is still in progress.

	This model is an LLM for several language tasks, including Korean translation and correction.
	Its main purpose is to create a dataset for training the Korean LLM "KOLANI" (which is still in training).
	Because this model is being developed by a single individual without external support, releases and improvements may be relatively slow.
	Jindo is implemented as an sLLM to stay lightweight, so optimization focuses primarily on the 7B version, with the 13B version maintained alongside it.
	Trained with [QLoRA](https://github.com/artidoro/qlora).

	## Model Details
	The weights you are currently viewing are preliminary checkpoints, and the official weights have not been released yet.

	* Developed by: [Minwoo Park](https://github.com/dsdanielpark)
	* Backbone Model: [LLaMA2](https://huggingface.co/meta-llama/Llama-2-7b) [[Paper](https://huggingface.co/papers/2307.09288)]
	* Model Jindo Variations: jindo-instruct
	* jindo-instruct Variations: 2b / 7b / 13b
	  * [danielpark/ko-llama-2-jindo-2b-instruct]() (from LLaMA1)
	  * [danielpark/ko-llama-2-jindo-7b-instruct](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct) (from LLaMA2)
	  * [danielpark/ko-llama-2-jindo-13b-instruct](https://huggingface.co/danielpark/ko-llama-2-jindo-13b-instruct) (from LLaMA2)
	  * This model targets specific domains, so the 70b model will not be released.
	* Quantized Weight: 7b-gptq(4bit-128g)
	  * [ko-llama-2-jindo-7b-instruct-4bit-128g-gptq](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq)
	* Library: [HuggingFace Transformers](https://github.com/huggingface/transformers)
	* License: This model is licensed under Meta's [LLaMA2 license](https://github.com/facebookresearch/llama/blob/main/LICENSE). The dataset's license will be checked ahead of the official release; the default goal is a release that permits commercial use.
	* Where to send comments: To provide feedback or comments on the model, open an issue in the [Hugging Face community's model repository](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct).
	* Contact: For questions and comments about the model, please email me at [parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com).



	## Web Demo
	I implemented the web demos using several popular tools for rapidly creating web UIs.

	| model | web ui | quantized |
	| --- | --- | --- |
	| danielpark/ko-llama-2-jindo-7b-instruct | using [gradio](https://github.com/dsdanielpark/gradio) on [colab](https://colab.research.google.com/drive/1zwR7rz6Ym53tofCGwZZU8y5K_t1r1qqo#scrollTo=p2xw_g80xMsD) | - |
	| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using [text-generation-webui](https://github.com/oobabooga/text-generation-webui) on [colab](https://colab.research.google.com/drive/19ihYHsyg_5QFZ_A28uZNR_Z68E_09L4G) | gptq |
	| danielpark/ko-llama-2-jindo-7b-instruct-ggml | [koboldcpp-v1.38](https://github.com/LostRuins/koboldcpp/releases/tag/v1.38) | ggml |



	## Dataset Details

	### Used Datasets
	- korean-jindo-dataset
	- The dataset has not been released yet

	> No other data was used except for the dataset mentioned above


	### Prompt Template

	```
	### System:
	{System}

	### User:
	{User}

	### Assistant:
	{Assistant}
	```
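
As a minimal sketch, the template above can be filled in with a small helper. The function name `build_prompt` and the default system message are illustrative assumptions, not part of the model repository; the assistant section is left empty so that the model generates the reply.

```python
def build_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Fill the System/User/Assistant template, leaving the assistant turn open."""
    return (
        f"### System:\n{system}\n\n"
        f"### User:\n{user}\n\n"
        "### Assistant:\n"
    )


# Example: the resulting string can be passed to a text-generation pipeline
prompt = build_prompt("Translate into Korean: Good morning.")
print(prompt)
```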




	## Hardware and Software

	* Hardware
	  * Under 10b model: Trained using the free T4 GPU resource.
	  * Over 10b model: Utilized a single A100 on Google Colab.

	* Training Factors: [HuggingFace trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
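
For a rough sense of why model size drives the hardware choice, the sketch below estimates the memory taken by the weights alone at two precisions. It assumes nominal parameter counts of 7e9 and 13e9 and ignores activations, LoRA adapters, optimizer state, and quantization overhead, so real usage is higher. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes)
for name, params in [("7b", 7e9), ("13b", 13e9)]:
    fp16 = weight_memory_gb(params, 16)
    four_bit = weight_memory_gb(params, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GB, 4-bit ~ {four_bit:.1f} GB")
```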

	## Evaluation Results
	Please refer to the following procedure for the evaluation of the backbone model. Other benchmarking and qualitative evaluations for Korean datasets are still pending.

	### Overview
	- We conducted a performance evaluation based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
	We evaluated our model on four benchmark datasets, which include `ARC-Challenge`, `HellaSwag`, `MMLU`, and `TruthfulQA`.
	We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness), specifically commit [b281b0921b636bc36ad05c0b0b0763bd6dd43463](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463).



	## Usage
	Please refer to the following information and install the versions compatible with your environment.
	```
	$ pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
	```
	```python
	import torch
	import transformers
	from transformers import AutoTokenizer

	# Hugging Face Hub id of the model to load
	model = "danielpark/ko-llama-2-jindo-13b-instruct"
	# model = "meta-llama/Llama-2-13b-hf"

	tokenizer = AutoTokenizer.from_pretrained(model)

	# Build a text-generation pipeline in half precision, placing layers automatically
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Sample one completion; max_length counts the prompt tokens as well
	sequences = pipeline(
	    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
	    do_sample=True,
	    top_k=10,
	    num_return_sequences=1,
	    eos_token_id=tokenizer.eos_token_id,
	    max_length=200,
	)
	for seq in sequences:
	    print(f"Result: {seq['generated_text']}")
	```


	To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` and `accelerate` libraries installed.

	```python
	%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
	```

	The instruction following pipeline can be loaded using the `pipeline` function as shown below. This loads a custom `InstructionTextGenerationPipeline`
	found in the model repo [here](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py), which is why `trust_remote_code=True` is required.
	Including `torch_dtype=torch.bfloat16` is generally recommended if this type is supported in order to reduce memory usage. It does not appear to impact output quality.
	It is also fine to remove it if there is sufficient memory.

	```python
	import torch
	from transformers import pipeline

	generate_text = pipeline(model="danielpark/ko-llama-2-jindo-13b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
	```

	You can then use the pipeline to answer instructions:
	```python
	res = generate_text("Explain to me the difference between nuclear fission and fusion.")
	print(res[0]["generated_text"])
	```
	Alternatively, if you prefer to not use `trust_remote_code=True` you can download [instruct_pipeline.py](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py),
	store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:

	```python
	import torch
	from instruct_pipeline import InstructionTextGenerationPipeline
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("danielpark/ko-llama-2-jindo-7b-instruct", padding_side="left")
	model = AutoModelForCausalLM.from_pretrained("danielpark/ko-llama-2-jindo-7b-instruct", device_map="auto", torch_dtype=torch.bfloat16)

	generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
	```

	### LangChain Usage

	To use the pipeline with LangChain, you must set `return_full_text=True`, as LangChain expects the full text to be returned
	and the default for the pipeline is to only return the new text.

	```python
	import torch
	from transformers import pipeline

	generate_text = pipeline(
	    model="danielpark/ko-llama-2-jindo-7b-instruct",
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	    device_map="auto",
	    return_full_text=True,
	)
	```

	You can create a prompt that either has only an instruction or has an instruction with context:

	```python
	from langchain import PromptTemplate, LLMChain
	from langchain.llms import HuggingFacePipeline

	# template for an instruction with no input
	prompt = PromptTemplate(
	    input_variables=["instruction"],
	    template="{instruction}")

	# template for an instruction with input
	prompt_with_context = PromptTemplate(
	    input_variables=["instruction", "context"],
	    template="{instruction}\n\nInput:\n{context}")

	hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

	llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
	llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
	```

	Example predicting using a simple instruction:

	```python
	print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
	```

	Example predicting using an instruction with context:

	```python
	context = """George Washington (February 22, 1732[b] - December 14, 1799) was an American military officer, statesman,
	and Founding Father who served as the first president of the United States from 1789 to 1797."""

	print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
	```



	### Scripts
	- Prepare evaluation environments:
	```
	# clone the repository
	git clone https://github.com/EleutherAI/lm-evaluation-harness.git

	# change to the repository directory
	cd lm-evaluation-harness

	# check out the specific commit
	git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
	```
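
Once the harness is checked out, each benchmark is run with its own few-shot setting. The sketch below only assembles the command lines and prints them; it does not run anything. The few-shot counts (25, 10, 5, 0) are those documented by the Open LLM Leaderboard. The `hf-causal` model type, the task names, the batch size, and the output paths are assumptions about the `main.py` CLI at the pinned commit and should be checked against the harness README.

```python
import shlex

MODEL_ID = "danielpark/ko-llama-2-jindo-13b-instruct"

# (task name, number of few-shot examples); task names are assumed, verify locally
BENCHMARKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("hendrycksTest-*", 5),  # MMLU subtasks
    ("truthfulqa_mc", 0),
]


def build_command(task: str, num_fewshot: int) -> list[str]:
    """Assemble one harness invocation as an argument list."""
    return [
        "python", "main.py",
        "--model", "hf-causal",
        "--model_args", f"pretrained={MODEL_ID}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "2",  # illustrative value
        "--output_path", f"results/{task.replace('*', 'all')}.json",  # hypothetical path
    ]


# Print shell-quoted commands to run from inside the lm-evaluation-harness directory
for task, shots in BENCHMARKS:
    print(shlex.join(build_command(task, shots)))
```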

	## Ethical Issues
	The Jindo model has not been filtered for harmful, biased, or explicit content. As a result, outputs that do not adhere to ethical norms may be generated during use. Please exercise caution when using the model in research or practical applications.

	### Ethical Considerations
	- There were no ethical issues involved, as we did not include the benchmark test set or the training set in the model's training process.
	As always, we encourage responsible and ethical use of this model. Please note that while Jindo strives to provide accurate and helpful responses, it is still crucial to cross-verify the information against reliable sources for knowledge-based queries.

	## Contact Me
	To contact me, email [parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com).

	## Model Architecture

	```python
	LlamaForCausalLM(
	  (model): LlamaModel(
	    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
	    (layers): ModuleList(
	      (0-31): 32 x LlamaDecoderLayer(
	        (self_attn): LlamaAttention(
	          (q_proj): Linear4bit(
	            in_features=4096, out_features=4096, bias=False
	            (lora_dropout): ModuleDict(
	              (default): Dropout(p=0.1, inplace=False)
	            )
	            (lora_A): ModuleDict(
	              (default): Linear(in_features=4096, out_features=64, bias=False)
	            )
	            (lora_B): ModuleDict(
	              (default): Linear(in_features=64, out_features=4096, bias=False)
	            )
	            (lora_embedding_A): ParameterDict()
	            (lora_embedding_B): ParameterDict()
	          )
	          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
	          (v_proj): Linear4bit(
	            in_features=4096, out_features=4096, bias=False
	            (lora_dropout): ModuleDict(
	              (default): Dropout(p=0.1, inplace=False)
	            )
	            (lora_A): ModuleDict(
	              (default): Linear(in_features=4096, out_features=64, bias=False)
	            )
	            (lora_B): ModuleDict(
	              (default): Linear(in_features=64, out_features=4096, bias=False)
	            )
	            (lora_embedding_A): ParameterDict()
	            (lora_embedding_B): ParameterDict()
	          )
	          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
	          (rotary_emb): LlamaRotaryEmbedding()
	        )
	        (mlp): LlamaMLP(
	          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
	          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
	          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
	          (act_fn): SiLUActivation()
	        )
	        (input_layernorm): LlamaRMSNorm()
	        (post_attention_layernorm): LlamaRMSNorm()
	      )
	    )
	    (norm): LlamaRMSNorm()
	  )
	  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
	)
	```
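
The printout implies the LoRA settings: rank 64 (the `out_features` of `lora_A`), dropout 0.1, and adapters on `q_proj` and `v_proj` only. As a worked check under those readings, the sketch below counts the trainable adapter parameters. The constant names are illustrative, and `lora_alpha` is not visible in the printout, so it is not used.

```python
HIDDEN_SIZE = 4096   # in_features / out_features of q_proj and v_proj
LORA_RANK = 64       # out_features of lora_A
NUM_LAYERS = 32      # "(0-31): 32 x LlamaDecoderLayer"
TARGET_MODULES = ("q_proj", "v_proj")

# Each adapted projection adds lora_A (hidden x rank) and lora_B (rank x hidden).
params_per_module = HIDDEN_SIZE * LORA_RANK + LORA_RANK * HIDDEN_SIZE
trainable_params = params_per_module * len(TARGET_MODULES) * NUM_LAYERS

print(f"LoRA parameters per module: {params_per_module:,}")
print(f"Total trainable LoRA parameters: {trainable_params:,}")
```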

	## Training procedure
	The following `bitsandbytes` quantization config was used during training:
	- load_in_8bit: False
	- load_in_4bit: True
	- llm_int8_threshold: 6.0
	- llm_int8_skip_modules: None
	- llm_int8_enable_fp32_cpu_offload: False
	- llm_int8_has_fp16_weight: False
	- bnb_4bit_quant_type: nf4
	- bnb_4bit_use_double_quant: False
	- bnb_4bit_compute_dtype: bfloat16
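
Expressed in code, the 4-bit settings above correspond to a `BitsAndBytesConfig` from `transformers`. The sketch below is an assumed reconstruction from the list, not the author's training script. The import is deferred into the hypothetical helper `make_bnb_config` so the settings can be inspected without `torch` installed.

```python
# Non-default 4-bit settings taken from the list above
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "bfloat16",
}


def make_bnb_config():
    """Build the quantization config; requires torch and transformers."""
    import torch
    from transformers import BitsAndBytesConfig

    settings = dict(QUANT_SETTINGS)
    # Convert the dtype name into the actual torch dtype object
    settings["bnb_4bit_compute_dtype"] = getattr(torch, settings["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**settings)
```

The resulting object would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`.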


	### Framework versions
	- PEFT 0.4.0


	### License:
	The licenses of the pretrained models, llama1, and llama2, along with the datasets used, are applicable. For other datasets related to this work, guidance will be provided in the official release. The responsibility for verifying all licenses lies with the user, and the developer assumes no liability, explicit or implied, including legal responsibilities.


	### Remark:
	- The "instruct" in the model name can be omitted, but it is used to differentiate between the backbones of llama2 for chat and general purposes. Additionally, this model is created for a specific purpose, so we plan to fine-tune it with a dataset focused on instructions.



	### Naive Cost Estimation

	The estimate below assumes that time and cost scale linearly with the number of prompts and ignores other variables.

	- 1000 simple prompts
	- 20 minutes of processing time
	- Approximately $2 in cost (based on Google Colab's 100 computing units and a single A100 GPU estimated at $10).
	- Required CPU RAM: 5GB (depending on the training data and dummy size)
	- Required VRAM: 12-13GB

	Time ($t$) and cost ($c$) are calculated based on the given information as follows:

	$$ t(n) = \frac{20 \text{ minutes}}{1000 \text{ prompts}} \times n $$
	$$ c(n) = \frac{2 \text{ dollars}}{1000 \text{ prompts}} \times n $$

	```python
	def calculate_time_cost(num_prompts):
	    # Reference measurement: 1000 prompts took 20 minutes and cost about $2
	    total_prompts = 1000
	    total_time_minutes = 20
	    total_cost_dollars = 2

	    # Scale linearly with the number of prompts
	    time_required = (total_time_minutes / total_prompts) * num_prompts
	    cost_required = (total_cost_dollars / total_prompts) * num_prompts

	    return time_required, cost_required

	# Example (5000 is an arbitrary illustrative value for n)
	num_prompts = 5000
	time, cost = calculate_time_cost(num_prompts)
	print(f"Time for {num_prompts} prompts: {time} minutes")
	print(f"Cost for {num_prompts} prompts: {cost} dollars")
	```


	### Chinchilla scaling laws
	The Chinchilla scaling laws focus on optimally scaling training compute, but we often also care about inference cost. This tool follows [Harm de Vries’ blog post](https://www.harmdevries.com/post/model-size-vs-compute-overhead/) and visualizes the tradeoff between training compute and inference cost (i.e. model size).
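
To make the tradeoff concrete, here is a rough sketch using two widely quoted approximations that are not stated in this card: about 20 training tokens per parameter for compute-optimal training, and training compute of roughly 6·N·D FLOPs for N parameters and D tokens. Both are rules of thumb, and the function names and nominal parameter counts are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb compute-optimal ratio (assumption)


def chinchilla_optimal_tokens(num_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * num_params


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * num_params * num_tokens


# Nominal parameter counts (assumed round numbers)
for label, n in [("7b", 7e9), ("13b", 13e9)]:
    d = chinchilla_optimal_tokens(n)
    print(f"{label}: ~{d:.2e} tokens, ~{training_flops(n, d):.2e} training FLOPs")
```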