	---
	library_name: transformers
	license: apache-2.0
	language:
	- en
	- ko
	base_model:
	- meta-llama/Meta-Llama-3-8B
	---

	<a href="https://github.com//KULLM">
	<img src="./bllossom_icon.png" width="40%" height="40%">
	</a>


	# Bllossom | [Demo](https://c537bba37aaab5fc9e.gradio.live) | [Homepage](https://www.bllossom.ai/) |

	Bllossom is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:

	* Knowledge Linking: Links Korean and English knowledge through additional training.
	* Vocabulary Expansion: Expands the Korean vocabulary to improve Korean expressiveness.
	* Instruction Tuning: Tuned on custom-made instruction-following data specialized for the Korean language and Korean culture.
	* Human Feedback: DPO (Direct Preference Optimization) has been applied.
	* Vision-Language Alignment: Aligns a vision transformer with this language model.

	This model was developed by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/) and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim).

	## NEWS
	* [2024/04] We released Bllossom v2.0, based on Llama 3.
	* [2023/12] We released Bllossom-Vision v1.0, based on Bllossom.
	* [2023/08] We released Bllossom v1.0, based on Llama 2.
	* [2023/07] We released Bllossom v0.7, based on Polyglot-Ko.


	## Example code
	### Install Dependencies
	```bash
	pip install torch transformers==4.40.0 accelerate
	```

	### Python code with Pipeline
	```python
	import transformers
	import torch

	model_id = "MLP-KTLim/Bllossom"

	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model_id,
	    model_kwargs={"torch_dtype": torch.bfloat16},
	    device_map="auto",
	)

	pipeline.model.eval()

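	# System prompt (Korean): "You are a helpful AI assistant. Answer the user's queries kindly and accurately."
	PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.'''
	# User instruction (Korean): "Tell me about the MLP lab at Seoultech."
	instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"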

	messages = [
	    {"role": "system", "content": PROMPT},
	    {"role": "user", "content": instruction},
	]

	# Render the chat messages into a single prompt string.
	prompt = pipeline.tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Stop on either the regular EOS token or the end-of-turn token.
	terminators = [
	    pipeline.tokenizer.eos_token_id,
	    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
	]

	outputs = pipeline(
	    prompt,
	    max_new_tokens=2048,
	    eos_token_id=terminators,
	    do_sample=True,
	    temperature=0.6,
	    top_p=0.9,
	)

	print(outputs[0]["generated_text"][len(prompt):])

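	# Example output (Korean):
	# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
	# English gloss: the MLP lab at Seoultech works on multimodal natural language processing; its members are Professor KyungTae Lim and nine students.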
	```
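To see roughly what `apply_chat_template` hands to the model, the sketch below renders a message list by hand. It assumes the tokenizer ships the standard Llama 3 Instruct chat template (the base model is Llama 3 and the example above stops on `<|eot_id|>`); the template actually stored with this model may differ, so treat `tokenizer.chat_template` as the source of truth. `render_llama3_prompt` is a hypothetical helper, not part of transformers.

```python
# Hypothetical helper: hand-renders messages in the standard Llama 3 Instruct layout.
# Assumption: this model's tokenizer uses that layout; inspect tokenizer.chat_template to confirm.

def render_llama3_prompt(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Introduce the MLP lab at Seoultech."},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Under this layout every turn ends with `<|eot_id|>`, which is why the example lists that token as a terminator next to the regular EOS token.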

	### Python code with AutoModel
	```python

	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = 'MLP-KTLim/Bllossom'

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	model.eval()

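	# System prompt (Korean): "You are a helpful AI assistant. Answer the user's queries kindly and accurately."
	PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.'''
	# User instruction (Korean): "Tell me about the MLP lab at Seoultech."
	instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"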

	messages = [
	    {"role": "system", "content": PROMPT},
	    {"role": "user", "content": instruction},
	]

	# Tokenize the chat messages and move the tensor to the model's device.
	input_ids = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_tensors="pt",
	).to(model.device)

	# Stop on either the regular EOS token or the end-of-turn token.
	terminators = [
	    tokenizer.eos_token_id,
	    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
	]

	outputs = model.generate(
	    input_ids,
	    max_new_tokens=2048,
	    eos_token_id=terminators,
	    do_sample=True,
	    temperature=0.6,
	    top_p=0.9,
	    repetition_penalty=1.1,
	)

	print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
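	# Example output (Korean):
	# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
	# English gloss: the MLP lab at Seoultech works on multimodal natural language processing; its members are Professor KyungTae Lim and nine students.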
	```
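The sampling arguments used above (`temperature=0.6`, `top_p=0.9`) can be illustrated on a toy distribution. The sketch below is a plain-Python approximation of temperature scaling followed by nucleus (top-p) filtering. It is not the transformers implementation, and the logits and the `top_p_filter` name are made up for illustration.

```python
import math


def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
kept = top_p_filter(toy_logits)
print(kept)
```

With these toy logits, two candidate tokens survive at `temperature=0.6`, while three survive at `temperature=1.0`: a lower temperature concentrates probability on the top tokens, so the nucleus shrinks.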



	## Citation
	Language Model
	```text
	@misc{bllossom,
	  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
	  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
	  year = {2024},
	  journal = {LREC-COLING 2024},
	  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
	}
	```

	Vision-Language Model
	```text
	@misc{bllossom,
	  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
	  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
	  year = {2024},
	  publisher = {GitHub},
	  journal = {NAACL 2024 findings},
	  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
	}
	```

	## Contact
	- 임경태(KyungTae Lim), Professor at Seoultech. `ktlim@seoultech.ac.kr`
	- 함영균(Younggyun Hahm), CEO of Teddysum. `hahmyg@teddysum.ai`