---
library_name: transformers
license: cc-by-nc-4.0
datasets:
  - kyujinpy/KOR-OpenOrca-Platypus-v3
language:
  - ko
  - en
tags:
  - Economic
  - Finance
---

## Model Details

Model Developers: Sogang University SGEconFinlab (<https://sc.sogang.ac.kr/aifinlab/>)

### Model Description

This model is a language model specialized in economics and finance, trained on a variety of economics- and finance-related data. The data sources are listed below; we are not releasing the training data itself because it was collected for research and policy purposes. If you wish to use the original data, please contact the original authors directly for permission.

### Loading the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_model_id = "SGEcon/KoSOLAR-10.7B-v0.2_fin_v4"
config = PeftConfig.from_pretrained(peft_model_id)

# Quantize the base model to 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map={"": 0},
)
# Attach the fine-tuned LoRA adapter
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model.eval()
```

### Conducting Conversation

```python
import re

def gen(x):
    inputs = tokenizer(f"### 질문: {x}\n\n### 답변:", return_tensors='pt', return_token_type_ids=False)

    # Move the inputs to the GPU if one is available
    inputs = {k: v.to(device="cuda" if torch.cuda.is_available() else "cpu") for k, v in inputs.items()}

    gened = model.generate(
        **inputs,
        max_new_tokens=256,  # maximum number of newly generated tokens
        num_return_sequences=1,  # generate a single answer
        do_sample=True,  # enable sampling for varied answers
        eos_token_id=tokenizer.eos_token_id,  # stop at the EOS token
        temperature=0.9,  # temperature setting to control diversity
        top_p=0.8,  # p value for nucleus sampling
        top_k=50  # k value for top-k sampling
    )

    # Decode the generated sequence into output text
    decoded = tokenizer.decode(gened[0], skip_special_tokens=True).strip()

    # Keep only the text after the "### 답변:" marker
    answer_start_idx = decoded.find("### 답변:") + len("### 답변:")
    complete_answer = decoded[answer_start_idx:].strip()

    # Cut at the last sentence-ending punctuation (. ? !), dropping any
    # trailing incomplete sentence
    match = re.search(r"[\.\?\!][^\.\?\!]*$", complete_answer)
    if match:
        complete_answer = complete_answer[:match.start() + 1].strip()

    return complete_answer
```
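For example, using the question from the Example section below:

```python
# "Can you explain the role of a central bank?"
print(gen("중앙은행의 역할에 대해서 설명해줄래?"))
```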

## Training Details

We used QLoRA to train the base model. Quantized Low-Rank Adapters (QLoRA) is an efficient fine-tuning technique that uses a 4-bit quantized pre-trained language model to fine-tune models of up to 65 billion parameters on a single 48 GB GPU while significantly reducing memory usage. The method combines NormalFloat 4-bit (NF4), a data type that is information-theoretically optimal for normally distributed weights; double quantization, which further quantizes the quantization constants to reduce average memory usage; and paged optimizers, which manage memory spikes during mini-batch processing, to increase memory efficiency without sacrificing performance.
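As a rough sketch of what this setup looks like in code (the actual training script is not released; the adapter id below is taken from the loading section, and everything else is an assumption based on the standard QLoRA recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftConfig, prepare_model_for_kbit_training

# Resolve the base model id from the published adapter
config = PeftConfig.from_pretrained("SGEcon/KoSOLAR-10.7B-v0.2_fin_v4")

# NF4 quantization with double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat 4-bit data type
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map={"": 0},
)
# Prepare the quantized model for training (casts norm/embedding layers,
# enables gradient checkpointing by default)
base_model = prepare_model_for_kbit_training(base_model)
```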

We also performed instruction tuning using the data we collected together with the kyujinpy/KOR-OpenOrca-Platypus-v3 dataset from the Hugging Face Hub. Instruction tuning is supervised fine-tuning in which each training example pairs an instruction (together with any input data) with the desired output.
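For illustration, one instruction/answer pair could be rendered into the same question/answer template that the inference code above expects (the exact training template is an assumption, since the training data is not released):

```python
def format_example(instruction: str, answer: str) -> str:
    # Hypothetical formatter: pair an instruction with its target answer
    # using the "### 질문: / ### 답변:" template from the inference code.
    return f"### 질문: {instruction}\n\n### 답변: {answer}"
```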

### Training Data

  1. Bank of Korea: 700 Economic and Financial Terms (https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765)
  2. Financial Supervisory Service: FINE financial consumer information portal, financial glossary (https://fine.fss.or.kr/fine/fnctip/fncDicary/list.do?menuNo=900021)
  3. KDI Economic Information Center: Current affairs glossary (https://eiec.kdi.re.kr/material/wordDic.do)
  4. The Korea Economic Daily / Hankyung.com: Hankyung dictionary of economic terms (https://terms.naver.com/list.naver?cid=42107&categoryId=42107), Today's TESAT (https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=1), Today's Junior TESAT (https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=5), Saenggeul Saenggeul Hankyung (https://sgsg.hankyung.com/tesat/study)
  5. Ministry of SMEs and Startups / Government of the Republic of Korea: Ministry of SMEs and Startups terminology (https://terms.naver.com/list.naver?cid=42103&categoryId=42103)
  6. Go Seong-sam / Beommun Publishing: Dictionary of accounting and tax terms (https://terms.naver.com/list.naver?cid=51737&categoryId=51737)
  7. Word index of Mankiw's Principles of Economics, 8th edition
  8. kyujinpy/KOR-OpenOrca-Platypus-v3 (https://huggingface.co/datasets/kyujinpy/KOR-OpenOrca-Platypus-v3)

The copyright of the data used belongs to the original authors; please contact the original authors for permission before using it.

### Training Hyperparameters

| Hyperparameter | SGEcon/KoSOLAR-10.7B-v0.2_fin_v4 |
| --- | --- |
| LoRA method | LoRA |
| load in 4 bit | True |
| learning rate | 1e-5 |
| lr scheduler | linear |
| LoRA alpha | 16 |
| LoRA rank | 16 |
| LoRA dropout | 0.05 |
| optim | paged_adamw_32bit |
| target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head |
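Expressed in code, the table corresponds to roughly the following PEFT configuration (a sketch continuing from the quantized `base_model` prepared in the Training Details section; `output_dir` and anything not in the table are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=16,   # LoRA alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

training_args = TrainingArguments(
    output_dir="kosolar-fin-v4",  # assumed; not given in the table
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    optim="paged_adamw_32bit",    # paged optimizer from the QLoRA recipe
)
```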

## Example

중앙은행의 역할에 대해서 설명해줄래? (Can you explain the role of a central bank?)

A central bank is an institution that holds the authority to issue currency and to control the financial system. It formulates the nation's monetary, foreign-exchange, and financial policies, while also supervising financial institutions such as commercial banks. The central bank is a lender to the government and to commercial banks; commercial banks borrow funds from, or deposit funds with, the central bank. To carry out monetary and credit policy, the central bank lends funds to and takes deposits from financial institutions. Alongside its role as a lender to commercial banks, it also supervises them. When a commercial bank lends funds, rather than paying the loan out directly, it takes part or all of the loan back as a deposit and in turn lends to and deposits with the central bank; by raising the interest rate on deposits, it induces depositors to place deposits with the central bank. Meanwhile, when a commercial bank makes a loan, the lending bank pays the loan amount to the borrowing bank instead of depositing it.