---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- kyujinpy/KOR-OpenOrca-Platypus-v3
language:
- ko
- en
tags:
- Economic
- Finance
---


# Model Details
Model Developers: Sogang University SGEconFinlab(<https://sc.sogang.ac.kr/aifinlab/>)


### Model Description

This model is a language model specialized in economics and finance, trained on a variety of economics- and finance-related data.
The data sources are listed below, but we are not releasing the training data itself because it was collected for research and policy purposes.
If you wish to use the original data, please contact the original authors directly for permission.

- **Developed by:** Sogang University SGEconFinlab(<https://sc.sogang.ac.kr/aifinlab/>)
- **License:** cc-by-nc-4.0
- **Base Model:** yanolja/KoSOLAR-10.7B-v0.2(<https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.2>)


## Loading the Model

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftConfig, PeftModel

    peft_model_id = "SGEcon/KoSOLAR-10.7B-v0.2_fin_v4"
    config = PeftConfig.from_pretrained(peft_model_id)

    # 4-bit NF4 quantization so the 10.7B base model fits in limited GPU memory
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    # Load the quantized base model, then attach the fine-tuned LoRA adapter
    model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, quantization_config=bnb_config, device_map={"": 0})
    model = PeftModel.from_pretrained(model, peft_model_id)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    model.eval()

## Conducting a Conversation

    import re

    def gen(x):
        inputs = tokenizer(f"### 질문: {x}\n\n### 답변:", return_tensors='pt', return_token_type_ids=False)

        # Move the inputs to the GPU if one is available
        inputs = {k: v.to(device="cuda" if torch.cuda.is_available() else "cpu") for k, v in inputs.items()}

        gened = model.generate(
            **inputs,
            max_new_tokens=256,      # maximum number of newly generated tokens
            early_stopping=True,
            num_return_sequences=1,  # generate a single answer
            do_sample=True,          # enable sampling for more varied answers
            eos_token_id=tokenizer.eos_token_id,  # stop at the EOS token
            temperature=0.9,         # temperature controlling generation diversity
            top_p=0.8,               # p value for nucleus sampling
            top_k=50                 # k value for top-k sampling
        )

        # Decode the generated sequence into output text
        decoded = tokenizer.decode(gened[0], skip_special_tokens=True).strip()

        # Keep only the text after the "### 답변:" marker
        answer_start_idx = decoded.find("### 답변:") + len("### 답변:")
        complete_answer = decoded[answer_start_idx:].strip()

        # Trim any unfinished trailing text after the last sentence-ending punctuation (. ? !)
        match = re.search(r"[\.\?\!][^\.\?\!]*$", complete_answer)
        if match:
            complete_answer = complete_answer[:match.start() + 1].strip()

        return complete_answer
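
Assuming `model` and `tokenizer` from the loading step above are in scope, the function can then be called directly; the question below is only an illustrative prompt:

    # Example call (assumes the loading step above has been run)
    question = "기준금리를 인상하면 물가에는 어떤 영향이 있나요?"  # "How does raising the policy rate affect prices?"
    print(gen(question))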



    
## Training Details

We used QLoRA to fine-tune the base model.
Quantized Low-Rank Adaptation (QLoRA) is an efficient fine-tuning technique that attaches trainable low-rank adapters to a 4-bit quantized pretrained language model, making it possible to fine-tune models with up to 65 billion parameters on a single 48 GB GPU while significantly reducing memory usage.
The method combines NormalFloat 4-bit (NF4), a data type that is information-theoretically optimal for normally distributed weights; Double Quantization, which quantizes the quantization constants themselves to further reduce average memory usage; and Paged Optimizers, which manage memory spikes during mini-batch processing, improving memory efficiency without sacrificing performance.
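
As a rough back-of-the-envelope illustration (weights only, ignoring activations, the KV cache, and quantization constants), 4-bit storage cuts the memory for the 10.7B-parameter base model to about a quarter of its bfloat16 footprint:

    # Rough weight-memory estimate for a 10.7B-parameter model (illustrative only)
    params = 10.7e9
    bf16_gb = params * 2 / 1e9   # 2 bytes per weight in bfloat16 -> ~21.4 GB
    nf4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight in 4-bit  -> ~5.35 GB
    print(f"bf16: {bf16_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")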

We also performed instruction tuning using the data we collected together with the kyujinpy/KOR-OpenOrca-Platypus-v3 dataset on Hugging Face.
Instruction tuning is supervised fine-tuning in which each training example pairs an instruction (and any accompanying input) with the desired output.
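
As a minimal sketch of what such a pair can look like, the snippet below serializes one instruction-output example into the same `### 질문: ... ### 답변: ...` format used by `gen` above; the actual training script and template are not released, so this is an assumption for illustration only.

    # Illustrative only: the exact training template is not released.
    # This mirrors the inference prompt used in gen() above.
    def format_example(instruction: str, output: str) -> str:
        return f"### 질문: {instruction}\n\n### 답변: {output}"

    print(format_example(
        "인플레이션이란 무엇인가요?",                        # "What is inflation?"
        "인플레이션은 물가가 지속적으로 상승하는 현상입니다."  # "Inflation is a sustained rise in prices."
    ))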



 
### Training Data

1. Bank of Korea: 700 Selected Economic and Financial Terms (<https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765>)
2. Financial Supervisory Service: FINE financial consumer information portal, financial glossary (<https://fine.fss.or.kr/fine/fnctip/fncDicary/list.do?menuNo=900021>)
3. KDI Economic Information Center: Current Affairs Glossary (<https://eiec.kdi.re.kr/material/wordDic.do>)
4. The Korea Economic Daily / Hankyung.com: Hankyung Dictionary of Economic Terms (<https://terms.naver.com/list.naver?cid=42107&categoryId=42107>), Today's TESAT (<https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=1>), Today's Junior TESAT (<https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=5>), Saenggeul Saenggeul Hankyung (<https://sgsg.hankyung.com/tesat/study>)
5. Ministry of SMEs and Startups / Government of the Republic of Korea: Ministry of SMEs and Startups specialized terminology (<https://terms.naver.com/list.naver?cid=42103&categoryId=42103>)
6. Go Seong-sam / Beopmun Publishing: Dictionary of Accounting and Tax Terms (<https://terms.naver.com/list.naver?cid=51737&categoryId=51737>)
7. Word index of Mankiw's Principles of Economics, 8th edition
8. kyujinpy/KOR-OpenOrca-Platypus-v3 (<https://huggingface.co/datasets/kyujinpy/KOR-OpenOrca-Platypus-v3>)


The copyright of the data belongs to the original authors, so please contact them before using it.




### Training Hyperparameters

|Hyperparameter|SGEcon/KoSOLAR-10.7B-v0.2_fin_v4|
|------|---|
|LoRA method|LoRA|
|load in 4 bit|True|
|learning rate|1e-5|
|lr scheduler|linear|
|lora alpha|16|
|lora rank|16|
|lora dropout|0.05|
|optim|paged_adamw_32bit|
|target_modules|q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head|
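
As a hedged sketch only (the actual training script is not released), the settings above map roughly onto the following `peft`/`transformers` configuration; `output_dir` and anything else not listed in the table are placeholders:

    from peft import LoraConfig
    from transformers import TrainingArguments

    # LoRA adapter settings matching the table above (sketch, not the released script)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj", "lm_head"],
        task_type="CAUSAL_LM",
    )

    # Optimizer and scheduler settings from the table (output_dir is a placeholder)
    training_args = TrainingArguments(
        output_dir="./kosolar-fin-qlora",
        learning_rate=1e-5,
        lr_scheduler_type="linear",
        optim="paged_adamw_32bit",
    )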

   

### Example

> Can you explain the role of the central bank?

>> The central bank is an institution that holds the authority to issue currency and to control the financial system. It formulates the nation's monetary, foreign exchange, and financial policies while also supervising and overseeing financial institutions such as commercial banks. The central bank is a lender of funds to the government and to commercial banks; commercial banks borrow funds from or deposit funds with the central bank. To carry out monetary and credit policy, the central bank lends funds to, and takes deposits from, financial institutions. Alongside its role as a lender of funds to commercial banks, it also supervises and oversees them. When a commercial bank lends funds, rather than paying the loan out to the commercial bank in the first instance, it receives part or all of the loan back as a deposit at the lending bank and lends and deposits that money with the central bank. By raising the interest rate on deposits, it induces depositors to place their deposits with the central bank. Meanwhile, when a commercial bank makes a loan, the lending bank pays the loan amount to the borrowing bank instead of depositing it.