Update README.md
README.md
CHANGED
@@ -92,7 +92,7 @@ outputs = model.generate(input_ids=input_ids, max_length=500, num_return_sequenc
 decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(decoded_output) # print the result
 
-Markdown(decoded_output.split("엄마:")[1])
+Markdown(decoded_output.split("AI:")[1])
 ```
 
 ```python
@@ -102,11 +102,13 @@ Markdown(decoded_output.split("엄마:")[1])
 ### Training Details
 
 #### Training Data
-Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
-
+Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
+The data we used is as follows.
+- https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=543#:~:text=%EC%86%8C%EA%B0%9C.%20%EC%8B%9D%EC%9D%8C%EB%A3%8C,%20%EC%A3%BC%EA%B1%B0%EC%99%80%20%EC%83%9D%ED%99%9C,%20%EA%B5%90%ED%86%B5,%20%EA%B5%90%EC%9C%A1,%20%EA%B0%80%EC%A1%B1%20%EB%93%B1%2020%EC%97%AC%EA%B0%9C%20%EC%A3%BC%EC%A0%9C%EC%97%90%20%EB%8C%80%ED%95%9C%20%EC%9E%90%EC%9C%A0%EB%A1%9C%EC%9A%B4%20%EC%9D%BC%EC%83%81%EB%8C%80%ED%99%94,%EB%82%98%ED%83%80%EB%82%98%EB%8A%94%20%EB%AC%B8%EC%9E%A5
+- https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71694
 
 #### Training Procedure
-- **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data.
+- **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data. To implement the persona of the service, the speaker's gender and age were specified during the data preprocessing phase. In the "Korean SNS Multi-turn Conversation Data," words like "레게노," which are used primarily on social media and rarely in actual spoken language, were removed.
 - **Model Fine-tuning**: Conducted on the cleaned dataset to tailor the model's responses to children's linguistic needs.
 - **Reinforcement Learning**: Implemented to refine the flow and appropriateness of conversations.
 
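The updated Preprocessing bullet describes two steps: removing SNS-only words and attaching the speaker's gender and age. The README does not show how either step was implemented, so the following is only a minimal sketch under stated assumptions. The term list `SNS_ONLY_TERMS` (seeded with the one example the README gives), the helper names, and the `[gender, age]` prefix format are all hypothetical and are not taken from the authors' pipeline.

```python
import re

# Hypothetical term list: the README names only "레게노" as an example.
SNS_ONLY_TERMS = ["레게노"]

# One pattern that matches any listed term literally.
_SNS_RE = re.compile("|".join(re.escape(term) for term in SNS_ONLY_TERMS))


def remove_sns_terms(text: str) -> str:
    """Delete listed SNS-only terms, then collapse leftover whitespace.

    Plain substring removal: particles attached to a removed term
    (for example a trailing "야") are left behind.
    """
    cleaned = _SNS_RE.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def tag_speaker(text: str, gender: str, age: int) -> str:
    """Prefix an utterance with speaker metadata (assumed format)."""
    return f"[{gender}, {age}] {text}"


def preprocess(utterance: str, gender: str, age: int) -> str:
    """Apply term removal first, then attach the speaker tag."""
    return tag_speaker(remove_sns_terms(utterance), gender, age)


print(preprocess("이거 완전 레게노 맛있어", "female", 7))
# → [female, 7] 이거 완전 맛있어
```

Removing terms before tagging keeps the metadata prefix out of the regex pass. A real pipeline for Korean would likely need morphological analysis rather than substring matching.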