bodam committed on
Commit 850d746 · verified · 1 Parent(s): e6110a6

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -92,7 +92,7 @@ outputs = model.generate(input_ids=input_ids, max_length=500, num_return_sequenc
  decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(decoded_output) # print the result
 
- Markdown(decoded_output.split("엄마:")[1])
+ Markdown(decoded_output.split("AI:")[1])
  ```
 
  ```python
@@ -102,11 +102,13 @@ Markdown(decoded_output.split("엄마:")[1])
  ### Training Details
 
  #### Training Data
- Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
- To implement the persona of the service, the speaker's gender and age were specified during the data preprocessing phase. In the "Korean SNS Multi-turn Conversation Data," words like "레게노," which are used primarily on social media and rarely in actual spoken language, were removed.
+ Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
+ The data we used is as follows.
+ - https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=543#:~:text=%EC%86%8C%EA%B0%9C.%20%EC%8B%9D%EC%9D%8C%EB%A3%8C,%20%EC%A3%BC%EA%B1%B0%EC%99%80%20%EC%83%9D%ED%99%9C,%20%EA%B5%90%ED%86%B5,%20%EA%B5%90%EC%9C%A1,%20%EA%B0%80%EC%A1%B1%20%EB%93%B1%2020%EC%97%AC%EA%B0%9C%20%EC%A3%BC%EC%A0%9C%EC%97%90%20%EB%8C%80%ED%95%9C%20%EC%9E%90%EC%9C%A0%EB%A1%9C%EC%9A%B4%20%EC%9D%BC%EC%83%81%EB%8C%80%ED%99%94,%EB%82%98%ED%83%80%EB%82%98%EB%8A%94%20%EB%AC%B8%EC%9E%A5
+ - https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71694
 
  #### Training Procedure
- - **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data.
+ - **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data. To implement the persona of the service, the speaker's gender and age were specified during the data preprocessing phase. In the "Korean SNS Multi-turn Conversation Data," words like "레게노," which are used primarily on social media and rarely in actual spoken language, were removed.
  - **Model Fine-tuning**: Conducted on the cleaned dataset to tailor the model's responses to children's linguistic needs.
  - **Reinforcement Learning**: Implemented to refine the flow and appropriateness of conversations.
 
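The preprocessing step described in this commit (removing SNS-only words and specifying the speaker's gender and age) could look roughly like the sketch below. This is an illustration, not the authors' pipeline: the word list, the function names `remove_sns_slang` and `tag_speaker`, and the output field names are all assumptions, since the README gives only "레게노" as an example and does not describe the record format.

```python
import re

# Hypothetical word list: the README names only "레게노" as an example.
SNS_ONLY_WORDS = ["레게노"]


def remove_sns_slang(text, words=SNS_ONLY_WORDS):
    """Delete the listed words and collapse the whitespace they leave behind."""
    if not words:
        return text.strip()
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()


def tag_speaker(text, gender, age):
    """Attach persona metadata to a cleaned utterance (field names are illustrative)."""
    return {"gender": gender, "age": age, "text": remove_sns_slang(text)}


if __name__ == "__main__":
    print(tag_speaker("이거 완전 레게노 아니야", gender="female", age=30))
```

Plain substring removal is the simplest option here. A real pipeline might instead drop or flag whole utterances that contain such words, which the README does not specify.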