SudaGom/SudaGom-ko-gemma2-9b-it



bodam committed on Oct 3, 2024

Commit 850d746 · verified · 1 Parent(s): e6110a6

Update README.md

Files changed (1)

README.md +6 -4


@@ -92,7 +92,7 @@ outputs = model.generate(input_ids=input_ids, max_length=500, num_return_sequenc
 decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(decoded_output)  # print the result
-Markdown(decoded_output.split("엄마:")[1])
 ```
 ```python
@@ -102,11 +102,13 @@ Markdown(decoded_output.split("엄마:")[1])
 ### Training Details
 #### Training Data
-Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
-To implement the persona of the service, the speaker's gender and age were specified during the data preprocessing phase. In the "Korean SNS Multi-turn Conversation Data," words like "레게노," which are used primarily on social media and rarely in actual spoken language, were removed.
 #### Training Procedure
-- **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data.
 - **Model Fine-tuning**: Conducted on the cleaned dataset to tailor the model's responses to children's linguistic needs.
 - **Reinforcement Learning**: Implemented to refine the flow and appropriateness of conversations.

 decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(decoded_output)  # print the result
+Markdown(decoded_output.split("AI:")[1])
 ```
 ```python
 ### Training Details
 #### Training Data
+Data includes children's conversation datasets, anonymized and classified by developmental stages, ensuring a diverse and representative sample.
+The data we used is as follows.
+- https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=543#:~:text=%EC%86%8C%EA%B0%9C.%20%EC%8B%9D%EC%9D%8C%EB%A3%8C,%20%EC%A3%BC%EA%B1%B0%EC%99%80%20%EC%83%9D%ED%99%9C,%20%EA%B5%90%ED%86%B5,%20%EA%B5%90%EC%9C%A1,%20%EA%B0%80%EC%A1%B1%20%EB%93%B1%2020%EC%97%AC%EA%B0%9C%20%EC%A3%BC%EC%A0%9C%EC%97%90%20%EB%8C%80%ED%95%9C%20%EC%9E%90%EC%9C%A0%EB%A1%9C%EC%9A%B4%20%EC%9D%BC%EC%83%81%EB%8C%80%ED%99%94,%EB%82%98%ED%83%80%EB%82%98%EB%8A%94%20%EB%AC%B8%EC%9E%A5
+- https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71694
 #### Training Procedure
+- **Preprocessing**: Text data was cleaned and formatted to remove any inappropriate content and personal data. To implement the persona of the service, the speaker's gender and age were specified during the data preprocessing phase. In the "Korean SNS Multi-turn Conversation Data," words like "레게노," which are used primarily on social media and rarely in actual spoken language, were removed.
 - **Model Fine-tuning**: Conducted on the cleaned dataset to tailor the model's responses to children's linguistic needs.
 - **Reinforcement Learning**: Implemented to refine the flow and appropriateness of conversations.
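
The preprocessing bullet in the updated README describes two steps: attaching the speaker's gender and age to each utterance, and removing social-media-only words such as "레게노". A minimal sketch of how that might look is shown below. It is not the authors' pipeline: the function names, the record layout, and the slang list (beyond the one word the README names) are all assumptions made for illustration.

```python
import re

# Hypothetical slang list. Only "레게노" is named in the README;
# a real list would have to come from the project's own data review.
SNS_SLANG = {"레게노"}


def remove_slang(utterance: str, slang=SNS_SLANG) -> str:
    """Delete each slang term, then collapse leftover whitespace."""
    for term in slang:
        utterance = utterance.replace(term, "")
    return re.sub(r"\s+", " ", utterance).strip()


def preprocess_turn(text: str, gender: str, age: int) -> dict:
    """Build one training record with speaker metadata attached.

    The dict layout is an assumption, not the format used by the authors.
    """
    return {"gender": gender, "age": age, "text": remove_slang(text)}


record = preprocess_turn("이거 완전 레게노 아니야", gender="F", age=7)
print(record)
```

Plain substring removal is the simplest option, but it can leave stray particles behind when a slang word carries a suffix, so a production pipeline would likely need morphological analysis or utterance-level filtering instead.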