Dongjin-kr committed on
Commit
ea5de0e
·
1 Parent(s): 6290f79

Updated ReadMe

Files changed (1)
  1. README.md +113 -0
README.md CHANGED
@@ -1,3 +1,116 @@
1
  ---
2
  license: mit
3
+ language:
4
+ - ko
5
+ - en
6
+ pipeline_tag: text-classification
7
  ---
8
+ # Model Card for Model ID
9
+
10
+ <!-- Provide a quick summary of what the model is/does. -->
11
+
12
+ This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
13
+
14
+ <h1 align="center">Korean Reranker on AWS</h1>
15
+ <p align="center">
16
+ <a href="https://github.com/aws-samples">
17
+ <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
18
+ </a>
19
+ <a href="https://github.com/aws-samples/aws-ai-ml-workshop-kr/blob/master/LICENSE">
20
+ <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
21
+ </a>
22
+ <!-- <a href="https://huggingface.co/C-MTEB">
23
+ <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
24
+ </a> -->
25
+ <a href="https://github.com/aws-samples/aws-ai-ml-workshop-kr/tree/master/genai/aws-gen-ai-kr/30_fine_tune/reranker-kr">
26
+ <img alt="Build" src="https://img.shields.io/badge/KoReranker-1.0-red">
27
+ </a>
28
+ </p>
29
+
30
+ ### A fine-tuning guide for developing a **Korean reranker**.
31
+ ko-reranker is a model fine-tuned on Korean data, based on [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large).
32
+
33
+ - - -
34
+
35
+ ## 0. Usage
36
+ - #### <span style="#FF69B4;"> Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.</span>
37
+ - #### <span style="#FF69B4;"> Feeding a question and a passage to the reranker yields a relevance score.</span>
38
+ - #### <span style="#FF69B4;"> The reranker is optimized with a cross-entropy loss, so its relevance scores are not confined to a specific range.</span>
39
+
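The usage notes above can be sketched as a small reranking loop. This is a minimal illustration, not the repository's code: `score_fn` stands for any callable that returns a raw relevance score for a (question, passage) pair, and `toy_score` is a hypothetical word-overlap stand-in for the model so that the snippet runs without downloading weights. Because raw scores are unbounded, the sketch optionally squashes them with a sigmoid; that step is an assumption about how one might want to present scores and it does not change the ordering.

```python
import math
from typing import Callable, List, Tuple


def sigmoid(x: float) -> float:
    """Map an unbounded raw score into (0, 1) in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rerank(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort by descending relevance."""
    scored = [(passage, sigmoid(score_fn(query, passage))) for passage in passages]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for the reranker: shared-word count, shifted by -1."""
    shared = set(query.lower().split()) & set(passage.lower().split())
    return float(len(shared)) - 1.0


if __name__ == "__main__":
    ranked = rerank(
        "what is a reranker",
        ["bananas are yellow", "a reranker scores query passage pairs"],
        toy_score,
    )
    for passage, score in ranked:
        print(f"{score:.3f}  {passage}")
```

In practice `toy_score` would be replaced by a call into the fine-tuned model; the surrounding loop stays the same.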
40
+ ## 1. Background
41
+ - #### <span style="#FF69B4;"> **Context order affects accuracy** ([Lost in the Middle, *Liu et al., 2023*](https://arxiv.org/pdf/2307.03172.pdf)) </span>
42
+
43
+ - #### <span style="#FF69B4;"> [Why use a reranker](https://www.pinecone.io/learn/series/rag/rerankers/)</span>
44
+     - With current LLMs, adding more context does not by itself help; the relevant passages must rank near the top for the model to answer well.
45
+     - The similarity (relevance) score used in semantic search is imprecise. (That is, does a higher-ranked result always contain information closer to the question than a lower-ranked one?)
46
+         * Embeddings are specialized in capturing the meaning behind a document.
47
+         * A question and its answer are not semantically the same. ([Hypothetical Document Embeddings](https://medium.com/prompt-engineering/hyde-revolutionising-search-with-hypothetical-document-embeddings-3474df795af8))
48
+         * Using ANNs ([Approximate Nearest Neighbors](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6)) carries an accuracy penalty.
49
+
50
+ - - -
51
+
52
+ ## 2. Reranker models
53
+
54
+ - #### <span style="#FF69B4;"> [Cohere] [Reranker](https://txt.cohere.com/rerank/)</span>
55
+ - #### <span style="#FF69B4;"> [BAAI] [bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)</span>
56
+ - #### <span style="#FF69B4;"> [BAAI] [bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)</span>
57
+
58
+ - - -
59
+
60
+ ## 3. Dataset
61
+
62
+ - #### <span style="#FF69B4;"> [msmarco-triplets](https://github.com/microsoft/MSMARCO-Passage-Ranking) </span>
63
+     - (Question, Answer, Negative)-Triplets from MS MARCO Passages dataset, 499,184 samples
64
+     - The dataset is in English.
65
+     - It was translated with Amazon Translate before use.
66
+
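To make the triplet structure above concrete, here is a small sketch that turns one (Question, Answer, Negative) triplet into a JSON Lines training record. The `query` / `pos` / `neg` key names follow the layout used by FlagEmbedding-style fine-tuning data as far as I know; treat them, and the `Triplet` and `to_jsonl` helper names, as assumptions rather than this repository's actual preprocessing code.

```python
import json
from typing import Dict, List, NamedTuple


class Triplet(NamedTuple):
    question: str
    answer: str    # passage that answers the question (positive)
    negative: str  # passage that does not answer the question


def to_training_record(triplet: Triplet) -> Dict[str, object]:
    """Wrap one triplet as a record with one positive and one negative passage."""
    # Key names are an assumed FlagEmbedding-style layout, not confirmed by this README.
    return {
        "query": triplet.question,
        "pos": [triplet.answer],
        "neg": [triplet.negative],
    }


def to_jsonl(triplets: List[Triplet]) -> str:
    """Serialize triplets as JSON Lines, keeping Korean text human-readable."""
    return "\n".join(
        json.dumps(to_training_record(t), ensure_ascii=False) for t in triplets
    )
```

`ensure_ascii=False` keeps translated Korean passages readable in the output file instead of escaping them as `\uXXXX` sequences.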
67
+ - - -
68
+
69
+ ## 4. Performance
70
+ | Model | has-right-in-contexts | mrr (mean reciprocal rank) |
71
+ |:---------------------------|:-----------------:|:--------------------------:|
72
+ | without-reranker (default)| 0.93 | 0.80 |
73
+ | with-reranker (bge-reranker-large)| 0.95 | 0.84 |
74
+ | **with-reranker (fine-tuned using korean)** | **0.96** | **0.87** |
75
+
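The two columns in the table can be computed roughly as follows. This sketch assumes that `has-right-in-contexts` means the fraction of questions whose correct passage appears anywhere in the returned contexts (a hit rate); the README does not define it, so that reading and the function names are assumptions. MRR is the standard mean of `1 / rank` of the correct passage, counting 0 when it is missing.

```python
from typing import List, Sequence


def reciprocal_rank(ranked_ids: Sequence[str], correct_id: str) -> float:
    """Return 1 / rank of the correct passage, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == correct_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: List[Sequence[str]], correct_ids: List[str]
) -> float:
    """Average the reciprocal rank over all questions."""
    if len(rankings) != len(correct_ids) or not rankings:
        raise ValueError("need one non-empty ranking per correct id")
    total = sum(reciprocal_rank(r, c) for r, c in zip(rankings, correct_ids))
    return total / len(rankings)


def hit_rate(rankings: List[Sequence[str]], correct_ids: List[str]) -> float:
    """Fraction of questions whose correct passage appears in the contexts."""
    if len(rankings) != len(correct_ids) or not rankings:
        raise ValueError("need one non-empty ranking per correct id")
    hits = sum(1 for r, c in zip(rankings, correct_ids) if c in r)
    return hits / len(rankings)
```

Reranking only reorders the retrieved passages, so under this reading it can raise MRR directly, while the hit rate changes only if the reranker also decides which passages make the final cut.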
76
+ - **evaluation set**:
77
+ ```code
78
+ ./dataset/evaluation/eval_dataset.csv
79
+ ```
80
+ - **training parameters**:
81
+
82
+ ```json
83
+ {
84
+   "learning_rate": 5e-6,
85
+   "fp16": true,
86
+   "num_train_epochs": 3,
87
+   "per_device_train_batch_size": 1,
88
+   "gradient_accumulation_steps": 32,
89
+   "train_group_size": 3,
90
+   "max_len": 512,
91
+   "weight_decay": 0.01
92
+ }
93
+ ```
94
+
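A short sketch of how these parameters interact, under two assumptions that this README does not spell out: that `train_group_size` counts one positive passage plus the negatives for each question (as in FlagEmbedding-style reranker fine-tuning), and that the loss is a softmax cross-entropy over the scores of that group. The function name `group_cross_entropy` is hypothetical.

```python
import math
from typing import Sequence


def group_cross_entropy(logits: Sequence[float], positive_index: int = 0) -> float:
    """Softmax cross-entropy over one question's passage group.

    logits holds the raw reranker scores for the passages in the group;
    positive_index marks which of them is the correct passage.
    """
    peak = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[positive_index]


params = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 32,
    "train_group_size": 3,
}

# Questions contributing to one optimizer step on a single device.
queries_per_step = (
    params["per_device_train_batch_size"] * params["gradient_accumulation_steps"]
)

# Assumed split of each group: one positive, the rest negatives.
negatives_per_query = params["train_group_size"] - 1

if __name__ == "__main__":
    print(queries_per_step, negatives_per_query)
    print(group_cross_entropy([2.0, -1.0, -3.0]))
```

Because the loss only compares scores within a group, nothing pins the raw scores to a fixed interval, which is consistent with the unbounded relevance scores mentioned in the usage notes.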
95
+ - - -
96
+
97
+ ## 5. Acknowledgement
98
+ - <span style="#FF69B4;"> Part of the code is developed based on [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/tree/master?tab=readme-ov-file) and [KoSimCSE-SageMaker](https://github.com/daekeun-ml/KoSimCSE-SageMaker/tree/7de6eefef8f1a646c664d0888319d17480a3ebe5).</span>
99
+
100
+ - - -
101
+
102
+ ## 6. Citation
103
+ - <span style="#FF69B4;"> If you find this repository useful, please consider giving it a star ⭐ and citing it.</span>
104
+
105
+ - - -
106
+
107
+ ## 7. Contributors:
108
+ - <span style="#FF69B4;"> **Dongjin Jang, Ph.D.** (AWS AI/ML Specialist Solutions Architect) | [Mail](mailto:dongjinj@amazon.com) | [Linkedin](https://www.linkedin.com/in/dongjin-jang-kr/) | [Git](https://github.com/dongjin-ml) | </span>
109
+
110
+ - - -
111
+
112
+ ## 8. License
113
+ - <span style="#FF69B4;"> FlagEmbedding is licensed under the [MIT License](https://github.com/aws-samples/aws-ai-ml-workshop-kr/blob/master/LICENSE). </span>
114
+
115
+
116
+