---
license: mit
language:
- th
base_model:
- BAAI/bge-m3
---

# Model Card for jaeyong2/bge-m3-Thai

A Thai text-embedding model fine-tuned from BAAI/bge-m3 on the jaeyong2/Thai-emb-PreView dataset.

## Model Details

- Base model: BAAI/bge-m3
- Language: Thai (th)
- License: MIT

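A minimal usage sketch is shown below. It assumes the checkpoint loads through FlagEmbedding's `BGEM3FlagModel`, the same dict-style `encode()` interface the evaluation code further down relies on; the repo id `jaeyong2/bge-m3-Thai` is taken from the Accuracy section of this card.

```
from FlagEmbedding import BGEM3FlagModel

# Assumption: the published checkpoint loads as a standard BGE-M3 model.
model = BGEM3FlagModel("jaeyong2/bge-m3-Thai", use_fp16=True)

sentences = [
    "ปัญญาประดิษฐ์คืออะไร",  # "What is artificial intelligence?"
    "ปัญญาประดิษฐ์คือการจำลองความฉลาดของมนุษย์ด้วยคอมพิวเตอร์",
]

# encode() returns a dict; 'dense_vecs' holds the dense embeddings.
embeddings = model.encode(sentences)["dense_vecs"]
print(embeddings.shape)
```
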
## Training

- Hardware: Colab A100 40GB
- Data: jaeyong2/Thai-emb-PreView (70,729 steps)

```
!torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --output_dir "/content/drive/My Drive/bge_thai" \
    --model_name_or_path BAAI/bge-m3 \
    --train_data ./train.jsonl \
    --learning_rate 1e-5 \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --temperature 0.02 \
    --query_max_len 2048 \
    --passage_max_len 512 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
```
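
For reference, FlagEmbedding's embedder fine-tuning expects `train.jsonl` to contain one JSON object per line with `query`, `pos`, and `neg` fields. The sketch below builds such a file from the dataset; the split name and the field mapping (`context` as query, `Title` as positive, `Fake Title` as negative) are assumptions inferred from the evaluation code, not a verified preprocessing script.

```
import json
import datasets

# Assumption: a "train" split exists and maps context -> query,
# Title -> positive, Fake Title -> negative, mirroring the evaluation code.
dataset = datasets.load_dataset("jaeyong2/Thai-emb-PreView", split="train")

with open("train.jsonl", "w", encoding="utf-8") as f:
    for item in dataset:
        record = {
            "query": item["context"],
            "pos": [item["Title"]],
            "neg": [item["Fake Title"]],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
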

## Evaluation

The model is evaluated on 1,000 examples from the test split by checking whether each query (`context`) embedding is closer, in cosine distance, to the true `Title` than to the `Fake Title`.

Code:
```
import torch
import numpy as np
from sklearn.metrics import pairwise_distances
from tqdm import tqdm
import datasets
from FlagEmbedding import BGEM3FlagModel

# The original snippet uses `fine_tuned_model` without defining it; loading it via
# FlagEmbedding's BGEM3FlagModel is an assumption based on the dict-style encode() output.
fine_tuned_model = BGEM3FlagModel("jaeyong2/bge-m3-Thai", use_fp16=True)


def get_embedding(text, model):
    with torch.no_grad():
        embedding = model.encode(text)['dense_vecs']
    return embedding


dataset = datasets.load_dataset("jaeyong2/Thai-emb-PreView")
validation_dataset = dataset["test"].select(range(1000))


def evaluate(validation_dataset):
    correct_count = 0

    for item in tqdm(validation_dataset):
        query_embedding = get_embedding(item["context"], fine_tuned_model)
        document_embedding = get_embedding(item["Title"], fine_tuned_model)
        negative_embedding = get_embedding(item["Fake Title"], fine_tuned_model)

        # Cosine distance between the query and the positive / negative titles.
        positive_distances = pairwise_distances(query_embedding.reshape(1, -1), document_embedding.reshape(1, -1), metric="cosine")
        negative_distances = pairwise_distances(query_embedding.reshape(1, -1), negative_embedding.reshape(1, -1), metric="cosine")

        # Correct when the true title is closer to the query than the fake title.
        if positive_distances[0][0] < negative_distances[0][0]:
            correct_count += 1

    accuracy = correct_count / len(validation_dataset)
    return accuracy


results = evaluate(validation_dataset)
print(f"Validation Results: {results}")
```

Accuracy:
- BAAI/bge-m3 : 0.961
- jaeyong2/bge-m3-Thai : 0.997

### License
- BAAI/bge-m3 : https://choosealicense.com/licenses/mit/