tuan.ljn committed
Commit ef54d41
1 Parent(s): b865c35

Add: add README

Files changed (2)
  1. .ipynb_checkpoints/README-checkpoint.md +95 -0
  2. README.md +59 -54
.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,95 @@
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- sentence-embedding
license: apache-2.0
language:
- fr
metrics:
- pearsonr
- spearmanr
---

# [bilingual-embedding-base](https://huggingface.co/Lajavaness/bilingual-embedding-base)
bilingual-embedding is an embedding model for bilingual French-English text. It is a specialized sentence-embedding model trained for this bilingual setting, leveraging the robust capabilities of [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), a pre-trained multilingual language model. The model encodes English and French sentences into a 1024-dimensional vector space, facilitating a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of English and French sentences, reflecting both the lexical and contextual layers of the languages.
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
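The architecture dump above is terse, so here is a minimal sketch (not taken from the model's code; the function name and shapes are illustrative) of what the `Pooling` (mean over non-padding tokens) and `Normalize` modules compute:

```python
# Illustrative only: reproduces mean pooling + L2 normalization by hand with PyTorch.
import torch
import torch.nn.functional as F

def mean_pool_and_normalize(token_embeddings: torch.Tensor,
                            attention_mask: torch.Tensor) -> torch.Tensor:
    """token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()           # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)         # sum over real (non-pad) tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)              # number of real tokens per sentence
    sentence_embeddings = summed / counts                 # mean pooling
    return F.normalize(sentence_embeddings, p=2, dim=1)   # unit-length sentence vectors
```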
## Training and Fine-tuning process
### Stage 1: NLI Training
- Dataset: SNLI + XNLI (English and French)
- Method: Training with Multiple Negatives Ranking Loss (a minimal sketch follows this list). This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics.
### Stage 3: Continued Fine-tuning for Semantic Textual Similarity on STS Benchmark
- Dataset: STSB-fr and STSB-en
- Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the `sentence-transformers` library.
### Stage 4: Advanced Augmentation Fine-tuning
- Dataset: STSB-vn, augmented with [silver samples generated from gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
- Method: An advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair-sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by enriching the training data dynamically, enhancing the model's robustness and accuracy.
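The card does not include training scripts; the sketch below shows what a Stage 1 setup could look like with the classic `sentence-transformers` `fit()` API. The starting checkpoint, example pairs, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical Stage 1 sketch: Multiple Negatives Ranking Loss on entailment pairs.
# Data and hyperparameters are placeholders, not the real training setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("xlm-roberta-base")  # assumed starting checkpoint

# (premise, entailed hypothesis) pairs in English and French; the other pairs in a
# batch serve as in-batch negatives for the ranking loss.
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "Un homme joue de la guitare."]),
    InputExample(texts=["Two dogs run across a field.", "Deux chiens courent dans un champ."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```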
## Usage

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
pip install -q pyvi
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize

sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]

# trust_remote_code is required to load the custom BilingualModel wrapper
model = SentenceTransformer('Lajavaness/bilingual-embedding-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
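As a follow-up to the snippet above (not part of the original card): because the model ends with a `Normalize()` module, cosine similarity and dot product coincide, so the two sentences can be scored directly.

```python
# Continues the snippet above; `model` and `sentences` are defined there.
from sentence_transformers import util

embeddings = model.encode(sentences)
score = util.cos_sim(embeddings[0], embeddings[1])  # cosine similarity of the FR/EN pair
print(float(score))                                 # values near 1.0 indicate near-equivalent meaning
```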
## Evaluation

TODO

## Citation
@article{conneau2019unsupervised,
  title={Unsupervised cross-lingual representation learning at scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}
README.md CHANGED
@@ -5,86 +5,91 @@ tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
-
+ - transformers
+ - sentence-embedding
+ license: apache-2.0
+ language:
+ - fr
+ metrics:
+ - pearsonr
+ - spearmanr
  ---

- # {MODEL_NAME}
+ # [bilingual-embedding-base](https://huggingface.co/Lajavaness/bilingual-embedding-base)

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->
-
- ## Usage (Sentence-Transformers)
+ bilingual-embedding is an embedding model for bilingual French-English text. It is a specialized sentence-embedding model trained for this bilingual setting, leveraging the robust capabilities of [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), a pre-trained multilingual language model. The model encodes English and French sentences into a 1024-dimensional vector space, facilitating a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of English and French sentences, reflecting both the lexical and contextual layers of the languages.
+
+
+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Training and Fine-tuning process
+ ### Stage 1: NLI Training
+ - Dataset: SNLI + XNLI (English and French)
+ - Method: Training with Multiple Negatives Ranking Loss. This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics.
+ ### Stage 3: Continued Fine-tuning for Semantic Textual Similarity on STS Benchmark
+ - Dataset: STSB-fr and STSB-en
+ - Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the `sentence-transformers` library.
+ ### Stage 4: Advanced Augmentation Fine-tuning
+ - Dataset: STSB-vn, augmented with [silver samples generated from gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
+ - Method: An advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair-sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by enriching the training data dynamically, enhancing the model's robustness and accuracy.
+
+
+ ## Usage

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
+ pip install -q pyvi
  ```

  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
+ from pyvi.ViTokenizer import tokenize

- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
+ sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]

+ model = SentenceTransformer('Lajavaness/bilingual-embedding-base', trust_remote_code=True)
+ embeddings = model.encode(sentences)
+ print(embeddings)
+
+ ```


- ## Evaluation Results
+ ## Evaluation

- <!--- Describe how your model was evaluated -->
+ TODO

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-
-
- ## Training
- The model was trained with the parameters:
-
- **DataLoader**:
-
- `torch.utils.data.dataloader.DataLoader` of length 1438 with parameters:
- ```
- {'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```
-
- **Loss**:
-
- `__main__.CosineSimilarityLoss`
-
- Parameters of the fit()-Method:
- ```
- {
-     "epochs": 10,
-     "evaluation_steps": 1000,
-     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "eps": 1e-06,
-         "lr": 5e-07
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 1438,
-     "weight_decay": 0.01
- }
- ```
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->
+ ## Citation
+
+ @article{conneau2019unsupervised,
+   title={Unsupervised cross-lingual representation learning at scale},
+   author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
+   journal={arXiv preprint arXiv:1911.02116},
+   year={2019}
+ }
+
+ @article{reimers2019sentence,
+   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
+   author={Reimers, Nils and Gurevych, Iryna},
+   journal={arXiv preprint arXiv:1908.10084},
+   year={2019}
+ }
+
+ @article{thakur2020augmented,
+   title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
+   author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
+   journal={arXiv e-prints},
+   pages={arXiv--2010},
+   year={2020}
+ }