SentenceTransformer based on Alibaba-NLP/gte-multilingual-mlm-base

This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-multilingual-mlm-base. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Alibaba-NLP/gte-multilingual-mlm-base
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
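The Pooling module above enables only `pooling_mode_cls_token`, so the sentence embedding is the hidden state of the first token in the sequence. A minimal pure-Python sketch of that idea follows, with toy 4-dimensional vectors standing in for the 768-dimensional hidden states. The function names `cls_pool` and `mean_pool` and the numbers are illustrative assumptions, not part of the library.

```python
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """Return the first token's vector, mimicking pooling_mode_cls_token=True."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average all token vectors; shown only for contrast (disabled in this model)."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


# Toy hidden states for a 3-token input: first (CLS) token, then two word tokens.
hidden = [
    [1.0, 0.0, 2.0, -1.0],
    [0.0, 3.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
]

sentence_embedding = cls_pool(hidden)  # only the first row is used
```

With CLS pooling the other token vectors influence the result only indirectly, through the attention layers that produced the first token's hidden state.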

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("seongil-dn/gte-gold-bs64")
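# Run inference on a Korean query ("When did humans land on the Moon?"),
# a relevant passage about Apollo 11, and a distractor passage that also mentions the Moon landing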
sentences = [
    '인간은 언제 달에 착륙했는가?',
    '아폴로 11호(Apollo 11)는 처음으로 달에 착륙한 유인 우주선이다. 아폴로 계획의 다섯 번째 유인우주비행인 동시에 세 번째 유인 달 탐사이기도 했다. 1969년 7월 16일에 발사되었으며 선장 닐 암스트롱, 사령선 조종사 마이클 콜린스, 달 착륙선 조종사 버즈 올드린이 탔다. 7월 20일 암스트롱과 올드린은 달에 발을 딛은 최초의 인류가 되었다. 당시 콜린스는 달 궤도를 돌고 있었다.',
    '사일런스는 닥터를 죽이기 위한 계획의 일환으로 우주복이 필요했으며, 전 인류에 걸쳐 \'암시 능력\'을 이용해 인류가 달에 가기 위한 연구를 하게 만들고 그 결과 인간이 만들어낸 우주복을 훔쳐 각종 최신 과학기술력을 탑재하여 개조한다. 하지만 사일런스가 "인간은 우릴 보고있을 때만 죽일 수 있다." 라고 말한 장면을 닥터가 아폴로 우주선의 송신 장치에 붙여놓아 아폴로 우주선이 달 착륙할때 TV 화면을 보고있던 전 세계 사람들에게 \'사일런스를 죽여라\'라는 암시가 걸리고 그 결과 사일런스는 1969년를 기점으로 더이상 인류에게 암시를 하지 못하게 되었다.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
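For semantic search, the similarity scores are typically used to rank passages against a query. The sketch below shows that ranking step with cosine similarity, the similarity function listed for this model. It uses toy 3-dimensional vectors so it runs without downloading anything; in practice the vectors would come from `model.encode`. The helper names `cosine_similarity` and `rank_passages` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (assumption: real ones are 768-dimensional).
query = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_passages(query, passages)
best_index, best_score = ranking[0]
```

Because cosine similarity ignores vector length, the second passage ranks first even though its norm differs from the query's.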

Training Details

Training Dataset

Unnamed Dataset

Size: 5,376 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 6 tokens mean: 14.66 tokens max: 83 tokens	min: 25 tokens mean: 151.18 tokens max: 512 tokens	min: 16 tokens mean: 169.71 tokens max: 512 tokens

Samples:

anchor	positive	negative
`로마의 면적은 서울시의 2배인가요?`	로마()는 이탈리아의 수도이자 라치오주의 주도로, 테베레 강 연안에 있다. 로마시의 행정구역 면적은 1,285km로 서울시의 2배정도이고, 2014년 인구는 290여만명이다. 로마시 권역의 인구는 430여만명이다. 로마 대도시현의 인구는 400만이 넘지만 밀라노나 나폴리 대도시현에 비해 면적이 3~4배 넓은 편이고 되려 로마시의 면적과 밀라노와 나폴리의 대도시현의 면적이 비슷하므로 세 도시 모두 300만 정도로 비슷한 규모의 도시라 볼 수 있다.	도봉구는 서울시청으로부터 약12km 동북부에 구의 중심인 방학동이 위치하며, 구 전체면적은 20.84km로 서울특별시 면적의 3.4%를 차지하고 있다. 도봉구 면적 중에서 가장 많이 차지하는 부분은 북한산국립공원을 비롯한 공원으로, 구면적의 48.2%인 10.05km에 달하고 있다. 서울시의 최북단에 위치한 도봉구는 동쪽으로 노원구 상계동과, 서쪽은 강북구 수유동·우이동과, 남쪽은 노원구 월계동 및 강북구 번동과 북쪽은 의정부시 호원동 등과 접하고 있는 서울 동북부의 관문 지역이다.
`로마의 면적은 서울시의 2배인가요?`	로마()는 이탈리아의 수도이자 라치오주의 주도로, 테베레 강 연안에 있다. 로마시의 행정구역 면적은 1,285km로 서울시의 2배정도이고, 2014년 인구는 290여만명이다. 로마시 권역의 인구는 430여만명이다. 로마 대도시현의 인구는 400만이 넘지만 밀라노나 나폴리 대도시현에 비해 면적이 3~4배 넓은 편이고 되려 로마시의 면적과 밀라노와 나폴리의 대도시현의 면적이 비슷하므로 세 도시 모두 300만 정도로 비슷한 규모의 도시라 볼 수 있다.	신안군(新安郡)은 유인도 72개와 무인도 932개로 이뤄져 있다. 섬의 면적만 (655km)에 달하고, 바다와 육지 넓이를 더한 신안군의 면적은 서울시의 22배나 된다. 이런 넓은 지역을 36곳의 치안센터와 파출소에 근무하는 목포경찰서 소속 경찰관 100여명이 관리해, 이전부터 치안 공백을 우려하는 주민들의 지적이 많았다. 신안군 한 사회단체 관계자는 "신안에 경찰서가 있었다면 염전 종사자 관리감독이 이처럼 방관 상태까지 이르지 않았을 것이다"고 주장했다.
`로마의 면적은 서울시의 2배인가요?`	로마()는 이탈리아의 수도이자 라치오주의 주도로, 테베레 강 연안에 있다. 로마시의 행정구역 면적은 1,285km로 서울시의 2배정도이고, 2014년 인구는 290여만명이다. 로마시 권역의 인구는 430여만명이다. 로마 대도시현의 인구는 400만이 넘지만 밀라노나 나폴리 대도시현에 비해 면적이 3~4배 넓은 편이고 되려 로마시의 면적과 밀라노와 나폴리의 대도시현의 면적이 비슷하므로 세 도시 모두 300만 정도로 비슷한 규모의 도시라 볼 수 있다.	`로마는 2015년 1월 1일부로 로마 수도 광역시의 행정 중심지가 되었다. 이 로마 수도 광역시는 로마 광역권에 북쪽으로 치비타베키아까지 뻗어나갔던 구 로마현을 대체했다. 로마 수도 광역시의 면적은 총 5,353제곱미터로 이탈리아에서 가장 크며, 리구리아주에 맞먹는다. 이와 더불어 로마는 라치오주의 주도이기도 하다.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
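To make the two parameters concrete, here is a pure-Python sketch of the idea behind this loss as described in the Henderson et al. reference cited below: each anchor is scored against every positive and negative in the batch with cosine similarity, the scores are multiplied by `scale`, and a cross-entropy loss treats the anchor's own positive as the correct class. This is an illustrative reconstruction under those assumptions, not the library's implementation; `mnr_loss` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cos_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity, matching similarity_fct = cos_sim; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnr_loss(
    anchors: List[Vector],
    positives: List[Vector],
    negatives: List[Vector],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[i]  # negative log-probability of the true positive
    return total / len(anchors)


# Toy batch of two (anchor, positive, negative) triplets in 2 dimensions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = mnr_loss(anchors, positives, negatives)
# Swapping the positives makes each anchor prefer the wrong candidate.
loss_swapped = mnr_loss(anchors, positives[::-1], negatives)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.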

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 64
gradient_accumulation_steps: 8
learning_rate: 0.0001
adam_epsilon: 1e-07
num_train_epochs: 10
warmup_ratio: 0.1
fp16: True
dataloader_drop_last: True
batch_sampler: no_duplicates
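The batch size and gradient accumulation settings combine into an effective batch size per optimizer update. The arithmetic below is a sketch that assumes a single training device, which this card does not state. It also assumes, as is my understanding of MultipleNegativesRankingLoss, that in-batch candidates are drawn from each forward batch of 64 triplets rather than from the accumulated 512.

```python
per_device_train_batch_size = 64
gradient_accumulation_steps = 8
num_devices = 1  # assumption: the device count is not given in this card

# Samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Candidates each anchor is compared against within one forward batch:
# every positive plus every negative in that batch of triplets.
candidates_per_anchor = 2 * per_device_train_batch_size
in_batch_negatives_per_anchor = candidates_per_anchor - 1
```

Under these assumptions, gradient accumulation smooths the gradient estimate but does not increase the number of negatives each anchor sees.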

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 64
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 8
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 0.0001
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-07
max_grad_norm: 1.0
num_train_epochs: 10
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: True
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
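The `lr_scheduler_type: linear`, `warmup_ratio: 0.1`, and `learning_rate: 0.0001` settings together describe a learning rate that ramps up and then decays. The sketch below uses the commonly used linear warmup and linear decay formula; the total of 100 optimizer steps is an assumption read from the Training Logs, and `linear_lr` is a hypothetical helper rather than a library function.

```python
import math


def linear_lr(
    step: int,
    base_lr: float = 1e-4,
    total_steps: int = 100,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate after `step` completed optimizer updates."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly from base_lr down to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Sample the schedule at a few points: start, mid-warmup, peak, midway, end.
schedule = {step: linear_lr(step) for step in (0, 5, 10, 55, 100)}
```

With these values the warmup covers the first 10 updates, and the peak learning rate of 0.0001 is reached at update 10.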

Training Logs

Epoch	Step	Training Loss
0.0952	1	5.6584
0.1905	2	5.6663
0.2857	3	5.2883
0.3810	4	5.5523
0.4762	5	5.5037
0.5714	6	5.1176
0.6667	7	4.9949
0.7619	8	5.0314
0.8571	9	4.4824
0.9524	10	4.1297
1.0952	11	3.6362
1.1905	12	2.9783
1.2857	13	2.6855
1.3810	14	2.1482
1.4762	15	1.9731
1.5714	16	1.6655
1.6667	17	1.5604
1.7619	18	1.3974
1.8571	19	1.2828
1.9524	20	1.3931
2.0952	21	1.0056
2.1905	22	0.8308
2.2857	23	0.7171
2.3810	24	0.6162
2.4762	25	0.6624
2.5714	26	0.5194
2.6667	27	0.5322
2.7619	28	0.457
2.8571	29	0.5596
2.9524	30	0.5194
3.0952	31	0.3777
3.1905	32	0.324
3.2857	33	0.2961
3.3810	34	0.2515
3.4762	35	0.2501
3.5714	36	0.2552
3.6667	37	0.1956
3.7619	38	0.1688
3.8571	39	0.207
3.9524	40	0.2219
4.0952	41	0.1458
4.1905	42	0.1345
4.2857	43	0.1421
4.3810	44	0.1228
4.4762	45	0.1158
4.5714	46	0.1105
4.6667	47	0.0788
4.7619	48	0.079
4.8571	49	0.111
4.9524	50	0.1202
5.0952	51	0.0685
5.1905	52	0.0834
5.2857	53	0.0711
5.3810	54	0.0694
5.4762	55	0.0627
5.5714	56	0.0655
5.6667	57	0.0576
5.7619	58	0.0467
5.8571	59	0.0582
5.9524	60	0.07
6.0952	61	0.0399
6.1905	62	0.0498
6.2857	63	0.0509
6.3810	64	0.0495
6.4762	65	0.0399
6.5714	66	0.0305
6.6667	67	0.0202
6.7619	68	0.0205
6.8571	69	0.0321
6.9524	70	0.048
7.0952	71	0.0231
7.1905	72	0.0388
7.2857	73	0.0241
7.3810	74	0.0227
7.4762	75	0.0241
7.5714	76	0.0252
7.6667	77	0.0202
7.7619	78	0.0171
7.8571	79	0.0277
7.9524	80	0.0352
8.0952	81	0.016
8.1905	82	0.0186
8.2857	83	0.0228
8.3810	84	0.0173
8.4762	85	0.0134
8.5714	86	0.0138
8.6667	87	0.0126
8.7619	88	0.0108
8.8571	89	0.0156
8.9524	90	0.0235
9.0952	91	0.0117
9.1905	92	0.0155
9.2857	93	0.0135
9.3810	94	0.0162
9.4762	95	0.0121
9.5714	96	0.0125
9.6667	97	0.0113
9.7619	98	0.0085
9.8571	99	0.0164
9.9524	100	0.0206
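The Epoch column can be reproduced from the dataset size and the hyperparameters listed above. The arithmetic below is a reconstruction from the numbers in this card, assuming a single device; it is not a description of the trainer's internals, and `logged_epoch` is a hypothetical helper.

```python
training_samples = 5376
per_device_train_batch_size = 64
gradient_accumulation_steps = 8
num_train_epochs = 10

# With dataloader_drop_last=True only full batches count (5376 / 64 is exact here).
batches_per_epoch = training_samples // per_device_train_batch_size
# Full accumulation windows per epoch; matches the 10 logged steps per epoch.
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs


def logged_epoch(step: int) -> float:
    """Fractional epoch for a 1-based optimizer step, rounded as in the table."""
    epoch_index = (step - 1) // steps_per_epoch
    step_in_epoch = (step - 1) % steps_per_epoch + 1
    fraction = step_in_epoch * gradient_accumulation_steps / batches_per_epoch
    return round(epoch_index + fraction, 4)


first_rows = [logged_epoch(s) for s in (1, 2, 10, 11)]
```

Each logged step advances the epoch counter by 8/84, which is why the last step of every epoch is logged at x.9524 and the next one at (x+1).0952.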

Framework Versions

Python: 3.10.13
Sentence Transformers: 3.2.1
Transformers: 4.44.2
PyTorch: 2.4.0+cu121
Accelerate: 1.1.1
Datasets: 2.21.0
Tokenizers: 0.19.1
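To compare a local environment against the versions listed above, a small check with the standard library's `importlib.metadata` can help. This is a sketch: the distribution names in `CARD_VERSIONS` are the usual pip names for these libraries, and `installed_version` and `version_report` are hypothetical helpers.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions copied from this card, keyed by pip distribution name.
CARD_VERSIONS: Dict[str, str] = {
    "sentence-transformers": "3.2.1",
    "transformers": "4.44.2",
    "torch": "2.4.0+cu121",
    "accelerate": "1.1.1",
    "datasets": "2.21.0",
    "tokenizers": "0.19.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to (version in the card, version installed locally)."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


report = version_report(CARD_VERSIONS)
for name, (want, have) in report.items():
    status = "ok" if have == want else "differs"
    print(f"{name}: card={want} installed={have} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one used for training.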

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
