SentenceTransformer based on tintnguyen/bert-base-vi-uncased-st

This is a sentence-transformers model fine-tuned from tintnguyen/bert-base-vi-uncased-st. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: tintnguyen/bert-base-vi-uncased-st
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

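Documentation: Sentence Transformers Documentation (https://sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)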

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
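
The Pooling module above has only pooling_mode_mean_tokens enabled, so the sentence vector is the average of the token vectors. The following is a minimal NumPy sketch of that idea, not the library's own code. The function name mean_pool and the toy shapes are illustrative assumptions, and the masking of padded positions is assumed to follow the usual attention-mask convention.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against empty rows
    return summed / counts

# Toy input: 2 sequences, 3 token positions, 4 dimensions (the model itself uses 768).
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])  # first sequence has one padded position
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

Each output row is one fixed-size sentence vector regardless of how many tokens the input had.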

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tintnguyen/bert-base-vi-uncased-st-2")
# Run inference
sentences = [
    'những gì cần giáo dục để trở thành một y tá',
    'Nếu bạn quan tâm đến việc trở thành một y tá, bạn phải có bằng cử nhân và bằng tốt nghiệp, cũng như duy trì giấy phép và chứng chỉ hiện tại. Tiếp tục đọc để biết tổng quan về các bước giáo dục và yêu cầu chuyên môn của các học viên y tá. Tổng quan về Học viên Y tá. Những người hành nghề y tá là những y tá đã đăng ký (RN) có trình độ học vấn bổ sung cho phép họ đảm nhận vai trò nhà cung cấp dịch vụ chăm sóc sức khỏe ban đầu tương tự như một bác sĩ, bao gồm khả năng kê đơn thuốc.',
    'Mũi Capricorn ::: Mũi Capricorn là một mũi đất ở vùng bờ biển miền trung bang Queensland, Úc.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
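
For semantic search, the same cosine similarity can be used to rank passages against a query. The sketch below uses toy 3-dimensional vectors so that it runs without downloading the model. The helper name rank_by_cosine is a hypothetical one introduced here for illustration.

```python
import numpy as np

def rank_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray):
    """Return document indices sorted by descending cosine similarity, with their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                  # cosine similarity of each document to the query
    order = np.argsort(-scores)     # highest similarity first
    return order, scores[order]

# Toy vectors standing in for real embeddings.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [0.9, 0.1, 0.0],    # nearly parallel to the query
    [-1.0, 0.0, 0.0],   # opposite direction
])
order, scores = rank_by_cosine(query, docs)
print(order.tolist())
```

With the embeddings from the snippet above, rank_by_cosine(embeddings[0], embeddings[1:]) would rank the two passages against the question in the first sentence.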

Training Details

Training Dataset

Unnamed Dataset

Size: 4,859,206 training samples
Columns: anchor and positive
Approximate statistics based on the first 1000 samples:

	anchor	positive
type	string	string
details	min: 4 tokens mean: 10.61 tokens max: 27 tokens	min: 19 tokens mean: 84.55 tokens max: 433 tokens

Samples:

anchor	positive
`ricardo emanuel silva là ai`	`Ricardo Enrique Silva ::: Ricardo Enrique Silva là một bác sĩ và nhà bất đồng chính kiến người Cuba.Ông đã bị chính quyền Cuba bắt trong vụ mùa xuân đen năm 2003 và bị xử phạt 10 năm tù.`
`gốc của ngôn ngữ yiddish là gì`	`Tiếng Yiddish là ngôn ngữ lịch sử của người Do Thái Ashkenazic (Trung và Đông Âu), và là ngôn ngữ văn học chính thứ ba trong lịch sử Do Thái, sau tiếng Do Thái cổ điển và (Do Thái) Aramaic .iddish là ngôn ngữ lịch sử của người Do Thái Ashkenazic (Trung và Đông Âu), và là ngôn ngữ văn học chính thứ ba trong lịch sử Do Thái, sau tiếng Do Thái cổ điển và tiếng Aram (Do Thái).`
`xã moulin-sous-touvent nằm ở quốc gia nào`	`Moulin-sous-Touvent ::: Moulin-sous-Touvent là một xã thuộc tỉnh Oise trong vùng Hauts-de-France phía bắc nước Pháp. Xã này nằm ở khu vực có độ cao trung bình 93 mét trên mực nước biển.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
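
To make the two parameters concrete: this loss treats every other positive in the batch as a negative for a given anchor, multiplies the cosine similarities by scale, and applies cross-entropy with the matching pair as the target. The library implements this in PyTorch; the NumPy version below is only a sketch of the computation, and the function name mnrl_loss is an assumption made for illustration.

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss over scaled cosine similarities.

    Row i of `positives` is the correct match for row i of `anchors`;
    all other rows in the batch act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                              # (batch, batch) scaled cos_sim
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))              # cross-entropy on the diagonal

# Toy batch of 3 pairs: perfectly aligned pairs versus mismatched pairs.
vectors = np.eye(3)
aligned = mnrl_loss(vectors, vectors)
shuffled = mnrl_loss(vectors, np.roll(vectors, 1, axis=0))
print(aligned < shuffled)
```

A larger batch gives each anchor more in-batch negatives, which is one reason this loss is usually paired with a batch sampler that avoids duplicate texts.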

Evaluation Dataset

Unnamed Dataset

Size: 300 evaluation samples
Columns: anchor and positive
Approximate statistics based on the first 300 samples:

	anchor	positive
type	string	string
details	min: 6 tokens mean: 10.63 tokens max: 23 tokens	min: 15 tokens mean: 86.9 tokens max: 512 tokens

Samples:

anchor	positive
`dân số của xã west prairie là bao nhiêu`	`Xã West Prairie, Quận Poinsett, Arkansas ::: Xã West Prairie (tiếng Anh: West Prairie Township) là một xã thuộc quận Poinsett, tiểu bang Arkansas, Hoa Kỳ. Năm 2010, dân số của xã này là 894 người.`
`đường giai quan là gì`	`Đường Cái Quan ::: Đường Cái Quan hay đường Thiên lý, cũng có khi gọi là đường Quan lộ, hay đường Quan báo là một con đường dài chạy từ miền Bắc Việt Nam đến miền Nam Việt Nam, chủ yếu đắp vào đầu thế kỷ 19.`
`đài bắc là gì`	`Đài Bắc ::: Đài Bắc (tiếng Trung: 臺北市; bính âm: Táiběi Shì, Hán Việt: Đài Bắc thị; đọc theo IPA: tʰǎipèi trong tiếng Phổ thông) là thủ đô của Trung Hoa Dân Quốc (Đài Loan) và là thành phố trung tâm của một vùng đô thị lớn nhất tại Đài Loan, một trong sáu thành phố trực thuộc Trung ương của Đài Loan. Đài Bắc nằm ở đầu phía bắc của đảo chính và nằm bên sông Đạm Thủy, cách thành phố cảng Thái Bình Dương Cơ Long 25 km về phía đông bắc. Một thành phố ven biển khác, mà nay trở thành một quận của Tân Bắc là Đạm Thủy, nơi này cách Đài Bắc 20 km về phía tây bắc và nằm ở cửa con sông cùng tên thuộc eo biển Đài Loan. Đài Bắc nằm trên hai thung lũng tương đối hẹp tạo bởi sông Cơ Long (基隆河) và sông Tân Điếm (新店溪), hai sông hợp lưu tạo thành sông Đạm Thủy và chảy dọc theo ranh giới phía tây của thành phố.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 64
per_device_eval_batch_size: 32
learning_rate: 2e-05
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
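
As a rough worked example of what learning_rate and warmup_ratio imply, the sketch below computes a linear warmup followed by linear decay. It assumes a single device, the 4,859,206 training samples and batch size of 64 stated in this card, the 3 epochs and linear scheduler listed under All Hyperparameters, and a Transformers-style schedule shape. The no_duplicates batch sampler may skip some samples, so real step counts can differ slightly. The function name linear_schedule_lr is hypothetical.

```python
import math

def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values taken from this card; single-device training is an assumption.
num_samples = 4_859_206
batch_size = 64
epochs = 3
base_lr = 2e-05
warmup_ratio = 0.1

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

Under these assumptions the learning rate peaks at 2e-05 once warmup ends and then falls linearly to zero at the final step.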

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 64
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss
0.0066	500	0.1818	-
0.0132	1000	0.1624	0.0828
0.0198	1500	0.1525	-
0.0263	2000	0.1316	0.0506
0.0329	2500	0.1182	-
0.0395	3000	0.1197	0.0450
0.0461	3500	0.1101	-
0.0527	4000	0.1057	0.0437
0.0593	4500	0.1031	-
0.0659	5000	0.0987	0.0459
0.0724	5500	0.0989	-
0.0790	6000	0.0978	0.0480
0.0856	6500	0.0877	-
0.0922	7000	0.0851	0.0396
0.0988	7500	0.0871	-
0.1054	8000	0.0878	0.0427
0.1120	8500	0.0875	-
0.1185	9000	0.0837	0.0388
0.1251	9500	0.0835	-
0.1317	10000	0.0796	0.0293
0.1383	10500	0.0835	-
0.1449	11000	0.0839	0.0351
0.1515	11500	0.0797	-
0.1580	12000	0.0789	0.0351
0.1646	12500	0.0791	-
0.1712	13000	0.0774	0.0354
0.1778	13500	0.08	-
0.1844	14000	0.074	0.0287
0.1910	14500	0.0745	-
0.1976	15000	0.0786	0.0307
0.2041	15500	0.0733	-
0.2107	16000	0.0733	0.0245
0.2173	16500	0.0749	-
0.2239	17000	0.0742	0.0289
0.2305	17500	0.0708	-
0.2371	18000	0.0714	0.0279
0.2437	18500	0.0755	-
0.2502	19000	0.0738	0.0252
0.2568	19500	0.0747	-
0.2634	20000	0.0738	0.0287
0.2700	20500	0.0722	-
0.2766	21000	0.0723	0.0279
0.2832	21500	0.0747	-
0.2898	22000	0.0713	0.0296
0.2963	22500	0.0721	-
0.3029	23000	0.0783	0.0318
0.3095	23500	0.0714	-
0.3161	24000	0.0727	0.0260
0.3227	24500	0.0701	-
0.3293	25000	0.0706	0.0313
0.3359	25500	0.0696	-
0.3424	26000	0.0722	0.0287
0.3490	26500	0.0684	-
0.3556	27000	0.071	0.0269
0.3622	27500	0.0694	-
0.3688	28000	0.0677	0.0322
0.3754	28500	0.0658	-
0.3820	29000	0.0676	0.0276
0.3885	29500	0.0666	-
0.3951	30000	0.0639	0.0251
0.4017	30500	0.067	-
0.4083	31000	0.0653	0.0221
0.4149	31500	0.064	-
0.4215	32000	0.0695	0.0261
0.4280	32500	0.0667	-
0.4346	33000	0.0641	0.0279
0.4412	33500	0.0632	-
0.4478	34000	0.0622	0.0212
0.4544	34500	0.0594	-
0.4610	35000	0.0611	0.0214
0.4676	35500	0.0614	-
0.4741	36000	0.0604	0.0186
0.4807	36500	0.06	-
0.4873	37000	0.0628	0.0196
0.4939	37500	0.0619	-
0.5005	38000	0.065	0.0194
0.5071	38500	0.0595	-
0.5137	39000	0.0614	0.0168
0.5202	39500	0.0585	-
0.5268	40000	0.0593	0.0199
0.5334	40500	0.0597	-
0.5400	41000	0.0557	0.0173
0.5466	41500	0.054	-
0.5532	42000	0.0586	0.0166
0.5598	42500	0.0535	-
0.5663	43000	0.0548	0.0169
0.5729	43500	0.0555	-
0.5795	44000	0.0555	0.0166
0.5861	44500	0.0579	-
0.5927	45000	0.0524	0.0234
0.5993	45500	0.0508	-
0.6059	46000	0.0604	0.0260
0.6124	46500	0.0562	-
0.6190	47000	0.0578	0.0217
0.6256	47500	0.0566	-
0.6322	48000	0.0556	0.0189
0.6388	48500	0.0538	-
0.6454	49000	0.0511	0.0178
0.6520	49500	0.0526	-
0.6585	50000	0.0528	0.0259
0.6651	50500	0.05	-
0.6717	51000	0.0531	0.0193
0.6783	51500	0.0572	-
0.6849	52000	0.0532	0.0184
0.6915	52500	0.0545	-
0.6980	53000	0.0557	0.0203
0.7046	53500	0.0542	-
0.7112	54000	0.0535	0.0174
0.7178	54500	0.0533	-
0.7244	55000	0.0523	0.0181
0.7310	55500	0.0527	-
0.7376	56000	0.0515	0.0237
0.7441	56500	0.0536	-
0.7507	57000	0.0523	0.0173
0.7573	57500	0.0498	-
0.7639	58000	0.0491	0.0162
0.7705	58500	0.0496	-
0.7771	59000	0.0503	0.0194
0.7837	59500	0.0505	-
0.7902	60000	0.0488	0.0241
0.7968	60500	0.0513	-
0.8034	61000	0.0522	0.0225
0.8100	61500	0.0507	-
0.8166	62000	0.0521	0.0219
0.8232	62500	0.0494	-
0.8298	63000	0.049	0.0169
0.8363	63500	0.0483	-
0.8429	64000	0.0492	0.0192
0.8495	64500	0.0494	-
0.8561	65000	0.0501	0.0180
0.8627	65500	0.0493	-
0.8693	66000	0.0492	0.0206
0.8759	66500	0.0473	-
0.8824	67000	0.0511	0.0216
0.8890	67500	0.0477	-
0.8956	68000	0.049	0.0216
0.9022	68500	0.0502	-
0.9088	69000	0.0548	0.0198
0.9154	69500	0.0474	-
0.9220	70000	0.0487	0.0183
0.9285	70500	0.0452	-
0.9351	71000	0.046	0.0161
0.9417	71500	0.0491	-
0.9483	72000	0.0461	0.0169
0.9549	72500	0.0505	-
0.9615	73000	0.05	0.0174
0.9680	73500	0.0506	-
0.9746	74000	0.0459	0.0168
0.9812	74500	0.0469	-
0.9878	75000	0.0444	0.0188
0.9944	75500	0.0513	-
1.0010	76000	0.0452	0.0190
1.0076	76500	0.0472	-
1.0141	77000	0.0466	0.0172
1.0207	77500	0.0497	-
1.0273	78000	0.0478	0.0169
1.0339	78500	0.0476	-
1.0405	79000	0.0492	0.0207
1.0471	79500	0.0464	-
1.0537	80000	0.0462	0.0176
1.0602	80500	0.0451	-
1.0668	81000	0.0461	0.0228
1.0734	81500	0.0465	-
1.0800	82000	0.0475	0.0201
1.0866	82500	0.0419	-
1.0932	83000	0.0406	0.0177
1.0998	83500	0.0431	-
1.1063	84000	0.0426	0.0190
1.1129	84500	0.0453	-
1.1195	85000	0.0407	0.0186
1.1261	85500	0.0417	-
1.1327	86000	0.0392	0.0154
1.1393	86500	0.0423	-
1.1459	87000	0.0414	0.0143
1.1524	87500	0.0418	-
1.1590	88000	0.0402	0.0148
1.1656	88500	0.0394	-
1.1722	89000	0.04	0.0136
1.1788	89500	0.0424	-
1.1854	90000	0.038	0.0131
1.1920	90500	0.0387	-
1.1985	91000	0.0422	0.0169
1.2051	91500	0.0367	-
1.2117	92000	0.0401	0.0137
1.2183	92500	0.0375	-
1.2249	93000	0.0394	0.0190
1.2315	93500	0.0372	-
1.2380	94000	0.0363	0.0160
1.2446	94500	0.0362	-
1.2512	95000	0.0371	0.0194
1.2578	95500	0.0363	-
1.2644	96000	0.0376	0.0147
1.2710	96500	0.0371	-
1.2776	97000	0.0363	0.0174
1.2841	97500	0.0363	-
1.2907	98000	0.0354	0.0172
1.2973	98500	0.0372	-
1.3039	99000	0.0358	0.0132
1.3105	99500	0.0353	-
1.3171	100000	0.0363	0.0131
1.3237	100500	0.0358	-
1.3302	101000	0.0359	0.0122
1.3368	101500	0.033	-
1.3434	102000	0.0356	0.0149
1.3500	102500	0.0323	-
1.3566	103000	0.0358	0.0124
1.3632	103500	0.034	-
1.3698	104000	0.0338	0.0141
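
In the table above, training loss is logged every 500 steps and validation loss every 1,000 steps, with a dash marking steps that had no evaluation. The following sketch shows one way to parse such rows and pick the evaluated step with the lowest validation loss. It uses five rows copied from the table; the names LOG_EXCERPT and parse_log are illustrative assumptions.

```python
# Five tab-separated rows copied from the log table: epoch, step, training loss, validation loss.
LOG_EXCERPT = """\
1.3171\t100000\t0.0363\t0.0131
1.3237\t100500\t0.0358\t-
1.3302\t101000\t0.0359\t0.0122
1.3368\t101500\t0.033\t-
1.3434\t102000\t0.0356\t0.0149
"""

def parse_log(text: str) -> list:
    """Parse tab-separated log rows; a '-' validation loss becomes None."""
    rows = []
    for line in text.strip().splitlines():
        epoch, step, train_loss, val_loss = line.split("\t")
        rows.append({
            "epoch": float(epoch),
            "step": int(step),
            "train_loss": float(train_loss),
            "val_loss": None if val_loss == "-" else float(val_loss),
        })
    return rows

rows = parse_log(LOG_EXCERPT)
evaluated = [r for r in rows if r["val_loss"] is not None]  # keep only steps with an evaluation
best = min(evaluated, key=lambda r: r["val_loss"])
print(best["step"], best["val_loss"])
```

Applied to the full table, the same approach would identify which logged step had the lowest validation loss.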

Framework Versions

Python: 3.11.10
Sentence Transformers: 3.3.1
Transformers: 4.46.3
PyTorch: 2.5.1+cu124
Accelerate: 1.1.1
Datasets: 3.1.0
Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
