---
base_model: sentence-transformers/paraphrase-xlm-r-multilingual-v1
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:38739
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: '''Turks ve Caicos Adaları''ndaki Afrikalıların nüfusu nedir?'
    sentences:
      - |-
        CREATE TABLEethnicGroup (
            Country TEXT,
            Name TEXT PRIMARY KEY,
            Percentage REAL,
            FOREIGN KEY (Country) REFERENCES country(None)
        );
      - |-
        CREATE TABLEPatient (
            ID INTEGER PRIMARY KEY,
            SEX TEXT,
            Birthday DATE,
            Description DATE,
            First Date DATE,
            Admission TEXT,
            Diagnosis TEXT
        );
      - |-
        CREATE TABLEwrites (
            paperId INTEGER PRIMARY KEY,
            authorId INTEGER,
            FOREIGN KEY (authorId) REFERENCES author(authorId),
            FOREIGN KEY (paperId) REFERENCES paper(paperId)
        );
  - source_sentence: Teksas'ın başkenti nedir
    sentences:
      - |-
        CREATE TABLEprofessor (
            EMP_NUM INT,
            DEPT_CODE varchar(10),
            PROF_OFFICE varchar(50),
            PROF_EXTENSION varchar(4),
            PROF_HIGH_DEGREE varchar(5),
            FOREIGN KEY (DEPT_CODE) REFERENCES DEPARTMENT(DEPT_CODE),
            FOREIGN KEY (EMP_NUM) REFERENCES EMPLOYEE(EMP_NUM)
        );
      - |-
        CREATE TABLEBusiness_Hours (
            business_id INTEGER PRIMARY KEY,
            day_id INTEGER,
            opening_time TEXT,
            closing_time TEXT,
            FOREIGN KEY (day_id) REFERENCES Days(None),
            FOREIGN KEY (business_id) REFERENCES Business(None)
        );
      - |-
        CREATE TABLEstate (
            state_name TEXT PRIMARY KEY,
            population INTEGER,
            area double,
            country_name varchar(3),
            capital TEXT,
            density double
        );
  - source_sentence: >-
      'Mad Max: Fury Road' filminde çalışan 10 ekibin işlerinin yanı sıra
      listeleyin.
    sentences:
      - |-
        CREATE TABLEmovie (
            movie_id INTEGER PRIMARY KEY,
            title TEXT,
            budget INTEGER,
            homepage TEXT,
            overview TEXT,
            popularity REAL,
            release_date DATE,
            revenue INTEGER,
            runtime INTEGER,
            movie_status TEXT,
            tagline TEXT,
            vote_average REAL,
            vote_count INTEGER
        );
      - |-
        CREATE TABLEstudent (
            STU_NUM INT PRIMARY KEY,
            STU_LNAME varchar(15),
            STU_FNAME varchar(15),
            STU_INIT varchar(1),
            STU_DOB datetime,
            STU_HRS INT,
            STU_CLASS varchar(2),
            STU_GPA float(8),
            STU_TRANSFER numeric,
            DEPT_CODE varchar(18),
            STU_PHONE varchar(4),
            PROF_NUM INT,
            FOREIGN KEY (DEPT_CODE) REFERENCES DEPARTMENT(DEPT_CODE)
        );
      - |-
        CREATE TABLEFinancial_transactions (
            transaction_id INTEGER,
            account_id INTEGER,
            invoice_number INTEGER,
            transaction_type VARCHAR(15),
            transaction_date DATETIME,
            transaction_amount DECIMAL(19,4),
            transaction_comment VARCHAR(255),
            other_transaction_details VARCHAR(255),
            FOREIGN KEY (account_id) REFERENCES Accounts(account_id),
            FOREIGN KEY (invoice_number) REFERENCES Invoices(invoice_number)
        );
  - source_sentence: >-
      Tüm müşterilerin ortalama yaşının %80'inden daha büyük yaştaki
      müşterilerin gelirlerini ve sakin sayısını listeler misiniz?
    sentences:
      - |-
        CREATE TABLECustomers (
            ID INTEGER PRIMARY KEY,
            SEX TEXT,
            MARITAL_STATUS TEXT,
            GEOID INTEGER,
            EDUCATIONNUM INTEGER,
            OCCUPATION TEXT,
            age INTEGER,
            FOREIGN KEY (GEOID) REFERENCES Demog(None)
        );
      - |-
        CREATE TABLEauthors (
            authID INTEGER PRIMARY KEY,
            lname TEXT,
            fname TEXT
        );
      - |-
        CREATE TABLEcoaches (
            coachID TEXT PRIMARY KEY,
            year INTEGER,
            tmID TEXT,
            lgID TEXT,
            stint INTEGER,
            won INTEGER,
            lost INTEGER,
            post_wins INTEGER,
            post_losses INTEGER,
            FOREIGN KEY (tmID) REFERENCES teams(tmID),
            FOREIGN KEY (year) REFERENCES teams(year)
        );
  - source_sentence: Eleanor Hunt'a ait kaç tane kiralama kimliği var?
    sentences:
      - |-
        CREATE TABLEsinger (
            Singer_ID INT PRIMARY KEY,
            Name TEXT,
            Country TEXT,
            Song_Name TEXT,
            Song_release_year TEXT,
            Age INT,
            Is_male bool
        );
      - |-
        CREATE TABLEdistrict (
            District_ID INT PRIMARY KEY,
            District_name TEXT,
            Headquartered_City TEXT,
            City_Population REAL,
            City_Area REAL
        );
      - |-
        CREATE TABLEcustomer (
            customer_id INTEGER PRIMARY KEY,
            store_id INTEGER,
            first_name TEXT,
            last_name TEXT,
            email TEXT,
            address_id INTEGER,
            active INTEGER,
            create_date DATETIME,
            last_update DATETIME,
            FOREIGN KEY (address_id) REFERENCES address(None),
            FOREIGN KEY (store_id) REFERENCES store(None)
        );
---

SentenceTransformer based on sentence-transformers/paraphrase-xlm-r-multilingual-v1

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-xlm-r-multilingual-v1. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-xlm-r-multilingual-v1
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
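
Concretely, the XLM-RoBERTa encoder produces one 768-dimensional vector per token (up to the 128-token maximum sequence length), and the Pooling module averages those vectors using the attention mask. The sketch below reproduces that mean-pooling step with plain transformers; it assumes the transformer weights and tokenizer can be loaded directly from this repository and is only meant to show what the Pooling configuration above computes. The normal way to use the model is the Sentence Transformers API shown in the Usage section.

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "nypgd/fine-tuned-sentence-transformer_last"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # XLMRobertaModel

batch = tokenizer(["Teksas'ın başkenti nedir"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state      # (batch, seq_len, 768)

# pooling_mode_mean_tokens=True: zero out padding positions, then average token vectors
mask = batch["attention_mask"].unsqueeze(-1).float()            # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embedding.shape)                                 # torch.Size([1, 768])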

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("nypgd/fine-tuned-sentence-transformer_last")
# Run inference
sentences = [
    "Eleanor Hunt'a ait kaç tane kiralama kimliği var?",
    'CREATE TABLEcustomer (\n    customer_id INTEGER PRIMARY KEY,\n    store_id INTEGER,\n    first_name TEXT,\n    last_name TEXT,\n    email TEXT,\n    address_id INTEGER,\n    active INTEGER,\n    create_date DATETIME,\n    last_update DATETIME,\n    FOREIGN KEY (address_id) REFERENCES address(None),\n    FOREIGN KEY (store_id) REFERENCES store(None)\n);',
    'CREATE TABLEdistrict (\n    District_ID INT PRIMARY KEY,\n    District_name TEXT,\n    Headquartered_City TEXT,\n    City_Population REAL,\n    City_Area REAL\n);',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
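
Building on the snippet above: as the widget examples at the top of this card suggest, a typical use of this model is retrieving the CREATE TABLE schema that best matches a Turkish natural-language question. The sketch below illustrates that retrieval step with util.semantic_search from the sentence-transformers library; the question and the two candidate schemas are copied from the widget examples, while the ranking code itself is an illustration rather than part of the released card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nypgd/fine-tuned-sentence-transformer_last")

# Turkish question and candidate schemas, taken from the widget examples above
query = "Teksas'ın başkenti nedir"
schemas = [
    "CREATE TABLEstate (state_name TEXT PRIMARY KEY, population INTEGER, area double, "
    "country_name varchar(3), capital TEXT, density double);",
    "CREATE TABLEdistrict (District_ID INT PRIMARY KEY, District_name TEXT, "
    "Headquartered_City TEXT, City_Population REAL, City_Area REAL);",
]

query_emb = model.encode(query, convert_to_tensor=True)
schema_embs = model.encode(schemas, convert_to_tensor=True)

# Rank the candidate schemas by cosine similarity to the question
best = util.semantic_search(query_emb, schema_embs, top_k=1)[0][0]
print(schemas[best["corpus_id"]], best["score"])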

Training Details

Training Dataset

Unnamed Dataset

  • Size: 38,739 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0: string (min: 7 tokens, mean: 19.22 tokens, max: 63 tokens)
    • sentence_1: string (min: 11 tokens, mean: 73.6 tokens, max: 128 tokens)
  • Samples:
    sentence_0: en büyük alana sahip eyaleti belirtin
    sentence_1: CREATE TABLEstate (
        state_name TEXT PRIMARY KEY,
        population INTEGER,
        area double,
        country_name varchar(3),
        capital TEXT,
        density double
    );

    sentence_0: Law & Order'ın hangi bölümleri Primetime Emmy Ödülleri'ne aday gösterildi?
    sentence_1: CREATE TABLEAward (
        award_id INTEGER PRIMARY KEY,
        organization TEXT,
        year INTEGER,
        award_category TEXT,
        award TEXT,
        series TEXT,
        episode_id TEXT,
        person_id TEXT,
        role TEXT,
        result TEXT,
        FOREIGN KEY (person_id) REFERENCES Person(person_id),
        FOREIGN KEY (episode_id) REFERENCES Episode(episode_id)
    );

    sentence_0: Albümü "Universal Music Group" etiketi altında yer alan tüm şarkıların isimleri nelerdir?
    sentence_1: CREATE TABLEtracklists (
        AlbumId INTEGER PRIMARY KEY,
        Position INTEGER,
        SongId INTEGER,
        FOREIGN KEY (AlbumId) REFERENCES Albums(AId),
        FOREIGN KEY (SongId) REFERENCES Songs(SongId)
    );
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin
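
Together with the MultipleNegativesRankingLoss parameters listed under the training dataset (scale 20.0, cosine similarity), the non-default values above correspond to a training setup roughly like the following sketch. This is a hedged reconstruction, not the original training script: the tiny in-memory dataset stands in for the real 38,739 (sentence_0, sentence_1) pairs, and the output directory name is hypothetical.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder rows standing in for the 38,739 (sentence_0, sentence_1) training pairs
train_dataset = Dataset.from_dict({
    "sentence_0": ["Teksas'ın başkenti nedir"],
    "sentence_1": ["CREATE TABLEstate (state_name TEXT PRIMARY KEY, capital TEXT);"],
})

model = SentenceTransformer("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
loss = MultipleNegativesRankingLoss(model)  # defaults: scale=20.0, similarity_fct=cos_sim

args = SentenceTransformerTrainingArguments(
    output_dir="fine-tuned-sentence-transformer",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

trainer = SentenceTransformerTrainer(model=model, args=args,
                                     train_dataset=train_dataset, loss=loss)
trainer.train()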

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
0.2064 500 0.5621
0.4129 1000 0.295
0.6193 1500 0.2644
0.8258 2000 0.2035
1.0322 2500 0.184
1.2386 3000 0.1237
1.4451 3500 0.1008
1.6515 4000 0.0984
1.8580 4500 0.0841
0.2064 500 0.1214
0.4129 1000 0.1139
0.6193 1500 0.11
0.8258 2000 0.0999

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.4
  • PyTorch: 2.4.0+cu121
  • Accelerate: 0.32.1
  • Datasets: 2.21.0
  • Tokenizers: 0.19.1
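
To approximate this environment, the library versions above can be pinned explicitly (a suggested command, not part of the original card; the +cu121 PyTorch build comes from the PyTorch wheel index rather than plain PyPI):

pip install sentence-transformers==3.0.1 transformers==4.42.4 torch==2.4.0 accelerate==0.32.1 datasets==2.21.0 tokenizers==0.19.1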

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}