license: llama3
language:
  - tr
model-index:
  - name: Kocdigital-LLM-8b-v0.1
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge TR
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc
            value: 44.03
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag TR
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc
            value: 46.73
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU TR
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 49.11
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA TR
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: acc
            name: accuracy
            value: 48.21
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande TR
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc
            value: 54.98
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k TR
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 51.78
            name: accuracy

KOCDIGITAL LLM

Kocdigital-LLM-8b-v0.1

This model is a fine-tuned version of a Llama3 8b Large Language Model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets created from various open-source and internal resources. The Turkish instruction dataset was carefully annotated so that the model carries out Turkish instructions accurately and in an organized manner. Training used the QLORA method.

Model Details

  • Base Model: Llama3 8B based LLM
  • Training Dataset: High Quality Turkish instruction sets
  • Training Method: SFT with QLORA

QLORA Fine-Tuning Configuration

  • lora_alpha: 128
  • lora_dropout: 0
  • r: 64
  • target_modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
  • bias: "none"
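
As a rough illustration of what this configuration implies, the sketch below computes the LoRA scaling factor (lora_alpha / r, assuming standard LoRA scaling rather than a rank-stabilized variant) and the number of trainable adapter parameters. The layer shapes are assumptions taken from the publicly documented Llama 3 8B architecture (hidden size 4096, MLP size 14336, 32 layers, 1024-dimensional key/value projections); they are not stated in this model card.

```python
# Values taken from the QLORA configuration listed above.
LORA_ALPHA = 128
R = 64

# ASSUMPTION: shapes of the public Llama 3 8B architecture, not from this card.
HIDDEN, KV_DIM, MLP, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for every module named in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_scaling(alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def adapter_params(shapes: dict, r: int, num_layers: int) -> int:
    """Each adapted module adds A (r x in) and B (out x r) matrices."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


if __name__ == "__main__":
    print("scaling:", lora_scaling(LORA_ALPHA, R))
    print("trainable adapter parameters:", adapter_params(TARGET_SHAPES, R, NUM_LAYERS))
```

Under these assumed shapes, the adapters account for roughly 168 million trainable parameters, a small fraction of the 8B base weights, which stay frozen and quantized during QLORA training.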

Usage Examples


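from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "KOCDIGITAL/Kocdigital-LLM-8b-v0.1"

# model_max_length is the tokenizer argument that caps the sequence length.
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=4096)

# 4-bit loading requires the bitsandbytes package and a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# System prompt (Turkish): "You are a general-purpose assistant that speaks Turkish.
# Always carry out the user's instructions accurately, concisely, and with good grammar."
system = 'Sen Türkçe konuşan genel amaçlı bir asistansın. Her zaman kullanıcının verdiği talimatları doğru, kısa ve güzel bir gramer ile yerine getir.'

# The system prompt and the instruction are packed into a single user turn.
template = "{}\n\n###Talimat\n{}\n###Yanıt\n"
# Instruction (Turkish): "Can you list the 3 largest provinces of Türkiye?"
content = template.format(system, 'Türkiyenin 3 büyük ilini listeler misin.')

conv = [{'role': 'user', 'content': content}]

# Render the conversation to a prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(conv,
                                       tokenize=False,
                                       add_generation_prompt=True)

print(prompt)

# add_special_tokens=False avoids adding special tokens a second time
# on top of those already inserted by the chat template.
inputs = tokenizer([prompt],
                   return_tensors="pt",
                   add_special_tokens=False).to(model.device)

outputs = model.generate(**inputs,
                         max_new_tokens=512,
                         use_cache=True,
                         do_sample=True,
                         top_k=50,
                         top_p=0.60,
                         temperature=0.3,
                         repetition_penalty=1.1)

# The decoded text contains the prompt followed by the generated answer.
out_text = tokenizer.batch_decode(outputs)[0]
print(out_text)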
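
The example above prints the prompt together with the answer. A minimal sketch of two small helpers follows: one wraps the instruction template shown above, and one drops the prompt tokens so that only the newly generated ids are decoded. The names `build_prompt` and `new_token_ids` are hypothetical and are not part of this model's repository or of the transformers library.

```python
# Instruction template copied from the usage example above.
TEMPLATE = "{}\n\n###Talimat\n{}\n###Yanıt\n"


def build_prompt(system: str, instruction: str) -> str:
    """Fill the template; stray surrounding whitespace is trimmed first."""
    return TEMPLATE.format(system.strip(), instruction.strip())


def new_token_ids(output_ids, prompt_length: int) -> list:
    """Return only the ids generated after the prompt.

    A decoder-only model returns the prompt ids followed by the new ids,
    so slicing at the prompt length isolates the answer.
    """
    return list(output_ids[prompt_length:])


if __name__ == "__main__":
    print(build_prompt("Sistem mesajı.", "Bir talimat."))
    # Toy ids: the first three stand in for the prompt, the rest for the answer.
    print(new_token_ids([11, 12, 13, 21, 22], prompt_length=3))
```

With the objects from the usage example, the answer alone could then be decoded with `tokenizer.decode(new_token_ids(outputs[0].tolist(), inputs["input_ids"].shape[1]), skip_special_tokens=True)`.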

Open LLM Turkish Leaderboard v0.2 Evaluation Results

| Metric | Value |
|---|---|
| Avg. | 49.11 |
| AI2 Reasoning Challenge_tr-v0.2 | 44.03 |
| HellaSwag_tr-v0.2 | 46.73 |
| MMLU_tr-v0.2 | 49.11 |
| TruthfulQA_tr-v0.2 | 48.51 |
| Winogrande_tr-v0.2 | 54.98 |
| GSM8k_tr-v0.2 | 51.78 |

Considerations on Limitations, Risks, Bias, and Ethical Factors

Limitations and Recognized Biases

  • Core Functionality and Usage: KocDigital LLM is an autoregressive language model whose core function is to predict the next token in a text sequence. Although such models are applied in many contexts, this model has not undergone comprehensive real-world testing, so its effectiveness and consistency across diverse situations remain largely unvalidated.

  • Language Understanding and Generation: The model's training focused mainly on standard English and Turkish. Its ability to understand and generate slang, colloquial language, or other languages may be limited, which can lead to errors or misinterpretations.

  • Production of Misleading Information: KocDigital LLM may generate incorrect or misleading information. Treat its outputs as starting points or suggestions rather than as definitive conclusions.

Ethical Concerns and Potential Risks

  • Risk of Misuse: KocDigital LLM carries the potential for generating language that could be offensive or harmful. We strongly advise against its utilization for such purposes and stress the importance of conducting thorough safety and fairness assessments tailored to specific applications before implementation.

  • Unintended Biases and Content: The model underwent training on a vast corpus of text data without explicit vetting for offensive material or inherent biases. Consequently, it may inadvertently generate content reflecting these biases or inaccuracies.

  • Toxicity: Despite efforts to curate appropriate training data, the model has the capacity to produce harmful content, particularly when prompted explicitly. We encourage active participation from the open-source community to devise strategies aimed at mitigating such risks.

Guidelines for Secure and Ethical Utilization

  • Human Oversight: We recommend adding human review or automated filters to monitor and improve output quality, particularly in public-facing applications. This reduces the likelihood of unexpectedly generating objectionable content.

  • Tailored Testing for Specific Applications: Developers planning to use KocDigital LLM should carry out comprehensive safety assessments and tuning tailored to their own applications. This step is essential because the model's responses can be unpredictable and may occasionally be biased, inaccurate, or offensive.

  • Responsible Development and Deployment: Developers and users of KocDigital LLM bear the responsibility for ensuring its ethical and secure application. We encourage users to be cognizant of the model's limitations and to implement appropriate measures to prevent misuse or adverse outcomes.
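
The human-oversight guideline above can be sketched as a simple screening step that flags generated text for manual review before it reaches users. This is only an illustration under stated assumptions: the names `Decision` and `screen_output` are hypothetical, the blocklist terms are placeholders, and plain keyword matching is a crude stand-in for the safety classifiers a real deployment would more likely use.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Result of screening one model output (hypothetical structure)."""
    text: str
    needs_review: bool
    matched: list = field(default_factory=list)


def screen_output(text: str, blocklist: list) -> Decision:
    """Flag the text for human review if it contains any blocklisted term.

    Matching is case-insensitive substring search; it will miss paraphrases
    and can produce false positives, so it complements review, not replaces it.
    """
    lowered = text.casefold()
    matched = [term for term in blocklist if term.casefold() in lowered]
    return Decision(text=text, needs_review=bool(matched), matched=matched)


if __name__ == "__main__":
    # Placeholder terms; a real list would be application specific.
    blocklist = ["placeholder_term_a", "placeholder_term_b"]
    decision = screen_output("An answer with PLACEHOLDER_TERM_A inside.", blocklist)
    if decision.needs_review:
        print("Hold for human review, matched:", decision.matched)
    else:
        print(decision.text)
```

Outputs flagged this way would be held back or routed to a reviewer, while unflagged outputs pass through unchanged.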