mm4-3b / README.md
limin(gate)
Adding Evaluation Results (#1)
3fe0c49 verified
metadata
license: apache-2.0
datasets:
  - teknium/GPT4-LLM-Cleaned
  - vicgalle/alpaca-gpt4
model-index:
  - name: mm4-3b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 44.8
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 70.41
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 50.9
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.2
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 66.22
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 43.82
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=liminerity/mm4-3b
          name: Open LLM Leaderboard

MM4-3b

a llama based model i made thru extensive training and merging ill explain later i literally made so many models today

Title: Divergent Knowledge Enhancement through Retrograde Merging Strategies: Redefining Accuracy Perspectives in Language Model Evolution

Abstract: Have you picked up any bad habits, or have you ever learned to do something incorrectly, only to realize you must completly relearn whatever it is you're trying to accomplish? In this proposal, we present an innovative and unconventional approach to enhancing the performance and knowledge base of natural language models. Our proposed method, titled 'Divergent Knowledge Enhancement through Retrograde Merging Strategies' (DKE-RS), aims to challenge traditional practices in model development by incorporating a deliberate back-and-forth merger between high and low accuracy language models.

The initial conceptualization of DKE-RS stemmed from the realization that learning often encompasses both acquisition and unlearning, as encapsulated by the quote, "learning is just as sacred as unlearning." The proposed technique commences with a baseline model, 'blur-7b,' attaining an accuracy rate of 72.1%, subsequently merged with a Mistral fine-tuned model on the Dolphin dataset, only achieving a 46% accuracy level.

By deliberately merging with less accurate models and retracing the evolutionary process, DKE-RS aims to broaden the knowledge base of the resulting model. This strategy, dubbed 'making the bad good,' intentionally degrades the initial accuracy in an effort to refine it, thus breaking conventional iterative improvements for innovative progression.

image/png

The DKE-RS method challenges the status quo by not solely relying on a linear enhancement trajectory, instead adopting a more holistic and diverse approach. We anticipate that this non-linear merger process will further diversify the model's knowledge base, thereby creating a more resilient and well-rounded language generation tool, capable of handling complex contexts with a broader understanding.

Through thorough experimentation and analysis, we plan to assess the effectiveness and potential drawbacks of DKE-RS, comparing it to traditional merging techniques. The results from such evaluations will provide valuable insights into the efficacy of this divergent strategy in the landscape of natural language model development.

We posit that the Divergent Knowledge Enhancement through Retrograde Merging Strategies approach contributes a significant and compelling step forward in the field, provoking thought-provoking discourse about the nature of accuracy refinement and model progression.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 53.22
AI2 Reasoning Challenge (25-Shot) 44.80
HellaSwag (10-Shot) 70.41
MMLU (5-Shot) 50.90
TruthfulQA (0-shot) 43.20
Winogrande (5-shot) 66.22
GSM8k (5-shot) 43.82