chainbase committed on
Commit a9344e7
1 Parent(s): f2bacc6

init repository file

Files changed (2)
  1. .gitattributes +1 -0
  2. README.md +106 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,109 @@
  ---
+
  license: llama3.1
+ base_model: Llama-3.1-8B-Instruct
+ pipeline_tag: text-generation
+ library_name: transformers
+
  ---
+
+ # Changelog
+
+ - [2024.10.30] Released [Theia-Llama-3.1-8B-v1.1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1.1), supervised fine-tuned on abundant crypto fundamental knowledge and popular projects.
+ - [2024.10.10] Released [Theia-Llama-3.1-8B-v1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1).
+
+ # Theia-Llama-3.1-8B
+
+ **Theia-Llama-3.1-8B is an open-source crypto LLM, trained on a carefully designed dataset from the crypto field.**
+
+ ## Technical Implementation
+
+ ### Crypto-Oriented Dataset
+
+ The training dataset is curated from two primary sources to create a comprehensive representation of blockchain
+ projects. The first source is data collected from **CoinMarketCap**, focusing on the top **2000 projects** ranked by
+ market capitalization. This includes a wide range of project-specific documents such as whitepapers, official blog
+ posts, and news articles. The second core component of the dataset comprises detailed research reports on these
+ projects, gathered from various credible sources on the internet, providing in-depth insights into project
+ fundamentals, development progress, and market impact. After constructing the dataset, both manual and algorithmic
+ filtering are applied to ensure data accuracy and eliminate redundancy.
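+
+ The filtering pipeline itself is not published here. Purely as an illustration of the algorithmic deduplication step, an exact-duplicate pass over plain-text documents (the `docs` list below is hypothetical) could look like this:
+
+ ```python
+ import hashlib
+
+ def dedupe(docs):
+     """Drop exact duplicates after normalizing case and whitespace (illustrative only)."""
+     seen, unique = set(), []
+     for text in docs:
+         key = hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
+         if key not in seen:
+             seen.add(key)
+             unique.append(text)
+     return unique
+
+ docs = ["Example whitepaper text.", "example   whitepaper text.", "A research report."]
+ print(len(dedupe(docs)))  # -> 2
+ ```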
+
+ ### Model Fine-tuning and Quantization
+
+ Theia-Llama-3.1-8B is fine-tuned from the base model (Llama-3.1-8B) and specifically tailored for the cryptocurrency
+ domain. We employed LoRA (Low-Rank Adaptation) to fine-tune the model effectively, leveraging its ability to adapt
+ large pre-trained models to specific tasks with a smaller computational footprint. Our training methodology is further
+ enhanced through the use of LLaMA Factory, an open-source training framework. We integrate **DeepSpeed**, Microsoft's
+ distributed training engine, to optimize resource utilization and training efficiency. Techniques such as ZeRO (Zero
+ Redundancy Optimizer), offloading, sparse attention, 1-bit Adam, and pipeline parallelism are employed to accelerate
+ training and reduce memory consumption. Chainbase Labs has also built a fine-tuned model using
+ [D-DoRA](https://docs.chainbase.com/theia/Developers/Glossary/D2ORA), a novel decentralized training scheme. Since the
+ LoRA version is much easier for developers to deploy and experiment with, we are releasing it first for the Crypto AI
+ community.
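+
+ The exact LoRA recipe (rank, scaling, target modules, learning rate) is not published in this card. As a minimal sketch of the general approach, assuming the Hugging Face `transformers` and `peft` libraries and illustrative hyperparameters, attaching a LoRA adapter to the base model looks roughly like this:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import LoraConfig, get_peft_model
+
+ base_id = "meta-llama/Llama-3.1-8B-Instruct"  # presumed base checkpoint
+ tokenizer = AutoTokenizer.from_pretrained(base_id)
+ model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
+
+ lora_config = LoraConfig(
+     r=16,                 # rank of the low-rank update (assumed value)
+     lora_alpha=32,        # scaling factor (assumed value)
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()  # only the small adapter matrices are trainable
+ ```
+
+ During actual training this adapter would be optimized with LLaMA Factory on the crypto corpus, with DeepSpeed ZeRO handling optimizer-state sharding and offloading.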
+
+ In addition to fine-tuning, we have quantized the model into the GGUF format to optimize it for efficient deployment.
+ Model quantization is a process that reduces the precision of the model's weights from floating point (typically FP16
+ or FP32) to lower-bit representations. The primary benefit of quantization is that it significantly reduces the
+ model's memory footprint and improves inference speed while maintaining an acceptable level of accuracy. This makes
+ the model more accessible for use in resource-constrained environments, such as on edge devices or lower-tier GPUs.
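+
+ The resulting GGUF weights can be served by any llama.cpp-compatible runtime. As a usage sketch (the file name below is hypothetical), a 4-bit export could be loaded with the `llama-cpp-python` bindings:
+
+ ```python
+ from llama_cpp import Llama
+
+ # Hypothetical file name; point this at whichever GGUF export you produced or downloaded.
+ llm = Llama(model_path="theia-llama-3.1-8b.Q4_K_M.gguf", n_ctx=4096)
+
+ out = llm(
+     "What problem does a decentralized data network like Chainbase solve?",
+     max_tokens=256,
+     temperature=0.0,  # deterministic output
+ )
+ print(out["choices"][0]["text"])
+ ```
+
+ Lower-bit variants trade a small amount of accuracy for a much smaller memory footprint, which is usually the right trade-off on edge devices.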
+
+ ## Benchmark
+
+ To evaluate current LLMs in the crypto domain, we have proposed a benchmark for evaluating crypto AI models, the first
+ AI model benchmark tailored specifically to the crypto domain. Models are evaluated across seven dimensions, including
+ crypto knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper will
+ follow to elaborate on this benchmark. Here we initially release results for crypto-domain understanding and
+ generation capabilities on 11 open-source and closed-source LLMs from OpenAI, Anthropic, Google, Meta, Mistral, Qwen,
+ and DeepSeek. For the open-source LLMs, we choose models with a parameter size similar to ours (~8B). For the
+ closed-source LLMs, we choose the popular models with the most end users.
+
+ | Model                     | Perplexity ↓ | BERTScore ↑ |
+ |---------------------------|--------------|-------------|
+ | **Theia-Llama-3.1-8B-v1** | **1.184**    | **0.861**   |
+ | ChatGPT-4o                | 1.256        | 0.837       |
+ | ChatGPT-4o-mini           | 1.257        | 0.794       |
+ | ChatGPT-3.5-turbo         | 1.233        | 0.838       |
+ | Claude-3-sonnet (~70b)    | N.A.         | 0.848       |
+ | Gemini-1.5-Pro            | N.A.         | 0.830       |
+ | Gemini-1.5-Flash          | N.A.         | 0.828       |
+ | Llama-3.1-8B-Instruct     | 1.270        | 0.835       |
+ | Mistral-7B-Instruct-v0.3  | 1.258        | 0.844       |
+ | Qwen2.5-7B-Instruct       | 1.392        | 0.832       |
+ | Gemma-2-9b                | 1.248        | 0.832       |
+ | Deepseek-llm-7b-chat      | 1.348        | 0.846       |
+
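+ For reference only, and not necessarily the benchmark's exact protocol, the two reported metrics (perplexity and BERTScore) can be computed along these lines on a toy candidate/reference pair:
+
+ ```python
+ import torch
+ from bert_score import score as bert_score
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # BERTScore: semantic overlap between a model answer and a reference answer.
+ P, R, F1 = bert_score(["candidate answer text"], ["reference answer text"], lang="en")
+ print("BERTScore F1:", F1.mean().item())
+
+ # Perplexity: exp of the mean negative log-likelihood the model assigns to reference text.
+ model_id = "Chainbase-Labs/Theia-Llama-3.1-8B-v1"
+ tok = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+ ids = tok("reference answer text", return_tensors="pt").input_ids
+ with torch.no_grad():
+     loss = model(ids, labels=ids).loss
+ print("Perplexity:", torch.exp(loss).item())
+ ```
+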
+ ## System Prompt
+
+ The system prompt used for training this model is:
+
+ ```
+ You are a helpful assistant who will answer crypto related questions.
+ ```
+
+ ## Chat Format
+
+ The model uses the standard Llama 3.1 chat format. Here is an example:
+
+ ```
+ <|begin_of_text|><|start_header_id|>system<|end_header_id|>
+
+ Cutting Knowledge Date: December 2023
+ Today Date: 29 September 2024
+
+ You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
+
+ What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+ ```
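+
+ In practice this template does not need to be written by hand; with the `transformers` tokenizer, the same structure is produced from a message list (the checkpoint ID below refers to the v1 release):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("Chainbase-Labs/Theia-Llama-3.1-8B-v1")
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant who will answer crypto related questions."},
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ # Renders the <|start_header_id|>/<|eot_id|> structure shown above and appends the assistant header.
+ prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ print(prompt)
+ ```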
+
+ ## Tips for Performance
+
+ As a starting point, we recommend the following inference parameters:
+
+ ```
+ sequence length = 256
+ temperature = 0
+ top-k-sampling = -1
+ top-p = 1
+ context window = 39680
+ ```
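+
+ With `transformers`, these settings map roughly onto the generation call below (a sketch only: `temperature = 0` is treated as greedy decoding, and `sequence length` is read as the maximum number of new tokens):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Chainbase-Labs/Theia-Llama-3.1-8B-v1"  # adjust to the checkpoint you use
+ tok = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant who will answer crypto related questions."},
+     {"role": "user", "content": "Explain what a blockchain rollup is."},
+ ]
+ inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
+
+ # temperature = 0 -> deterministic (greedy) decoding, so top-k / top-p do not apply.
+ output = model.generate(inputs, max_new_tokens=256, do_sample=False)
+ print(tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
+ ```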