license: llama3.1
base_model: Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
Changelog
- [2024.10.30] Released Theia-Llama-3.1-8B-v1.1, with supervised fine-tuning on a broad set of crypto fundamentals and popular projects.
- [2024.10.10] Released Theia-Llama-3.1-8B-v1
Theia-Llama-3.1-8B
Theia-Llama-3.1-8B is an open-source crypto LLM, trained on a carefully designed dataset from the crypto field.
Technical Implementation
Crypto-Oriented Dataset
The training dataset is curated from two primary sources to create a comprehensive representation of blockchain projects. The first source is data collected from CoinMarketCap, focusing on the top 2000 projects ranked by market capitalization. This includes a wide range of project-specific documents such as whitepapers, official blog posts, and news articles. The second core component of the dataset comprises detailed research reports on these projects gathered from various credible sources on the internet, providing in-depth insights into project fundamentals, development progress, and market impact. After constructing the dataset, both manual and algorithmic filtering are applied to ensure data accuracy and eliminate redundancy.
Model Fine-tuning and Quantization
Theia-Llama-3.1-8B is fine-tuned from the base model (Llama-3.1-8B) and tailored specifically to the cryptocurrency domain. We used LoRA (Low-Rank Adaptation) for fine-tuning because it adapts a large pre-trained model to a specific task with a much smaller computational footprint than full fine-tuning. Training runs on LLaMA Factory, an open-source training framework, with DeepSpeed, Microsoft's distributed training engine, integrated to improve resource utilization and training efficiency. Techniques such as ZeRO (Zero Redundancy Optimizer), offload, sparse attention, 1-bit Adam, and pipeline parallelism are employed to accelerate training and reduce memory consumption. Chainbase Labs has also built a fine-tuned model using the novel D-DoRA, a decentralized training scheme. Because the LoRA version is much easier for developers to deploy and experiment with, we are releasing it first for the Crypto AI community.
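To make the low-rank idea behind LoRA concrete, here is a minimal pure-Python sketch of the update W' = W + (alpha / r) · B · A, along with a count of how few parameters the adapter trains. The matrix sizes, the `alpha` and `r` values, and the helper names are illustrative assumptions, not the configuration used to train Theia.

```python
# Illustrative LoRA update: W_adapted = W + (alpha / r) * (B @ A).
# All shapes and values are hypothetical and chosen only for demonstration.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_merge(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying W."""
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most r
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Adapter parameters (A: r x d_in, B: d_out x r) vs. the full matrix W."""
    return r * (d_in + d_out), d_out * d_in

# Frozen 4x4 base weight (identity) and a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.5, 0.0, 0.0, 0.0]]        # r x d_in
B = [[1.0], [0.0], [0.0], [0.0]]  # d_out x r
W_adapted = lora_merge(W, A, B, alpha=2, r=1)

# For a hypothetical 4096 x 4096 projection with rank 8:
lora_count, full_count = trainable_params(4096, 4096, r=8)
print(lora_count, full_count)
```

In this toy setting the adapter holds 65,536 trainable values against 16,777,216 in the full matrix, which is where the smaller computational footprint comes from.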
In addition to fine-tuning, we have quantized the model into the GGUF format for efficient deployment. Quantization reduces the precision of the model's weights from floating-point (typically FP16 or FP32) to lower-bit representations. Its primary benefit is a significantly smaller memory footprint and faster inference at an acceptable cost in accuracy. This makes the model more accessible in resource-constrained environments, such as edge devices or lower-tier GPUs.
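The precision trade-off described above can be illustrated with a simplified symmetric 8-bit quantizer. This is only a sketch of the general principle: GGUF quantization types use their own block-wise layouts, and the function names and sample weights below are hypothetical.

```python
# Simplified symmetric int8 quantization with one shared scale.
# This is NOT the GGUF on-disk scheme; it only shows the basic idea.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

Each weight is stored as one small integer plus a shared scale, so storage shrinks while the reconstruction error stays within half a quantization step.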
Benchmark
To evaluate current LLMs in the crypto domain, we have proposed a benchmark for Crypto AI models, the first AI model benchmark tailored specifically to this domain. Models are evaluated across seven dimensions, including crypto knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper will follow to elaborate on this benchmark. Here we release initial results on understanding and generation capabilities in the crypto domain for 11 open-source and closed-source LLMs from OpenAI, Google, Meta, Qwen, and DeepSeek. For the open-source LLMs, we chose models with a parameter size similar to ours (~8B). For the closed-source LLMs, we chose the popular models with the most end users.
Model | Perplexity ↓ | BERT ↑ |
---|---|---|
Theia-Llama-3.1-8B-v1 | 1.184 | 0.861 |
ChatGPT-4o | 1.256 | 0.837 |
ChatGPT-4o-mini | 1.257 | 0.794 |
ChatGPT-3.5-turbo | 1.233 | 0.838 |
Claude-3-sonnet (~70b) | N.A. | 0.848 |
Gemini-1.5-Pro | N.A. | 0.830 |
Gemini-1.5-Flash | N.A. | 0.828 |
Llama-3.1-8B-Instruct | 1.270 | 0.835 |
Mistral-7B-Instruct-v0.3 | 1.258 | 0.844 |
Qwen2.5-7B-Instruct | 1.392 | 0.832 |
Gemma-2-9b | 1.248 | 0.832 |
Deepseek-llm-7b-chat | 1.348 | 0.846 |
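
For readers unfamiliar with the Perplexity column, the sketch below shows the standard definition: the exponential of the mean negative log-likelihood per token, where lower is better and 1.0 is the minimum. The evaluation protocol behind the table is not described here, so the function name and the sample probabilities are purely illustrative assumptions.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigns to each token of a reference answer.
probs = [0.9, 0.8, 0.85, 0.95]
ppl = perplexity([math.log(p) for p in probs])
print(round(ppl, 3))
```

A model that assigns probability 0.5 to every token has a perplexity of exactly 2, so values close to 1 indicate that the model finds the reference text highly predictable.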
System Prompt
The system prompt used for training this model is:
You are a helpful assistant who will answer crypto related questions.
Chat Format
The model uses the standard Llama 3.1 chat format. Here’s an example:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 29 September 2024
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
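The same layout can be assembled programmatically. The sketch below builds a prompt string from a system message and a user message, using the system prompt given in the System Prompt section. The helper names are hypothetical, and the two newlines after each header are an assumption based on Meta's published Llama 3.1 prompt format. In practice, a tokenizer's chat template would normally produce this string for you.

```python
# Minimal sketch of the Llama 3.1 chat layout; helper names are illustrative.
BEGIN_OF_TEXT = "<|begin_of_text|>"

def format_turn(role, content):
    """Wrap one message in the header and end-of-turn tokens."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def build_prompt(system, user):
    """Concatenate system and user turns, then open the assistant turn."""
    return (
        BEGIN_OF_TEXT
        + format_turn("system", system)
        + format_turn("user", user)
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt(
    "You are a helpful assistant who will answer crypto related questions.",
    "What is a blockchain?",  # hypothetical user question
)
print(prompt)
```

The prompt ends with an open assistant header so that the model's generation continues as the assistant's reply.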
Tips for Performance
We recommend the following parameters as a starting point:
sequence length = 256
temperature = 0
top-k-sampling = -1
top-p = 1
context window = 39680
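
To show how the three sampling settings interact, here is a small self-contained sketch of next-token selection. It assumes the common convention that `top_k = -1` disables top-k filtering and `top_p = 1` disables nucleus filtering. The function name and logits are hypothetical, and sequence length and context window are not modeled.

```python
import math
import random

def next_token(logits, temperature=0.0, top_k=-1, top_p=1.0, rng=random):
    """Pick a token index from raw logits under the given sampling settings."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature (shifted by the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:  # nucleus cut-off
            break
    probs, ids = zip(*kept)
    return rng.choices(ids, weights=probs, k=1)[0]

logits = [1.0, 3.5, 0.2, 2.9]  # hypothetical scores for four tokens
choice = next_token(logits, temperature=0, top_k=-1, top_p=1)
print(choice)
```

With the recommended values, temperature 0 makes decoding deterministic, so the top-k and top-p settings have no effect unless the temperature is raised.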