---
license: llama3.1
base_model: Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
---
# Changelog
- [2024.10.30] Released [Theia-Llama-3.1-8B-v1.1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1.1), supervised fine-tuned with abundant fundamental crypto knowledge and coverage of popular projects.
- [2024.10.10] Released [Theia-Llama-3.1-8B-v1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1).
# Theia-Llama-3.1-8B
**Theia-Llama-3.1-8B is an open-source crypto LLM, trained on a carefully designed dataset from the crypto domain.**
## Technical Implementation
### Crypto-Oriented Dataset
The training dataset is curated from two primary sources to create a comprehensive representation of blockchain
projects. The first source is data collected from **CoinMarketCap**, focusing on the top **2000 projects** ranked by
market capitalization. This includes a wide range of project-specific documents such as whitepapers, official blog
posts, and news articles. The second core component of the dataset comprises detailed research reports on these projects
gathered from various credible sources on the internet, providing in-depth insights into project fundamentals,
development progress, and market impact. After constructing the dataset, both manual and algorithmic filtering are
applied to ensure data accuracy and eliminate redundancy.
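As a hedged illustration of the algorithmic filtering step, the sketch below removes exact duplicates by hashing normalized document text. The file name and field names are assumptions for illustration, not the actual curation pipeline.
```python
# Sketch of exact deduplication by normalized-text hash.
# "crypto_corpus.jsonl" and the "text" field are hypothetical placeholders.
import hashlib
from datasets import load_dataset

raw = load_dataset("json", data_files="crypto_corpus.jsonl", split="train")

seen = set()

def is_new(example):
    # Normalize whitespace and case before hashing so trivial variants collapse together.
    digest = hashlib.sha256(" ".join(example["text"].lower().split()).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

deduped = raw.filter(is_new)
print(f"kept {len(deduped)} of {len(raw)} documents")
```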
### Model Fine-tuning and Quantization
Theia-Llama-3.1-8B is fine-tuned from the Llama-3.1-8B base model and tailored specifically to the cryptocurrency
domain. We employ LoRA (Low-Rank Adaptation) to fine-tune the model efficiently, leveraging its ability to adapt large
pre-trained models to specific tasks with a small computational footprint. Training is run with LLaMA Factory, an
open-source training framework, and we integrate **DeepSpeed**, Microsoft's distributed training engine, to optimize
resource utilization and training efficiency. Techniques such as ZeRO (Zero Redundancy Optimizer), offload, sparse
attention, 1-bit Adam, and pipeline parallelism accelerate training and reduce memory consumption. Chainbase Labs has
also built a fine-tuned model with [D-DoRA](https://docs.chainbase.com/theia/Developers/Glossary/D2ORA), a novel
decentralized training scheme. Since the LoRA version is much easier for developers to deploy and experiment with, we
release it first for the crypto AI community.
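As a minimal illustration of the LoRA setup (not the exact recipe or hyperparameters used for Theia-Llama-3.1-8B), the sketch below wraps the base model with a PEFT LoRA adapter; the rank, alpha, dropout, and target modules are illustrative assumptions.
```python
# Minimal LoRA sketch with Hugging Face PEFT. The hyperparameters are
# illustrative assumptions, not Theia's actual training configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype="bfloat16"
)

lora_config = LoraConfig(
    r=16,                  # low-rank dimension of the adapter matrices
    lora_alpha=32,         # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights train; the 8B base stays frozen
```
Training such an adapter can then be driven by LLaMA Factory or a standard `transformers` Trainer, with DeepSpeed providing the distributed optimizations described above.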
In addition to fine-tuning, we have quantized the model to optimize it for efficient deployment, specifically into the
GGUF format. Model quantization is a process that reduces the precision of the model's weights from floating-point
(typically FP16 or FP32) to lower-bit representations.
The primary benefit of quantization is that it significantly reduces the model's memory footprint and
improves inference speed while maintaining an acceptable level of accuracy. This makes the model more accessible for use
in resource-constrained environments, such as on edge devices or lower-tier GPUs.
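For example, the quantized GGUF file can be run locally with llama.cpp or its Python bindings. The sketch below uses llama-cpp-python; the `.gguf` file name and context size are assumptions, so point them at the actual release file and your hardware limits.
```python
# Minimal sketch of running a quantized GGUF build with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="theia-llama-3.1-8b.Q4_K_M.gguf",  # hypothetical file name; use the shipped .gguf
    n_ctx=4096,                                   # context window; adjust to your hardware
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant who will answer crypto related questions."},
        {"role": "user", "content": "What problem does a rollup solve?"},
    ],
    temperature=0,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```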
## Benchmark
To evaluate current LLMs in the crypto domain, we propose a benchmark for Crypto AI models, the first AI model
benchmark tailored specifically to the crypto domain. Models are evaluated across seven dimensions, including crypto
knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper will follow to
elaborate on this benchmark. Here we initially release the results of benchmarking crypto-domain understanding and
generation capabilities on 11 open-source and closed-source LLMs from OpenAI, Anthropic, Google, Meta, Mistral, Qwen,
and DeepSeek. For the open-source LLMs, we choose models with a parameter size similar to ours (~8B). For the
closed-source LLMs, we choose the popular models with the most end users.
| Model                     | Perplexity ↓ | BERTScore ↑ |
|---------------------------|--------------|-----------|
| **Theia-Llama-3.1-8B-v1** | **1.184** | **0.861** |
| ChatGPT-4o | 1.256 | 0.837 |
| ChatGPT-4o-mini | 1.257 | 0.794 |
| ChatGPT-3.5-turbo | 1.233 | 0.838 |
| Claude-3-sonnet (~70b) | N.A. | 0.848 |
| Gemini-1.5-Pro | N.A. | 0.830 |
| Gemini-1.5-Flash | N.A. | 0.828 |
| Llama-3.1-8B-Instruct | 1.270 | 0.835 |
| Mistral-7B-Instruct-v0.3 | 1.258 | 0.844 |
| Qwen2.5-7B-Instruct | 1.392 | 0.832 |
| Gemma-2-9b | 1.248 | 0.832 |
| Deepseek-llm-7b-chat | 1.348 | 0.846 |
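For reference, the two reported metrics can be reproduced in spirit with open tooling. The sketch below is not the official benchmark harness; it only illustrates how perplexity and BERTScore are typically computed, using placeholder texts.
```python
# Hedged sketch of the two metric families: perplexity on a reference text and
# BERTScore between a model answer and a reference answer. Texts are placeholders.
import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Chainbase-Labs/Theia-Llama-3.1-8B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Perplexity: exponentiated mean token cross-entropy on a reference answer.
reference = "A rollup batches transactions off-chain and posts compressed data on-chain."
ids = tokenizer(reference, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())

# BERTScore: embedding-based similarity between a generated answer and the reference.
bertscore = evaluate.load("bertscore")
score = bertscore.compute(
    predictions=["Rollups batch transactions off-chain to reduce fees."],
    references=[reference],
    lang="en",
)
print("BERTScore F1:", score["f1"][0])
```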
## System Prompt
The system prompt used for training this model is:
```
You are a helpful assistant who will answer crypto related questions.
```
## Chat Format
The model uses the standard Llama 3.1 chat format. Here's an example:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 29 September 2024
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
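Rather than constructing this string by hand, the tokenizer's chat template can produce it. A minimal sketch, assuming the repo ships the standard Llama 3.1 tokenizer configuration:
```python
# Build the Llama 3.1 chat-formatted prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Chainbase-Labs/Theia-Llama-3.1-8B-v1")

messages = [
    {"role": "system", "content": "You are a helpful assistant who will answer crypto related questions."},
    {"role": "user", "content": "What is the capital of France?"},
]

# add_generation_prompt appends the assistant header so the model starts answering.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```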
## Tips for Performance
We initially recommend the following generation parameters:
```
sequence length = 256
temperature = 0
top-k-sampling = -1
top-p = 1
context window = 39680
```
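A minimal sketch of mapping these settings onto `transformers` generation arguments (temperature 0 is expressed as greedy decoding, which makes top-k and top-p moot); the prompt content is a placeholder:
```python
# Greedy decoding with a 256-token response budget, per the recommendations above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Chainbase-Labs/Theia-Llama-3.1-8B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant who will answer crypto related questions."},
    {"role": "user", "content": "Explain proof of stake in one paragraph."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,  # sequence length = 256
    do_sample=False,     # temperature = 0 (greedy); top-k and top-p are effectively disabled
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```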