	---
	license: mit
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- llama-2
	- astronomy
	- astrophysics
	- arxiv
	inference: false
	base_model:
	- meta-llama/Llama-2-70b-hf
	---

	# AstroLLaMA-2-70B-Base_AIC

	AstroLLaMA-2-70B-Base_AIC is a specialized base language model for astronomy, built by fine-tuning Meta's LLaMA-2-70b model on astronomical literature. Developed by the AstroMLab team, it is, to the best of our knowledge, the first specialized LLM for astronomy at the 70B-parameter scale. It is designed for next-token prediction and is not an instruct/chat model.

	## Model Details

	- Base Architecture: LLaMA-2-70b
	- Training Data: Abstract, Introduction, and Conclusion (AIC) sections from arXiv's astro-ph category papers (from arXiv's inception up to July 2023)
	- Data Processing: Training text was extracted from LaTeX source files, using regex-based methods to identify the Abstract, Introduction, and Conclusion sections.
	- Fine-tuning Method: Continual Pre-Training (CPT) using the LMFlow framework
	- Training Details:
	  - Learning rate: 2 × 10⁻⁵
	  - Total batch size: 160
	  - Maximum token length: 2048
	  - Warmup ratio: 0.03
	  - Cosine decay schedule for learning rate reduction
	  - Training duration: 1 epoch (approximately 2,000 A100 GPU hours)
	- Primary Use: Next-token prediction for astronomy-related text generation and analysis
	- Reference: Pan et al. 2024 [Link to be added]
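The learning-rate settings listed above can be sketched as a small function. This is a minimal illustration, not the LMFlow implementation: the linear warmup shape, the decay to zero, and the value of `TOTAL_STEPS` are assumptions, since the card states only the peak rate, the warmup ratio, and that the decay is cosine.

```python
import math

PEAK_LR = 2e-5        # peak learning rate, from the list above
WARMUP_RATIO = 0.03   # fraction of steps spent warming up, from the list above


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to PEAK_LR, then cosine decay to 0.
    """
    warmup_steps = max(1, round(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Hypothetical step count, for illustration only; the real value is not stated.
TOTAL_STEPS = 1000

for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")

# Upper bound on tokens per optimizer step: batch size x maximum token length.
MAX_TOKENS_PER_STEP = 160 * 2048
print(f"at most {MAX_TOKENS_PER_STEP:,} tokens per step")
```

The token count is an upper bound because sequences shorter than 2048 tokens contribute fewer tokens unless they are packed.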

	## Generating text from a prompt

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

	model_id = "AstroMLab/astrollama-2-70b-base_aic"

	# Load the tokenizer and model; device_map="auto" spreads the weights across available devices
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

	# Create the pipeline with explicit truncation.
	# The model is already placed on its devices, so device_map is not passed again.
	# max_length counts the prompt tokens plus the generated tokens.
	generator = pipeline(
	    "text-generation",
	    model=model,
	    tokenizer=tokenizer,
	    truncation=True,
	    max_length=512,
	)

	# Example prompt from an astronomy paper
	prompt = (
	    "In this letter, we report the discovery of the highest redshift, "
	    "heavily obscured, radio-loud QSO candidate selected using JWST NIRCam/MIRI, "
	    "mid-IR, sub-mm, and radio imaging in the COSMOS-Web field. "
	)

	# Set seed for reproducibility
	torch.manual_seed(42)

	# Generate text by sampling a continuation of the prompt
	generated_text = generator(prompt, do_sample=True)
	print(generated_text[0]["generated_text"])
	```
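Because the example relies on `device_map="auto"` to spread the weights over the available hardware, it helps to estimate how much memory the weights alone require. The sketch below is back-of-the-envelope arithmetic, assuming a nominal 70 billion parameters (taken from the model name, not an exact count) and ignoring activations, the KV cache, and framework overhead.

```python
# Assumption: nominal parameter count implied by the "70B" in the model name.
NOMINAL_PARAMS = 70e9

# Storage cost per parameter for common weight precisions.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision:>18}: ~{weight_memory_gb(NOMINAL_PARAMS, nbytes):.0f} GB")
```

Loading in half precision, for example by passing `torch_dtype=torch.float16` to `from_pretrained`, roughly halves the footprint relative to float32. Whether the model then fits still depends on the hardware available.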

	## Model Performance and Significance

	AstroLLaMA-2-70B-Base_AIC improves on its baseline, LLaMA-2-70B, a step forward for specialized astronomical LLMs. The table below compares scores on the astronomical Q&A benchmark described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:

	| Model | Score (%) |
	|-------|-----------|
	| <span style="color:green">AstroLLaMA-2-70B-Base (AstroMLab)</span> | <span style="color:green">76.0</span> |
	| LLaMA-2-70B | 70.7 |
	| LLaMA-3.1-8B | 73.7 |
	| Gemma-2-9B | 71.5 |
	| Qwen-2.5-7B | 70.4 |
	| Yi-1.5-9B | 68.4 |
	| InternLM-2.5-7B | 64.5 |
	| Mistral-7B-v0.3 | 63.9 |
	| ChatGLM3-6B | 50.4 |

	The 5.3-point gain over the LLaMA-2-70B baseline (76.0% vs. 70.7%) indicates that training specialized LLMs can be effective, especially at larger model scales.
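The comparison with the baseline can be made explicit with a few lines of arithmetic. The sketch below uses only the scores from the table. The "error reduction" figure is an illustrative derived quantity, not a metric reported by the authors.

```python
# Benchmark scores (%) copied from the table above.
scores = {
    "AstroLLaMA-2-70B-Base": 76.0,
    "LLaMA-2-70B": 70.7,
    "LLaMA-3.1-8B": 73.7,
    "Gemma-2-9B": 71.5,
    "Qwen-2.5-7B": 70.4,
    "Yi-1.5-9B": 68.4,
    "InternLM-2.5-7B": 64.5,
    "Mistral-7B-v0.3": 63.9,
    "ChatGLM3-6B": 50.4,
}

baseline = scores["LLaMA-2-70B"]
specialized = scores["AstroLLaMA-2-70B-Base"]

# Absolute gain in percentage points over the base model.
gain_points = specialized - baseline

# Illustrative: share of the baseline's incorrect answers that the gain removes.
error_reduction = gain_points / (100.0 - baseline)

print(f"Gain over baseline: {gain_points:.1f} points")
print(f"Relative error reduction: {100 * error_reduction:.1f}%")

# Every model's score relative to the LLaMA-2-70B baseline, best first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {score:5.1f}  ({score - baseline:+.1f})")
```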


	## Ethical Considerations

	While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.

	## Citation

	If you use this model in your research, please cite:

	```
	[Citation for Pan et al. 2024 to be added]
	```