---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- llama-2
- astronomy
- astrophysics
- arxiv
inference: false
base_model:
- meta-llama/Llama-2-70b-hf
---

# AstroLLaMA-2-70B-Base_AIC

AstroLLaMA-2-70B-Base_AIC is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-2-70b model on astronomical literature. The model was developed by the AstroMLab team and is, to the best of our knowledge, the first specialized LLM in astronomy at the 70B-parameter scale. It is a base model intended for next-token prediction; it is not an instruct/chat model.

## Model Details

- **Base Architecture**: LLaMA-2-70b
- **Training Data**: Abstract, Introduction, and Conclusion (AIC) sections from papers in arXiv's astro-ph category, from arXiv's inception up to July 2023
- **Data Processing**: The training data was derived from LaTeX source files using regex-based extraction to identify the Abstract, Introduction, and Conclusion sections (a minimal sketch of this kind of extraction follows this list)
- **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
- **Training Details**:
  - Learning rate: 2 × 10⁻⁵
  - Total batch size: 160
  - Maximum token length: 2048
  - Warmup ratio: 0.03
  - Cosine decay schedule for learning rate reduction (see the scheduler sketch below)
  - Training duration: 1 epoch (approximately 2,000 A100 GPU hours)
- **Primary Use**: Next token prediction for astronomy-related text generation and analysis
- **Reference**: Pan et al. 2024 [Link to be added]
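
The exact extraction patterns used to build the training set are not published; the snippet below is only a minimal sketch of this kind of regex-based AIC extraction, with illustrative section-name patterns.

```python
import re

def extract_aic(latex_source: str) -> dict:
    """Extract Abstract, Introduction, and Conclusion text from LaTeX source.

    Illustration only: real arXiv sources need more robust handling
    (multi-file papers, comments, custom section macros, etc.).
    """
    sections = {}

    # The abstract is typically delimited by the abstract environment
    m = re.search(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", latex_source, re.DOTALL)
    if m:
        sections["abstract"] = m.group(1).strip()

    # Introduction/Conclusion: from \section{...} to the next \section or end of document
    for name, title_pattern in [
        ("introduction", r"introduction"),
        ("conclusion", r"conclusions?|summary"),
    ]:
        m = re.search(
            rf"\\section\*?\{{(?:{title_pattern})[^}}]*\}}(.*?)(?=\\section|\\end\{{document\}}|\Z)",
            latex_source,
            re.DOTALL | re.IGNORECASE,
        )
        if m:
            sections[name] = m.group(1).strip()

    return sections
```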
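
The warmup and cosine-decay settings above map directly onto the standard scheduler in `transformers`. A minimal sketch follows; only the learning rate and warmup ratio come from the training details above, and the model and step count are placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder module; in practice this would be the 70B model being trained.
model = torch.nn.Linear(8, 8)

num_training_steps = 10_000  # placeholder; depends on dataset and batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the model card

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * num_training_steps),  # warmup ratio 0.03
    num_training_steps=num_training_steps,            # cosine decay afterwards
)
```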

## Generating text from a prompt

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Load the model and tokenizer; bfloat16 keeps the 70B weights at a manageable memory footprint
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-70b-base_aic")
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrollama-2-70b-base_aic",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Create the pipeline with explicit truncation
# (the model is already dispatched across devices, so device_map is not passed again)
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

# Example prompt from an astronomy paper
prompt = (
    "In this letter, we report the discovery of the highest redshift, "
    "heavily obscured, radio-loud QSO candidate selected using JWST NIRCam/MIRI, "
    "mid-IR, sub-mm, and radio imaging in the COSMOS-Web field. "
)

# Set seed for reproducibility
torch.manual_seed(42)

# Generate text
generated_text = generator(prompt, do_sample=True)
print(generated_text[0]["generated_text"])
```

## Model Performance and Significance

AstroLLaMA-2-70B-Base_AIC demonstrates a clear improvement over its LLaMA-2-70B baseline (76.0% vs. 73.9%), marking a meaningful step for specialized astronomical LLMs. The table below compares scores on the astronomy benchmark Q&A described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:

| Model | Score (%) |
|-------|-----------|
| **<span style="color:green">AstroLLaMA-2-70B-Base_AIC (AstroMLab)</span>** | **<span style="color:green">76.0</span>** |
| LLaMA-2-70B | 73.9 |
| LLaMA-3.1-8B | 73.7 |
| Gemma-2-9B | 71.5 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| InternLM-2.5-7B | 64.0 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |

This result suggests that training specialized LLMs can be effective, especially at larger model scales.
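
Benchmarks of this kind are commonly scored by comparing the likelihood a model assigns to each candidate answer. The snippet below is a minimal sketch of that approach for a generic causal LM; the exact protocol used in Ting et al. 2024 and Pan et al. 2024 may differ, and the question and options are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of likelihood-based multiple-choice scoring. The actual
# benchmark protocol may differ; the question and options are illustrative.
model_id = "AstroMLab/astrollama-2-70b-base_aic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
model.eval()

question = "Q: What process powers the luminosity of main-sequence stars?\nA:"
options = [" Nuclear fusion in the core.", " Gravitational contraction.", " Radioactive decay."]

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    # Assumes tokenizing the question alone yields the same prefix ids as question+option
    q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, i] predicts token i+1, so shift by one to align with targets
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = q_len - 1  # positions whose targets are the option tokens
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

best = max(options, key=lambda o: option_logprob(question, o))
print("Model's choice:", best)
```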

## Ethical Considerations

While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.

## Citation

If you use this model in your research, please cite:

```
[Citation for Pan et al. 2024 to be added]
```