Update README.md
For more comprehensive information, please visit our blog post and GitHub repository.

https://github.com/shootime2021/APUS-xDAN-4.0-moe
# Model Details

APUS-xDAN-4.0-MOE leverages the innovative Mixture of Experts (MoE) architecture, incorporating components from dense language models. Specifically, it inherits its capabilities from the highly performant xDAN-L2 Series. With a total of 136 billion parameters, of which 30 billion are activated during runtime, APUS-xDAN-4.0-MOE demonstrates unparalleled efficiency.

Through advanced quantization techniques, our open-source version occupies a mere 42 GB, making it compatible with consumer-grade GPUs such as the RTX 4090 and RTX 3090.
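As a rough sanity check, the 42 GB figure is consistent with the 134B total parameter count listed below if the quantization averages roughly 2.5 bits per weight, an assumed figure that sits inside the 1.5-bit to 4-bit range mentioned in the specifications:

```python
# Back-of-envelope size estimate (illustrative; ~2.5 bits/weight is an assumption).
total_params = 134e9           # total parameter count from the spec list below
bits_per_weight = 2.5          # assumed average effective bit-width after quantization
size_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")    # prints "~42 GB"
```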
It has the following specifications:

- **Parameters:** 134B
- **Architecture:** Mixture of 4 Experts (MoE)
- **Experts Utilization:** 2 experts used per token (see the sketch after this list)
- **Layers:** 60
- **Attention Heads:** 56 for queries, 8 for keys/values
- **Embedding Size:** 7,168
- **Additional Features:**
  - Rotary embeddings (RoPE)
  - Supports activation sharding and 1.5-bit to 4-bit quantization
- **Maximum Sequence Length (context):** 32,768 tokens
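To make the expert routing concrete, here is a minimal sketch of a top-2-of-4 MoE feed-forward layer in PyTorch. The hidden size, activation function, and module names are illustrative assumptions, not the actual APUS-xDAN-4.0-MOE implementation.

```python
# Minimal sketch of a top-2-of-4 MoE feed-forward layer (illustrative only).
# The router scores all 4 experts per token, keeps the 2 best, and mixes
# their outputs with softmax-normalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2of4MoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores = self.gate(x)                   # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Tiny smoke test with toy sizes (the real model uses a 7,168-dim embedding).
moe = Top2of4MoE(d_model=64, d_ff=256)
print(moe(torch.randn(8, 64)).shape)            # torch.Size([8, 64])
```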
## Usage
### Initial Setup

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
```
### Interactive Chat

```bash
./main -m xDAN-L2-moe-4x34b-v4-0326.IQ3_S.gguf \
    --prompt "You are a helpful assistant." --chatml \
    --interactive \
    --temp 0.7 \
    --ctx-size 4096
```
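Beyond the interactive CLI, the same GGUF file can be loaded from Python via the third-party llama-cpp-python bindings. The sketch below is illustrative and assumes the package is installed with CUDA support and the model file sits in the working directory; it is not an officially documented path for this model.

```python
# Illustrative only: loading the quantized GGUF with llama-cpp-python.
# Assumes `pip install llama-cpp-python` (built with CUDA) and the file path below.
from llama_cpp import Llama

llm = Llama(
    model_path="xDAN-L2-moe-4x34b-v4-0326.IQ3_S.gguf",
    n_ctx=4096,        # context window, matching --ctx-size above
    n_gpu_layers=-1,   # offload all layers to the GPU if memory allows
)

reply = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi, xDAN-APUS4.0, nice to meet you!"},
    ],
    temperature=0.7,
)
print(reply["choices"][0]["message"]["content"])
```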
## License