Update README.md
For more comprehensive information, please visit our blog post and GitHub repository.

https://github.com/shootime2021/APUS-xDAN-4.0-moe
# Model Details

APUS-xDAN-4.0-MOE leverages the innovative Mixture of Experts (MoE) architecture, incorporating components from dense language models. Specifically, it inherits its capabilities from the highly performant xDAN-L2 Series. With a total of 136 billion parameters, of which 30 billion are activated during runtime, APUS-xDAN-4.0-MOE demonstrates unparalleled efficiency.

Through advanced quantization techniques, our open-source version occupies a mere 42 GB, making it compatible with consumer-grade GPUs such as the RTX 4090 and RTX 3090.
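As a rough sanity check, the 42 GB figure is consistent with the 134B total parameter count listed below if the quantization averages roughly 2.5 bits per weight, an assumed figure that sits inside the 1.5-bit to 4-bit range mentioned in the specifications:

```python
# Back-of-envelope size estimate (illustrative; ~2.5 bits/weight is an assumption).
total_params = 134e9           # total parameter count from the spec list below
bits_per_weight = 2.5          # assumed average effective bit-width after quantization
size_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")    # prints "~42 GB"
```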
It has the following specifications:

- **Parameters:** 134B
- **Architecture:** Mixture of 4 Experts (MoE)
- **Experts Utilization:** 2 experts used per token (see the sketch after this list)
- **Layers:** 60
- **Attention Heads:** 56 for queries, 8 for keys/values
- **Embedding Size:** 7,168
- **Additional Features:**
  - Rotary embeddings (RoPE)
  - Supports activation sharding and 1.5-bit to 4-bit quantization
- **Maximum Sequence Length (context):** 32,768 tokens
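To make the expert routing concrete, here is a minimal sketch of a top-2-of-4 MoE feed-forward layer in PyTorch. The hidden size, activation function, and module names are illustrative assumptions, not the actual APUS-xDAN-4.0-MOE implementation.

```python
# Minimal sketch of a top-2-of-4 MoE feed-forward layer (illustrative only).
# The router scores all 4 experts per token, keeps the 2 best, and mixes
# their outputs with softmax-normalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2of4MoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores = self.gate(x)                   # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Tiny smoke test with toy sizes (the real model uses a 7,168-dim embedding).
moe = Top2of4MoE(d_model=64, d_ff=256)
print(moe(torch.randn(8, 64)).shape)            # torch.Size([8, 64])
```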
## Usage
### Initial Setup

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
```
### Interactive Chat

```bash
./main -m xDAN-L2-moe-4x34b-v4-0326.IQ3_S.gguf \
    --prompt "You are a helpful assistant." --chatml \
    --interactive \
    --temp 0.7 \
    --ctx-size 4096
```
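Beyond the interactive CLI, the same GGUF file can be loaded from Python via the third-party llama-cpp-python bindings. The sketch below is illustrative and assumes the package is installed with CUDA support and the model file sits in the working directory; it is not an officially documented path for this model.

```python
# Illustrative only: loading the quantized GGUF with llama-cpp-python.
# Assumes `pip install llama-cpp-python` (built with CUDA) and the file path below.
from llama_cpp import Llama

llm = Llama(
    model_path="xDAN-L2-moe-4x34b-v4-0326.IQ3_S.gguf",
    n_ctx=4096,        # context window, matching --ctx-size above
    n_gpu_layers=-1,   # offload all layers to the GPU if memory allows
)

reply = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi, xDAN-APUS4.0, nice to meet you!"},
    ],
    temperature=0.7,
)
print(reply["choices"][0]["message"]["content"])
```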
## License