## Model compatibility
- [ ] Verify compatibility with Llama-2 34B once released
## GPU compatibility (etc.)
- [ ] Optimizations for ROCm
- [ ] Optimizations for RTX 20-series (maybe)
- [ ] Look into improving P40 performance
## Testing
- [ ] More testing on Llama 2 models
## Optimization
- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel? see the sketch after this list)
- [ ] C++ implementations of sampler functions
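
For context on the attention items above: `repeat_kv` exists because grouped-query models share each K/V head across several query heads, and the current path duplicates the K/V tensors so standard attention can run. A minimal sketch of the pattern with illustrative shapes (the real `ExLlamaAttention` implementation differs):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, seq_len, n_kv_heads, head_dim)
    # -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
    if n_rep == 1:
        return x
    b, s, h, d = x.shape
    # expand() is a free view, but reshape() materializes n_rep copies; that
    # extra memory traffic is what a GQA-aware attention kernel would avoid.
    return x[:, :, :, None, :].expand(b, s, h, n_rep, d).reshape(b, s, h * n_rep, d)

# Hypothetical alternative (ties into the Flash Attention item above): newer
# flash-attn 2 releases accept K/V with fewer heads than Q directly, e.g.
#   from flash_attn import flash_attn_func
#   out = flash_attn_func(q, k, v, causal=True)  # k/v may have n_kv_heads < n_heads
# which would make repeat_kv unnecessary on supported GPUs.
```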
## Generation
- [ ] Optimized/batched beam search
- [ ] Allow stackable LoRAs (see the sketch after this list)
- [ ] Guidance or equivalent
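
On stackable LoRAs: the goal is to apply several low-rank deltas to the same base weight at once, i.e. the base projection plus a scaled `(x A^T) B^T` term per adapter. A minimal sketch of that arithmetic, with hypothetical names and the usual LoRA shapes (not ExLlama's actual API):

```python
import torch

def stacked_lora_forward(x, w, loras):
    # x: (..., in_features), w: (out_features, in_features)
    # loras: list of (lora_a, lora_b, scale) with
    #   lora_a: (r, in_features), lora_b: (out_features, r)
    # Each adapter contributes scale * (x @ A^T) @ B^T on top of the base
    # projection, so multiple adapters stack additively.
    y = x @ w.T
    for lora_a, lora_b, scale in loras:
        y = y + scale * ((x @ lora_a.T) @ lora_b.T)
    return y
```

The per-adapter `scale` would typically follow the usual LoRA convention of `alpha / r`.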
## Interface
- [ ] Comprehensive API server (more than `example_flask.py`)
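
As a sense of direction, a hedged sketch of an endpoint that takes sampling settings per request rather than hard-coding them. The route name and `generate()` helper are illustrative, not an existing ExLlama API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate(prompt: str, max_new_tokens: int, temperature: float) -> str:
    # Placeholder: a real server would configure the ExLlama sampler and call
    # the generator here; this stub just echoes so the sketch runs standalone.
    return f"(temp={temperature}, max={max_new_tokens}) {prompt}"

@app.route("/infer", methods=["POST"])
def infer():
    # Accept per-request sampling parameters with sane defaults.
    body = request.get_json(force=True)
    text = generate(
        body["prompt"],
        max_new_tokens=int(body.get("max_new_tokens", 128)),
        temperature=float(body.get("temperature", 0.7)),
    )
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```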
## Web UI
- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Make it a little prettier
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)
## ??
- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training (see the sketch below)
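
On soft prompt training: the usual approach keeps the model frozen and learns a small matrix of "virtual token" embeddings prepended to the real input embeddings, which is also why backpropagation support (above) is a prerequisite. A minimal sketch, independent of ExLlama's internals:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # n_tokens trainable virtual-token embeddings; only this parameter is
    # optimized, while the base model's weights stay frozen.
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.shape[0], -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```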