|
## Model compatibility |
|
|
|
- [ ] Verify compatibility with Llama 2 34B once it is released
|
|
|
## GPU and platform compatibility
|
|
|
- [ ] Optimizations for ROCm |
|
- [ ] Optimizations for RTX 20-series (tentative)
|
- [ ] Look into improving P40 performance (see the dispatch sketch below)
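
All three items above come down to selecting different kernel paths per platform and GPU generation. Below is a minimal sketch of what that dispatch could look like, using standard PyTorch introspection (`torch.version.hip`, `torch.cuda.get_device_capability`); the tuning flags themselves are hypothetical placeholders, not settings the project defines.

```python
import torch

def pick_kernel_config(device: int = 0) -> dict:
    """Choose (hypothetical) kernel tuning flags per platform / GPU generation."""
    # ROCm builds of PyTorch report a HIP version instead of a CUDA one.
    if torch.version.hip is not None:
        return {"backend": "rocm", "use_half2": False}

    major, minor = torch.cuda.get_device_capability(device)
    if (major, minor) == (6, 1):      # Pascal, e.g. the P40: weak FP16 throughput
        return {"backend": "cuda", "use_half2": False, "fp32_accum": True}
    if (major, minor) == (7, 5):      # Turing, i.e. the RTX 20-series
        return {"backend": "cuda", "use_half2": True}
    return {"backend": "cuda", "use_half2": True}   # Ampere and newer
```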
|
|
|
## Testing |
|
|
|
- [ ] More testing on Llama 2 models (e.g. the perplexity check sketched below)
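
One concrete, model-agnostic regression check for new Llama 2 variants is perplexity over a held-out text: perplexity is just the exponential of the mean token negative log-likelihood. A minimal sketch (the function name is illustrative, not part of any existing test suite):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (seq, vocab) predictions for positions 1..n,
    # targets: (seq,) the actual token ids at those positions.
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return float(torch.exp(nll))
```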
|
|
|
## Optimization |
|
|
|
- [ ] Flash Attention 2.0 (?) |
|
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv`, possibly with a custom attention kernel (see the sketch below)
|
- [ ] C++ implementations of sampler functions (see the top-p reference below)
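
On the `repeat_kv` item: in grouped-query attention the K/V tensors have fewer heads than Q, and tiling them up to match wastes memory and bandwidth. One way to avoid the explicit copy at the Python level is to reshape Q so each shared K/V head broadcasts over its group of query heads; a sketch under that assumption (whether the backend still materializes the broadcast depends on the matmul implementation, which is why a custom kernel may ultimately be needed):

```python
import torch

def attn_scores_naive(q, k, n_rep):
    # q: (bsz, n_heads, seq, dim), k: (bsz, n_kv_heads, seq, dim)
    k = k.repeat_interleave(n_rep, dim=1)          # materializes n_rep copies of K
    return q @ k.transpose(-1, -2)                 # (bsz, n_heads, seq, seq)

def attn_scores_broadcast(q, k, n_rep):
    # Group query heads over their shared K/V head instead of copying K.
    bsz, n_heads, seq, dim = q.shape
    q = q.view(bsz, n_heads // n_rep, n_rep, seq, dim)
    scores = q @ k.unsqueeze(2).transpose(-1, -2)  # K broadcasts over n_rep
    return scores.reshape(bsz, n_heads, seq, seq)
```

For the sampler item, a plain-Python reference for the kind of hot path a C++ port would replace, here top-p (nucleus) sampling:

```python
import torch

def sample_top_p(logits: torch.Tensor, top_p: float, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p   # smallest prefix reaching top_p mass
    sorted_probs = sorted_probs * keep
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return int(sorted_idx[choice])
```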
|
|
|
## Generation |
|
|
|
- [ ] Optimized/batched beam search |
|
- [ ] Allow stackable LoRAs (see the sketch below)
|
- [ ] Support for Guidance or an equivalent constrained-generation framework
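
On stackable LoRAs: since each adapter contributes a low-rank additive delta, stacking reduces to summing the deltas on top of the base projection. A minimal sketch of that formulation (the function and argument names are illustrative, not existing APIs):

```python
import torch

def linear_with_loras(x: torch.Tensor, weight: torch.Tensor, loras) -> torch.Tensor:
    # weight: (out, in); loras: iterable of (lora_a, lora_b, scale) with
    # lora_a: (rank, in) and lora_b: (out, rank). Each adapter adds
    # scale * (x @ A^T @ B^T), so multiple adapters simply sum.
    out = x @ weight.T
    for lora_a, lora_b, scale in loras:
        out = out + scale * ((x @ lora_a.T) @ lora_b.T)
    return out
```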
|
|
|
## Interface |
|
|
|
- [ ] Comprehensive API server (more than `example_flask.py`; see the endpoint sketch below)
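
As a starting point, a sketch of what one endpoint of such a server might look like, staying with Flask since `example_flask.py` already uses it; `generate` here is a placeholder for the actual model call, not an existing function. A comprehensive version would also cover streaming, stop conditions, and session state.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt: str, max_new_tokens: int, temperature: float) -> str:
    raise NotImplementedError  # placeholder: wire up the actual generator here

@app.route("/v1/generate", methods=["POST"])
def generate_endpoint():
    body = request.get_json(force=True)
    prompt = body.get("prompt", "")
    if not prompt:
        return jsonify({"error": "missing prompt"}), 400
    text = generate(prompt,
                    max_new_tokens=int(body.get("max_new_tokens", 128)),
                    temperature=float(body.get("temperature", 0.8)))
    return jsonify({"text": text})
```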
|
|
|
## Web UI |
|
|
|
- [ ] Controls to enable beam search |
|
- [ ] Rewrite/refactor all the JavaScript and CSS |
|
- [ ] Make it a little prettier |
|
- [ ] Better error handling |
|
- [ ] LoRA controls |
|
- [ ] Multiple chat modes with prompt templates (instruct, etc.; see the template sketch below)
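
For the chat-modes item, the per-mode data could be as simple as one template string per mode. The Llama-2-chat and Alpaca formats below are the published ones; the mode names and helper function are illustrative:

```python
PROMPT_MODES = {
    "raw": "{prompt}",
    # Format published with the Llama-2-chat models.
    "llama2_chat": "[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{prompt} [/INST]",
    # Classic Alpaca instruct format.
    "alpaca": ("Below is an instruction that describes a task. "
               "Write a response that appropriately completes the request.\n\n"
               "### Instruction:\n{prompt}\n\n### Response:\n"),
}

def build_prompt(mode: str, prompt: str,
                 system: str = "You are a helpful assistant.") -> str:
    return PROMPT_MODES[mode].format(prompt=prompt, system=system)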
|
|
|
## Miscellaneous
|
|
|
- [ ] Support for other quantization methods |
|
- [ ] Support for other LLM architectures |
|
- [ ] Allow for backpropagation |
|
- [ ] LoRA training features |
|
- [ ] Soft prompt training (see the sketch below)
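
On the soft prompt item: in the standard prompt-tuning formulation, a handful of learned "virtual token" embeddings are prepended to the input embeddings, and only those embeddings receive gradients, which is also why the backpropagation item above is a prerequisite. A minimal sketch of that formulation (not ExLlama code):

```python
import torch

class SoftPrompt(torch.nn.Module):
    def __init__(self, n_virtual: int, d_model: int):
        super().__init__()
        # The only trainable parameters: n_virtual learned embeddings.
        self.virtual = torch.nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (bsz, seq, d_model) -> (bsz, n_virtual + seq, d_model)
        prefix = self.virtual.unsqueeze(0).expand(token_embeds.shape[0], -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)
```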