## Model compatibility
- [ ] Verify compatibility with Llama 2 34B once released
## GPU compatibility (etc.)
- [ ] Optimizations for ROCm
- [ ] Optimizations for RTX 20-series (maybe)
- [ ] Look into improving P40 performance
## Testing
- [ ] More testing on Llama 2 models
## Optimization
- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
- [ ] C++ implementations of sampler functions
## Generation
- [ ] Optimized/batched beam search
- [ ] Allow stackable LoRAs
- [ ] Guidance or equivalent
## Interface
- [ ] Comprehensive API server (more than `example_flask.py`)
## Web UI
- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Make it a little prettier
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)
## ??
- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training