A newer version of the Gradio SDK is available:
5.6.0
Model compatibility
- Verify compatibility with Llama-2 34B once released
GPU compatibility (etc.)
- Optimizations for ROCm
- Optimizations for RTX 20-series maybe
- Look into improving P40 performance
Testing
- More testing on Llama 2 models
Optimization
- Flash Attention 2.0 (?)
- Find a way to eliminate
ExLlamaAttention.repeat_kv
(custom attention kernel?) - C++ implementations of sampler functions
Generation
- Optimized/batched beam search
- Allow stackable LoRAs
- Guidance or equivalent
Interface
- Comprehensive API server (more than
example_flask.py
Web UI
- Controls to enable beam search
- Rewrite/refactor all the JavaScript and CSS
- Make it a little prettier
- Better error handling
- LoRA controls
- Multiple chat modes with prompt templates (instruct, etc.)
??
- Support for other quantization methods
- Support for other LLM architectures
- Allow for backpropagation
- LoRA training features
- Soft prompt training