
la-llama-que-llama / modules /exllama /doc /TODO.md

Model compatibility

Verify compatibility with Llama-2 34B once released

GPU compatibility (etc.)

Optimizations for ROCm
Optimizations for RTX 20-series maybe
Look into improving P40 performance

Testing

More testing on Llama 2 models

Optimization

Flash Attention 2.0 (?)
Find a way to eliminate ExLlamaAttention.repeat_kv (custom attention kernel?)
C++ implementations of sampler functions
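A rough sketch of the idea behind the repeat_kv item, not ExLlama's actual code: with grouped-query attention, the keys are usually copied so that every query head has a matching key head. The same scores can be obtained by grouping the query heads and letting broadcasting reuse each key head, which avoids materializing the copies. The function names, tensor layout, and the use of NumPy here are assumptions for illustration; a real fix would more likely live in a custom CUDA attention kernel.

```python
import numpy as np

def repeat_kv(x, n_rep):
    # Hypothetical stand-in for a repeat_kv helper.
    # x: (batch, kv_heads, seq, head_dim) -> (batch, kv_heads * n_rep, seq, head_dim)
    # Each KV head is copied n_rep times, so query head h uses KV head h // n_rep.
    return np.repeat(x, n_rep, axis=1)

def scores_with_repeat(q, k, n_rep):
    # Baseline: materialize the repeated keys, then one matmul per query head.
    k_full = repeat_kv(k, n_rep)
    return q @ k_full.transpose(0, 1, 3, 2)

def scores_grouped(q, k, n_rep):
    # Alternative: reshape the queries into (kv_heads, n_rep) groups and let
    # broadcasting share each KV head across its group. No copy of k is made.
    b, n_heads, q_len, d = q.shape
    kv_heads = k.shape[1]
    q_grouped = q.reshape(b, kv_heads, n_rep, q_len, d)
    k_t = k[:, :, None].transpose(0, 1, 2, 4, 3)  # (b, kv_heads, 1, d, kv_len)
    scores = q_grouped @ k_t                      # (b, kv_heads, n_rep, q_len, kv_len)
    return scores.reshape(b, n_heads, q_len, -1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((2, 8, 3, 4))  # 8 query heads
    k = rng.standard_normal((2, 2, 5, 4))  # 2 KV heads, so n_rep = 4
    same = np.allclose(scores_with_repeat(q, k, 4), scores_grouped(q, k, 4))
    print("grouped scores match repeated scores:", same)
```

The same grouping applies to the value projection when multiplying the attention weights by V.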

Generation

Optimized/batched beam search
Allow stackable LoRAs
Guidance or equivalent
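One possible reading of "stackable LoRAs", sketched under assumptions: each adapter contributes a scaled low-rank update, and several adapters are applied to the same linear layer by summing their updates on top of the base output. The function name, the `(A, B, scale)` tuple layout, and the use of NumPy are hypothetical and not taken from ExLlama.

```python
import numpy as np

def apply_stacked_loras(x, weight, adapters):
    # x: (batch, in_features), weight: (in_features, out_features)
    # adapters: list of (A, B, scale) with A: (in_features, r), B: (r, out_features)
    # Assumed semantics: y = x W + sum_i scale_i * (x A_i) B_i
    y = x @ weight
    for a, b, scale in adapters:
        # Keep the update low-rank: project down to r, then back up.
        y = y + scale * ((x @ a) @ b)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 16))
    w = rng.standard_normal((16, 8))
    adapters = [
        (rng.standard_normal((16, 4)), rng.standard_normal((4, 8)), 0.5),
        (rng.standard_normal((16, 2)), rng.standard_normal((2, 8)), 1.0),
    ]
    print(apply_stacked_loras(x, w, adapters).shape)
```

Applying the adapters at runtime like this, rather than merging them into the weights, keeps the base weights untouched, which matters when those weights are quantized.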

Interface

Comprehensive API server (more than example_flask.py)

Web UI

Controls to enable beam search
Rewrite/refactor all the JavaScript and CSS
Make it a little prettier
Better error handling
LoRA controls
Multiple chat modes with prompt templates (instruct, etc.)

??

Support for other quantization methods
Support for other LLM architectures
Allow for backpropagation
LoRA training features
Soft prompt training