## Model compatibility

- [ ] Verify compatibility with Llama 2 34B once it is released

## GPU compatibility (etc.)

- [ ] Optimizations for ROCm
- [ ] Optimizations for RTX 20-series (tentative)
- [ ] Look into improving P40 performance

## Testing

- [ ] More testing on Llama 2 models

## Optimization

- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel? see the sketch after this list)
- [ ] C++ implementations of sampler functions
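
For context, `repeat_kv` expands the key/value heads of grouped-query attention (as in Llama 2 70B) to match the number of query heads, which costs a tensor copy per layer; a fused kernel could instead index the shared KV head in place. A minimal sketch of the standard pattern, with assumed shapes (not ExLlama's exact code):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each KV head n_rep times for grouped-query attention.

    x: (batch, seq_len, n_kv_heads, head_dim)
    Returns: (batch, seq_len, n_kv_heads * n_rep, head_dim)

    The expand/reshape materializes a copy of the KV heads; this is the
    work a custom attention kernel could avoid by reading the shared
    head directly.
    """
    if n_rep == 1:
        return x
    batch, seq_len, n_kv_heads, head_dim = x.shape
    return (
        x[:, :, :, None, :]
        .expand(batch, seq_len, n_kv_heads, n_rep, head_dim)
        .reshape(batch, seq_len, n_kv_heads * n_rep, head_dim)
    )
```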

## Generation

- [ ] Optimized/batched beam search (core selection step sketched after this list)
- [ ] Allow stackable LoRAs
- [ ] Guidance or equivalent
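
The core selection step of batched beam search reduces to a single top-k over all beam expansions. A minimal sketch under assumed shapes (illustrative, not existing ExLlama code):

```python
import torch

def beam_step(beam_scores: torch.Tensor, log_probs: torch.Tensor, beam_width: int):
    """One selection step of beam search.

    beam_scores: (beam_width,) cumulative log-probs of the live beams
    log_probs:   (beam_width, vocab_size) next-token log-probs per beam
    Returns (new_scores, parent_beam, token_id), each of shape (beam_width,).
    """
    vocab_size = log_probs.shape[-1]
    # Score every (beam, token) expansion, then keep the best beam_width overall.
    candidates = (beam_scores.unsqueeze(-1) + log_probs).reshape(-1)
    new_scores, flat = torch.topk(candidates, beam_width)
    return new_scores, flat // vocab_size, flat % vocab_size
```

Batching the expansion this way keeps selection in one `topk` call; the remaining per-step work is reordering the KV cache by the returned parent-beam indices.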

## Interface

- [ ] Comprehensive API server (more than `example_flask.py`; see the sketch below)
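
For scope, a bare-bones JSON endpoint in the spirit of `example_flask.py` might look like this; the route, field names, and `run_generator` helper are all hypothetical placeholders:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_generator(prompt: str, max_new_tokens: int) -> str:
    # Hypothetical placeholder: wire this up to the actual generator.
    return prompt

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    text = run_generator(body["prompt"], int(body.get("max_new_tokens", 128)))
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(port=5000)
```

A comprehensive server would add streaming, session and cache management, sampler settings, and error handling on top of this shape.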

## Web UI

- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Make it a little prettier
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)

## ??

- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training