Spaces:

juanma1907
/

la-llama-que-llama

Paused

Upload folder using huggingface_hub

462dacf over 1 year ago

1.07 kB

	## Model compatibility

	- [ ] Verify compatibility with Llama-2 34B once released

	## GPU compatibility (etc.)

	- [ ] Optimizations for ROCm
	- [ ] Optimizations for RTX 20-series maybe
	- [ ] Look into improving P40 performance

	## Testing

	- [ ] More testing on Llama 2 models

	## Optimization

	- [ ] Flash Attention 2.0 (?)
	- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
	- [ ] C++ implementations of sampler functions

	## Generation

	- [ ] Optimized/batched beam search
	- [ ] Allow stackable LoRAs
	- [ ] Guidance or equivalent

	## Interface

	- [ ] Comprehensive API server (more than `example_flask.py`

	## Web UI

	- [ ] Controls to enable beam search
	- [ ] Rewrite/refactor all the JavaScript and CSS
	- [ ] Make it a little prettier
	- [ ] Better error handling
	- [ ] LoRA controls
	- [ ] Multiple chat modes with prompt templates (instruct, etc.)

	## ??

	- [ ] Support for other quantization methods
	- [ ] Support for other LLM architectures
	- [ ] Allow for backpropagation
	- [ ] LoRA training features
	- [ ] Soft prompt training