ctrltokyo/llama-2-7b-hf-dolly-flash-attention

This model is a fine-tuned version of NousResearch/Llama-2-7b-hf on the databricks/databricks-dolly-15k dataset, with all training performed using Flash Attention 2.
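
As a rough illustration only (the original training script is not published and may have enabled Flash Attention 2 differently), recent versions of transformers let you load the base model with Flash Attention 2 via the attn_implementation argument; model ID, dtype, and device settings below are assumptions:

```python
# Illustrative sketch: loading the base model with Flash Attention 2 enabled.
# Requires the flash-attn package and a supported GPU; not the exact training setup.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```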

No further testing or optimisation has been performed.

Model description

Just like ctrltokyo/llm_prompt_mask_fill_model, this model could be used for live autocompletion of prompts, but it is intended primarily as a generalized chatbot (hence the use of the Dolly 15k dataset). Don't try it on code, because it won't work. I plan to release a further fine-tuned version using the code_instructions_120k dataset.
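
A minimal inference sketch, assuming this repository hosts a PEFT/LoRA adapter on top of NousResearch/Llama-2-7b-hf (prompt text, dtype, and generation settings are illustrative):

```python
# Load the base model, attach the adapter from this repo, and generate a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "NousResearch/Llama-2-7b-hf"
adapter_id = "ctrltokyo/llama-2-7b-hf-dolly-flash-attention"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```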

Intended uses & limitations

Use as intended.

Training and evaluation data

No evaluation was performed. Training was done on an NVIDIA A100; inference on the raw (unquantized) model appears to use around 20 GB of VRAM.

Training procedure

The following bitsandbytes quantization config was used during training (a code sketch reproducing it follows the list):

  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: fp4
  • bnb_4bit_use_double_quant: False
  • bnb_4bit_compute_dtype: float32
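
For reference, the same settings expressed as a transformers BitsAndBytesConfig; this simply mirrors the values listed above and is not the original training script:

```python
# Quantization settings from the list above, as a BitsAndBytesConfig object.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float32,
)
```

Such a config would typically be passed as quantization_config= to AutoModelForCausalLM.from_pretrained for QLoRA-style fine-tuning with PEFT.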

Framework versions

  • PEFT 0.4.0
