Sorry, it's no longer available on Hugging Face. Please reach out to those who have already downloaded it. If you have a copy, please refrain from re-uploading it to Hugging Face.

Due to repeated conflicts with HF and what we perceive as their repeated misuse of the "Contributor Covenant Code of Conduct," we have lost confidence in the platform and decided to temporarily suspend all new download access requests. It appears to us that HF's original intention has been abandoned in pursuit of commercialization, and they no longer prioritize the well-being of the community.

Demo:

35b-beta-long

This release, CausalLM/35b-beta-long, represents the culmination of our experience and accumulated training data in fine-tuning large language models. We are open-sourcing these weights to foster development within the open-source community.

We chose Cohere's multilingual, 35B-parameter, long-context MHA model [CohereForAI/c4ai-command-r-v01] as our base. In our evaluation, it proved the most responsive to the quality of training data throughout the Supervised Fine-Tuning process, outperforming other open-source LLMs. Although its initial SFT/RL focuses on specific tasks and it carries a non-commercial license, we believe it is currently the best foundation for personal and internal use cases.

Utilizing extensive factual content from web crawls, we synthesized over 30 million multi-turn dialogue entries, each grounded in multiple web pages or documents. This process involved substantial human oversight and a data pipeline designed to ensure high quality. The model was then trained on this data at the full 128K context length using BF16 precision. We also incorporated widely used open-source dialogue datasets to enhance general conversational fluency.

Our data synthesis approach addressed crucial limitations in typical LLM training corpora. LLMs often struggle to extract thematic summaries, key information, or perform comparisons at the paragraph or document level. Therefore, we focused on generating fact-based data using multiple documents within a long context setting. This involved leveraging existing SOTA LLMs with human guidance to synthesize information through thematic summarization, information extraction, and comparison of source materials.
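The synthesis tasks described above (thematic summarization, information extraction, and cross-document comparison) can be pictured as prompt construction over multiple source documents. The sketch below is illustrative only; the function name, task labels, and templates are hypothetical and not part of the actual pipeline:

```python
# Hypothetical sketch: building a multi-document synthesis prompt
# for one of the three task types mentioned above.
def build_synthesis_prompt(documents, task):
    templates = {
        "summarize": "Write a thematic summary covering all documents below.",
        "extract": "Extract the key facts stated in the documents below.",
        "compare": "Compare and contrast the claims made across the documents below.",
    }
    parts = [templates[task]]
    # Number each source so the target output can cite it explicitly.
    for i, doc in enumerate(documents, start=1):
        parts.append(f"\n[Document {i}]\n{doc}")
    return "\n".join(parts)
```

A SOTA LLM would then answer such prompts under human guidance, and both the source documents and the synthesized answers would enter the training mix.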

This approach yielded significant improvements in model performance during fine-tuning. We observed reductions in hallucinations, enhanced long-context capabilities, and improvements in general abilities such as math, coding, and knowledge recall. The training process incorporated both the original source material and the synthesized outputs, further reinforcing the model's ability to recall and utilize abstract concepts embedded within the pre-training data. Our analysis revealed that this combination of original and synthesized data was crucial for achieving a more balanced performance profile. Intermediate checkpoints and models trained solely on synthesized data are also released for research purposes.

Compared to the original task-specific model, our further fine-tuned model demonstrates more robust recall in long-context scenarios without requiring specific document formatting or prompt engineering. This fine-tuned model also exhibits performance comparable to models twice its size in quantifiable benchmarks.

As this model has only undergone SFT, it may still exhibit biases or generate undesirable content. We implemented basic safety measures using open-source refusal datasets to mitigate outputs related to illegal activities, NSFW content, and violence. However, further Reinforcement Learning is necessary for robust alignment with human values.

Please note

The tokenizer differs from Cohere's, and the chat template is ChatML.
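For reference, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers. A minimal sketch of the formatting (written by hand here for clarity; in practice the tokenizer's built-in chat template should be used):

```python
# Minimal ChatML formatter: each message becomes
# <|im_start|>{role}\n{content}<|im_end|>, and the prompt ends with
# an open assistant turn for the model to complete.
def to_chatml(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the documents above."},
])
```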

Pressure Testing from: https://github.com/LeonEricsson/llmcontext
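The linked pressure test is a needle-in-a-haystack style evaluation: a known fact is buried at varying depths inside filler context, and the model is asked to retrieve it. A minimal sketch of how such a context might be constructed (function and parameter names are hypothetical, not from the linked repository):

```python
# Hypothetical sketch: place a "needle" sentence at a relative depth
# (0.0 = start, 1.0 = end) inside repeated filler text.
def make_haystack(needle, filler_sentence, n_fillers, depth):
    fillers = [filler_sentence] * n_fillers
    pos = int(depth * len(fillers))
    fillers.insert(pos, needle)
    return " ".join(fillers)

context = make_haystack(
    needle="The secret number is 42.",
    filler_sentence="The grass is green and the sky is blue.",
    n_fillers=100,
    depth=0.5,
)
```

Sweeping `depth` and the total context length maps out where retrieval degrades across the 128K window.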
