Introducing our first standalone model: FluentlyLM Prinum
Introducing the first standalone model from Project Fluently LM! We worked on it for several months, tried different approaches, and eventually found the optimal one.
General characteristics:
- Model type: Causal language model (QwenForCausalLM, Transformer LM)
- Number of parameters: 32.5B
- Number of parameters (non-embedding): 31.0B
- Number of layers: 64
- Context length: 131,072 tokens
- Languages (NLP): English, French, Spanish, Russian, Chinese, Japanese, Persian (officially supported)
- License: MIT
Creation strategy: The basis of the strategy is shown in Pic. 2. We used Axolotl and Unsloth for SFT fine-tuning with PEFT LoRA (rank=64, alpha=64), and Mergekit for SLERP and TIES merges.
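For intuition on the SLERP step, here is a minimal pure-Python sketch of spherical linear interpolation between two flattened weight tensors. This is an illustration of the general technique, not Mergekit's actual implementation, and the toy vectors are made up for the example.

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two weight vectors.

    Interpolates along the arc between the two vectors' directions;
    falls back to plain linear interpolation when they are nearly
    parallel (where the spherical formula is numerically unstable).
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))  # clamp against rounding error
    theta = math.acos(dot)
    if theta < 1e-6:  # nearly parallel: lerp is safer
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Example: merge two toy "weight tensors" halfway between the models.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

In practice Mergekit applies this per-tensor across two full checkpoints, with `t` configurable per layer; the sketch only shows the core interpolation.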
Ultraset: an all-in-one dataset for SFT training in Alpaca format (fluently-sets/ultraset).
Ultraset is a comprehensive dataset for training Large Language Models (LLMs) with SFT (supervised fine-tuning). It consists of over 785,000 entries in eight languages: English, Russian, French, Italian, Spanish, German, Chinese, and Korean.
Ultraset addresses the difficulty users face when selecting an appropriate dataset for LLM training. It combines the various types of data needed to strengthen a model's skills in areas such as text writing and editing, mathematics, coding, biology, medicine, finance, and multilingualism.
For effective use of the dataset, we recommend using only the "instruction," "input," and "output" columns and training the model for 1-3 epochs. The dataset does not include DPO or Instruct data, making it suitable for training various types of LLMs.
Ultraset is an excellent tool to improve your language model's skills in diverse knowledge areas.
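To show how the three recommended columns are consumed, here is a sketch that renders one Alpaca-format record into a single training string. The template text is the commonly used Alpaca prompt; the sample record is invented for illustration and is not taken from Ultraset itself.

```python
def format_alpaca(record):
    """Render one Alpaca-format record into a single training string.

    Only the "instruction", "input", and "output" columns are used,
    as the dataset card recommends; "input" may be empty.
    """
    if record.get("input"):
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            "### Response:\n"
        )
    return prompt + record["output"]

# Hypothetical record, for illustration only.
example = format_alpaca(
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
)
```

A trainer like Axolotl applies an equivalent template automatically when told the dataset is in Alpaca format; this sketch just makes the mapping explicit.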
A new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with its thoughts visible), can solve complex problems at Flash speeds, and more.
JoseRFJunior/TransNAR
https://github.com/JoseRFJuniorLLMs/TransNAR
https://arxiv.org/html/2406.09308v1
TransNAR hybrid architecture. Similar to Alayrac et al., we interleave existing Transformer layers with gated cross-attention layers that enable information to flow from the NAR to the Transformer. We generate queries from tokens, while we obtain keys and values from the nodes and edges of the graph. The node and edge embeddings are obtained by running the NAR on the graph version of the reasoning task to be solved. When experimenting with pre-trained Transformers, we initially close the cross-attention gate in order to fully preserve the language model's internal knowledge at the beginning of training.
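The gating idea above can be sketched in a few lines: a single-head cross-attention from token queries to NAR node embeddings, with a residual update scaled by tanh of a learnable scalar gate. This is a minimal illustration of the mechanism, not the TransNAR code; dimensions and values are toy assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gated_cross_attention(token_queries, node_keys, node_values, gate):
    """Single-head cross-attention from token queries to NAR node embeddings.

    token_queries: one query vector per token (the Transformer stream).
    node_keys / node_values: vectors for the graph's nodes (and edges)
    produced by the NAR. `gate` is a scalar passed through tanh;
    initializing it to 0.0 closes the gate, so the Transformer stream is
    untouched at the start of training, as the post describes.
    """
    d = len(node_keys[0])
    outputs = []
    for q in token_queries:
        # Scaled dot-product scores against every node key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in node_keys]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, node_values))
                    for i in range(len(node_values[0]))]
        # Residual update scaled by the (initially closed) gate.
        outputs.append([qi + math.tanh(gate) * ai
                        for qi, ai in zip(q, attended)])
    return outputs

# Toy example: two token queries, two graph nodes.
tokens = [[1.0, 0.0], [0.0, 1.0]]
nodes_k = [[1.0, 1.0], [0.5, -0.5]]
nodes_v = [[2.0, 0.0], [0.0, 2.0]]
closed = gated_cross_attention(tokens, nodes_k, nodes_v, gate=0.0)  # identity
opened = gated_cross_attention(tokens, nodes_k, nodes_v, gate=1.0)
```

With the gate at zero the layer is an identity on the token stream, which is exactly why closing it preserves a pre-trained model's knowledge early in training.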