A little guide to building Large Language Models in 2024
Resources mentioned by @thomwolf in https://x.com/Thom_Wolf/status/1773340316835131757
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Note: check out their chat space: https://huggingface.co/spaces/01-ai/Yi-34B-Chat
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Paper • 2402.00159 • Published • 59
Note: check out the OLMo suite: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Paper • 2306.01116 • Published • 31
Note: check out datatrove: https://github.com/huggingface/datatrove (freeing data processing from scripting madness by providing a set of platform-agnostic, customizable pipeline processing blocks)
Bag of Tricks for Efficient Text Classification
Paper • 1607.01759 • Published
Note: read more: https://fasttext.cc/
Breadth-First Pipeline Parallelism
Paper • 2211.05953 • Published
Note: check out https://github.com/huggingface/nanotron (minimalistic large language model 3D-parallelism training)
Reducing Activation Recomputation in Large Transformer Models
Paper • 2205.05198 • Published
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Paper • 2203.03466 • Published • 1
Note: from the creators of Grok (https://huggingface.co/xai-org/grok-1)
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
Paper • 2304.03208 • Published • 1
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper • 2312.00752 • Published • 138
Note: check out the transformers-compatible Mamba models: https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper • 2305.18290 • Published • 48
Note: check out https://huggingface.co/docs/trl (train transformer language models with reinforcement learning)
Zephyr Gemma Chat
Note: check out https://github.com/huggingface/alignment-handbook (robust recipes to align language models with human and AI preferences)
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Paper • 2402.14740 • Published • 11
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Paper • 2210.17323 • Published • 8
Note: read more: https://huggingface.co/blog/gptq-integration (Making LLMs lighter with AutoGPTQ and transformers)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Paper • 2208.07339 • Published • 4
Note: read more: https://huggingface.co/docs/bitsandbytes (accessible large language models via k-bit quantization for PyTorch)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 54
Qwen1.5 MoE A2.7B Chat Demo
Open LLM Leaderboard 2
Track, rank and evaluate open LLMs and chatbots
Note: check out lighteval: https://github.com/huggingface/lighteval (a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron)