MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
Key insights:
🏗️ MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers with traditional softmax attention every 8 layers (see the layer-stacking sketch after this list)
📊 Outperforms leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M-token contexts (vs 256K for most other LLMs)
🔬 Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding, and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
🎯 Thorough training strategy:
‣ Careful data curation and quality control, using a smaller preliminary version of their LLM as a judge! (a minimal judge-filter sketch is also included below)
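To make the hybrid layout concrete, here is a minimal PyTorch-style sketch of the general idea rather than MiniMax's actual implementation: 7 linear-attention layers followed by 1 full softmax-attention layer, repeated, with a mixture-of-experts feed-forward block in every layer. The class names (LinearAttention, SoftmaxAttention, MoEFFN, HybridBlock), the toy top-1 routing, and the dimensions are all illustrative assumptions.

```python
# Sketch (assumption, not MiniMax's code): interleave linear-complexity attention
# with full softmax attention every 8th layer, and use an MoE FFN in every layer.
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Stand-in for Lightning attention: O(n) attention via a kernel feature map."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(-1), k.softmax(-2)      # simple kernel feature map
        kv = torch.einsum("bnd,bne->bde", k, v)  # (dim, dim) summary -> O(n) cost
        return self.out(torch.einsum("bnd,bde->bne", q, kv))

class SoftmaxAttention(nn.Module):
    """Standard full attention: O(n^2), but exact token-to-token retrieval."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

class MoEFFN(nn.Module):
    """Toy top-1 mixture-of-experts feed-forward block."""
    def __init__(self, dim, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        scores = self.router(x).softmax(-1)      # (batch, seq, n_experts)
        top = scores.argmax(-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask])      # only the routed tokens hit expert i
        return out

class HybridBlock(nn.Module):
    def __init__(self, dim, use_softmax):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SoftmaxAttention(dim) if use_softmax else LinearAttention(dim)
        self.ffn = MoEFFN(dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))

# 7 linear-attention layers, then 1 softmax-attention layer, repeated.
layers = nn.ModuleList(
    HybridBlock(dim=512, use_softmax=((i + 1) % 8 == 0)) for i in range(16)
)
x = torch.randn(2, 128, 512)
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([2, 128, 512])
```

Roughly, the linear layers keep per-layer cost from growing quadratically with context length, while the periodic softmax layers retain full pairwise attention.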
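And on the LLM-as-judge data curation, here is what such a quality filter typically looks like, as a generic hedged sketch rather than MiniMax's actual pipeline: the rubric, the JSON reply format, the threshold, and the call_judge_model placeholder are all illustrative assumptions.

```python
# Sketch (assumption): generic LLM-as-judge quality filter for pretraining data.
# `call_judge_model` is a placeholder for however the smaller judge LLM is served;
# the rubric and threshold are illustrative, not MiniMax's actual criteria.
import json
from typing import Callable, Iterable

RUBRIC = (
    "Rate the following document for usefulness as LLM training data.\n"
    "Consider coherence, factuality, and information density.\n"
    'Reply with JSON only: {{"score": <1-5>, "reason": "<short reason>"}}\n\n'
    "Document:\n{doc}"
)

def filter_corpus(
    docs: Iterable[str],
    call_judge_model: Callable[[str], str],  # prompt -> raw model reply
    min_score: int = 4,
) -> list[str]:
    """Keep only documents the judge model scores at or above `min_score`."""
    kept = []
    for doc in docs:
        reply = call_judge_model(RUBRIC.format(doc=doc[:4000]))  # truncate long docs
        try:
            score = int(json.loads(reply)["score"])
        except (json.JSONDecodeError, KeyError, ValueError):
            continue  # unparsable judgment -> drop the document
        if score >= min_score:
            kept.append(doc)
    return kept

# Usage with a dummy judge that accepts everything:
print(filter_corpus(["example document"], lambda p: '{"score": 5, "reason": "ok"}'))
```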
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights, including a great comparison showing how a 2B-activated MoE (24B total parameters) far outperforms a 7B dense model for the same amount of FLOPs.
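For a rough sense of why that MoE-vs-dense comparison matters, here is a back-of-the-envelope calculation using the common ~6 × active parameters × tokens approximation for training FLOPs; the token counts are illustrative and not taken from the paper.

```python
# Back-of-the-envelope (assumption: the standard ~6 * params * tokens FLOPs rule).
# Numbers are illustrative, not taken from the MiniMax-01 paper.
def training_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens

budget = training_flops(active_params=7e9, tokens=1e12)  # dense 7B trained on 1T tokens

# An MoE activating ~2B params per token can train on ~3.5x more tokens for the
# same compute budget (its 24B total params mostly sit idle for any given token).
moe_tokens = budget / (6 * 2e9)
print(f"{moe_tokens / 1e12:.1f}T tokens for the 2B-activated MoE")  # -> 3.5T tokens
```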
Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here; commercial use allowed under 100M monthly users 👉 MiniMaxAI/MiniMax-Text-01