stereoplegic's Collections: Tokenizer
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper
• 2310.05737
• Published
• 6
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Paper
• 2308.16692
• Published
• 1
Towards General Text Embeddings with Multi-stage Contrastive Learning
Paper
• 2308.03281
• Published
• 3
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
Paper
• 2305.11554
• Published
• 2
Diversifying Joint Vision-Language Tokenization Learning
Paper
• 2306.03421
• Published
• 2
Joint Adaptive Representations for Image-Language Learning
Paper
• 2305.19924
• Published
• 1
Tokenizer Choice For LLM Training: Negligible or Crucial?
Paper
• 2310.08754
• Published
• 3
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Paper
• 2311.04589
• Published
• 21
Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning
Paper
• 2309.08708
• Published
• 3
Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
Paper
• 2311.04335
• Published
• 1
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Paper
• 2304.08953
• Published
• 2
Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
Paper
• 2306.01393
• Published
• 1
Tokenization with Factorized Subword Encoding
Paper
• 2306.07764
• Published
• 1
DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling
Paper
• 1911.12385
• Published
• 1
Parameter-Efficient Tuning with Special Token Adaptation
Paper
• 2210.04382
• Published
• 2
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
Paper
• 2305.14571
• Published
• 1
Nomic Embed: Training a Reproducible Long Context Text Embedder
Paper
• 2402.01613
• Published
• 15
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Paper
• 2310.11628
• Published
Word-Level Representation From Bytes For Language Modeling
Paper
• 2211.12677
• Published
Multi-Word Tokenization for Sequence Compression
Paper
• 2402.09949
• Published
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Paper
• 2305.17179
• Published
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Paper
• 2311.08849
• Published
• 6
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Paper
• 2103.06874
• Published
• 2
Zero-Shot Tokenizer Transfer
Paper
• 2405.07883
• Published
• 5
Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
Paper
• 2403.00417
• Published
• 3
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Paper
• 2402.14903
• Published
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Paper
• 2407.08818
• Published