A collection of pre-training corpora refined by ProX
	AI & ML interests
NLP Research
Organization Card
GAIR-ProX, a subsidiary of GAIR, spearheads the 🫐 ProX Project. This initiative aims to enhance pre-training efficiency by refining corpus documents with language models at scale. Through meticulous operations (e.g., document-level filtering and chunk-level cleaning) implemented as scalable, executable programs, 🫐 ProX improves pre-training data quality, ultimately yielding more robust and efficient language models.
Read our technical report!
Models (14)

gair-prox/web-chunk-refining-lm • Text Generation • 0.4B
gair-prox/math-chunk-refining-lm • Text Generation • 0.4B
gair-prox/math-doc-refining-lm • Text Generation • 0.8B
gair-prox/web-doc-refining-lm • Text Generation • 0.4B
gair-prox/RedPJ-ProX-1.7B • 2B
gair-prox/RedPJ-ProX-0.3B • 0.4B
gair-prox/C4-ProX-1.7B • 2B
gair-prox/CodeLlama-7B-ProXMath
gair-prox/TinyLlama-1.1B-ProXMath • 1B
gair-prox/Llama-2-7B-ProXMath • Text Generation