	---
	library_name: transformers
	license: apache-2.0
	datasets:
	- crumb/askmistral-pile-2-15
	language:
	- en
	---

	# Model Card for nano-mistral

	<!-- Provide a quick summary of what the model is/does. -->



	## Model Details

	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

	- Developed by: crumb
	- Model type: Mistral
	- Language(s) (NLP): en
	- License: apache-2.0

	## Uses

	General web-text completion at extremely low resource use.

	### Out-of-Scope Use

	This is not an instruct model; it is not trained to follow instructions or hold conversations.

	## Bias, Risks, and Limitations

	Trained on web text. The data was filtered, but there is no guarantee that it is free of toxic content.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the pretrained weights and the matching tokenizer from the Hub.
	model = AutoModelForCausalLM.from_pretrained("crumb/nano-mistral")
	tokenizer = AutoTokenizer.from_pretrained("crumb/nano-mistral")

	# Tokenize the prompt and move the tensors to the model's device.
	inputs = tokenizer(["Once upon a time,"], return_tensors="pt")
	inputs = {k: v.to(model.device) for k, v in inputs.items()}

	# Sample a continuation; the tokenized inputs are passed as keyword arguments.
	outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7, top_k=20, do_sample=True)
	for text in tokenizer.batch_decode(outputs):
	    print(text)
	```

	## Training Details

	### Training Data

	[crumb/askmistral-pile-2-15](https://huggingface.co/datasets/crumb/askmistral-pile-2-15)

	### Training Procedure

	| Parameter | Value |
	| --- | --- |
	| Context Length | 2048 |
	| Batch Size | 128 |
	| Learning Rate | 6e-4 |
	| Scheduler | One-Cycle |
	| Adam eps | 1e-8 |
	| Adam beta1 | 0.9 |
	| Adam beta2 | 0.95 |
	| Weight Decay | 0.1 |
	| Max Grad Norm | 1.0 |
	| Optimizer | adamw_torch |
	| Tokens | 3,401,640,960 |
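To make the schedule concrete, here is a minimal, standard-library sketch of a one-cycle learning-rate curve built from the table values. It assumes the batch size is counted in sequences, and it borrows the warmup fraction (0.3) and the start and final divisors (25 and 1e4) from the defaults of PyTorch's `OneCycleLR`. The card does not state these, so treat them as placeholders rather than the settings actually used.

```python
import math

# Values taken from the hyperparameter table.
CONTEXT_LENGTH = 2048
BATCH_SIZE = 128              # assumption: counted in sequences, not tokens
PEAK_LR = 6e-4
TOTAL_TOKENS = 3_401_640_960

# Assumptions, not stated in this card (PyTorch OneCycleLR defaults).
WARMUP_FRACTION = 0.3
START_LR = PEAK_LR / 25
FINAL_LR = START_LR / 1e4

tokens_per_step = CONTEXT_LENGTH * BATCH_SIZE
total_steps = TOTAL_TOKENS // tokens_per_step  # number of full optimizer steps


def cosine_interp(start: float, end: float, fraction: float) -> float:
    """Move from start to end along a half cosine as fraction goes 0 -> 1."""
    return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2


def one_cycle_lr(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step: warm up, then anneal."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        return cosine_interp(START_LR, PEAK_LR, step / warmup_steps)
    decay_steps = total_steps - 1 - warmup_steps
    return cosine_interp(PEAK_LR, FINAL_LR, (step - warmup_steps) / decay_steps)


print(tokens_per_step, total_steps)
print(one_cycle_lr(0), one_cycle_lr(total_steps - 1))
```

Under these assumptions each optimizer step consumes 262,144 tokens, so the reported token count corresponds to roughly 13 thousand steps.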

	#### Preprocessing [optional]

	[More Information Needed]


	#### Training Hyperparameters

	- Training regime: bf16 non-mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

	#### Speeds, Sizes, Times [optional]

	- train_runtime: 62541.9424 (seconds, about 17.4 hours)
	- train_samples_per_second: 26.557
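The two logged figures can be cross-checked against the rest of the card. The sketch below assumes that `train_runtime` is in seconds (as it is in the Transformers `Trainer`), that one sample is one full 2048-token sequence, and that both A6000 GPUs ran for the whole job.

```python
TRAIN_RUNTIME_S = 62541.9424     # train_runtime, assumed to be seconds
SAMPLES_PER_SECOND = 26.557      # train_samples_per_second
CONTEXT_LENGTH = 2048            # from the training table
NUM_GPUS = 2                     # from the Hardware section (2xA6000)
REPORTED_TOKENS = 3_401_640_960  # from the training table

wall_clock_hours = TRAIN_RUNTIME_S / 3600
gpu_hours = wall_clock_hours * NUM_GPUS
tokens_per_second = SAMPLES_PER_SECOND * CONTEXT_LENGTH
approx_tokens = tokens_per_second * TRAIN_RUNTIME_S

print(f"wall clock: {wall_clock_hours:.2f} h, GPU-hours: {gpu_hours:.2f}")
print(f"throughput: {tokens_per_second:,.0f} tokens/s")
print(f"tokens seen: {approx_tokens:,.0f} (reported {REPORTED_TOKENS:,})")
```

Under these assumptions the GPU-hours land within rounding of the 34.74 hours listed under Environmental Impact, and the implied token count is within a small fraction of a percent of the reported total.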

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	A held-out split of [crumb/askmistral-pile-2-15](https://huggingface.co/datasets/crumb/askmistral-pile-2-15).

	#### Factors

	<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

	[More Information Needed]

	#### Metrics

	<!-- These are the evaluation metrics being used, ideally with a description of why. -->

	The Open LLM Leaderboard evaluation datasets, run with the leaderboard's settings.

	### Results

	Open LLM Leaderboard mean score ± stderr: 29.30 ± 0.42

	| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
	|---|---:|---|---:|---|---:|---|---:|
	| arc_challenge | 1 | none | 25 | acc | 0.1843 | ± | 0.0113 |
	| | | none | 25 | acc_norm | 0.2167 | ± | 0.0120 |
	| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4719 | ± | 0.0156 |
	| winogrande | 1 | none | 5 | acc | 0.517 | ± | 0.014 |
	| hellaswag | 1 | none | 10 | acc | 0.2803 | ± | 0.0045 |
	| | | none | 10 | acc_norm | 0.2886 | ± | 0.0045 |
	| gsm8k | 3 | strict-match | 5 | exact_match | 0.0008 | ± | 0.0008 |
	| | | flexible-extract | 5 | exact_match | 0.0099 | ± | 0.0027 |
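The card does not say how the mean score and its standard error were aggregated. One combination that matches the reported 29.30 and 0.42 is sketched below: `acc_norm` for ARC and HellaSwag, `acc` for the others, the flexible-extract GSM8K figure, the MMLU summary value from the MMLU section, and errors combined as if the six tasks were independent. The selection of metrics and the independence assumption are guesses, not something the card confirms.

```python
import math

# (value, stderr) pairs copied from the tables in this card.
scores = {
    "arc_challenge (acc_norm)": (0.2167, 0.0120),
    "hellaswag (acc_norm)": (0.2886, 0.0045),
    "mmlu (acc)": (0.253980701754386, 0.004428598058450528),
    "truthfulqa_mc2 (acc)": (0.4719, 0.0156),
    "winogrande (acc)": (0.517, 0.014),
    "gsm8k (flexible-extract)": (0.0099, 0.0027),
}

n_tasks = len(scores)

# Plain average of the six task scores, expressed in percent.
mean_score = 100 * sum(value for value, _ in scores.values()) / n_tasks

# Standard error of that average, assuming independent per-task errors.
mean_stderr = 100 * math.sqrt(sum(se ** 2 for _, se in scores.values())) / n_tasks

print(f"{mean_score:.2f} +/- {mean_stderr:.2f}")
```

Swapping in the strict-match GSM8K value instead lowers the mean by about 0.15 points, so the reported figure is more consistent with flexible-extract.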

	#### MMLU

	Unweighted mean accuracy over the 57 subjects: 0.2540 ± 0.0044 (unrounded: 0.253980701754386 ± 0.004428598058450528)

	| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
	|---|---:|---|---:|---|---:|---|---:|
	| world_religions | 0 | none | 5 | acc | 0.2222 | ± | 0.0319 |
	| virology | 0 | none | 5 | acc | 0.2711 | ± | 0.0346 |
	| us_foreign_policy | 0 | none | 5 | acc | 0.3300 | ± | 0.0473 |
	| sociology | 0 | none | 5 | acc | 0.2388 | ± | 0.0301 |
	| security_studies | 0 | none | 5 | acc | 0.2367 | ± | 0.0272 |
	| public_relations | 0 | none | 5 | acc | 0.2273 | ± | 0.0401 |
	| professional_psychology | 0 | none | 5 | acc | 0.2484 | ± | 0.0175 |
	| professional_medicine | 0 | none | 5 | acc | 0.4596 | ± | 0.0303 |
	| professional_law | 0 | none | 5 | acc | 0.2464 | ± | 0.0110 |
	| professional_accounting | 0 | none | 5 | acc | 0.2021 | ± | 0.0240 |
	| prehistory | 0 | none | 5 | acc | 0.2130 | ± | 0.0228 |
	| philosophy | 0 | none | 5 | acc | 0.2219 | ± | 0.0236 |
	| nutrition | 0 | none | 5 | acc | 0.2157 | ± | 0.0236 |
	| moral_scenarios | 0 | none | 5 | acc | 0.2380 | ± | 0.0142 |
	| moral_disputes | 0 | none | 5 | acc | 0.2486 | ± | 0.0233 |
	| miscellaneous | 0 | none | 5 | acc | 0.2516 | ± | 0.0155 |
	| medical_genetics | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
	| marketing | 0 | none | 5 | acc | 0.2265 | ± | 0.0274 |
	| management | 0 | none | 5 | acc | 0.1748 | ± | 0.0376 |
	| machine_learning | 0 | none | 5 | acc | 0.3125 | ± | 0.0440 |
	| logical_fallacies | 0 | none | 5 | acc | 0.2393 | ± | 0.0335 |
	| jurisprudence | 0 | none | 5 | acc | 0.2315 | ± | 0.0408 |
	| international_law | 0 | none | 5 | acc | 0.3140 | ± | 0.0424 |
	| human_sexuality | 0 | none | 5 | acc | 0.2519 | ± | 0.0381 |
	| human_aging | 0 | none | 5 | acc | 0.3049 | ± | 0.0309 |
	| high_school_world_history | 0 | none | 5 | acc | 0.2658 | ± | 0.0288 |
	| high_school_us_history | 0 | none | 5 | acc | 0.2451 | ± | 0.0302 |
	| high_school_statistics | 0 | none | 5 | acc | 0.4722 | ± | 0.0340 |
	| high_school_psychology | 0 | none | 5 | acc | 0.1963 | ± | 0.0170 |
	| high_school_physics | 0 | none | 5 | acc | 0.3046 | ± | 0.0376 |
	| high_school_microeconomics | 0 | none | 5 | acc | 0.2773 | ± | 0.0291 |
	| high_school_mathematics | 0 | none | 5 | acc | 0.2667 | ± | 0.0270 |
	| high_school_macroeconomics | 0 | none | 5 | acc | 0.2667 | ± | 0.0224 |
	| high_school_government_and_politics | 0 | none | 5 | acc | 0.2591 | ± | 0.0316 |
	| high_school_geography | 0 | none | 5 | acc | 0.2424 | ± | 0.0305 |
	| high_school_european_history | 0 | none | 5 | acc | 0.2242 | ± | 0.0326 |
	| high_school_computer_science | 0 | none | 5 | acc | 0.2800 | ± | 0.0451 |
	| high_school_chemistry | 0 | none | 5 | acc | 0.2857 | ± | 0.0318 |
	| high_school_biology | 0 | none | 5 | acc | 0.3129 | ± | 0.0264 |
	| global_facts | 0 | none | 5 | acc | 0.1500 | ± | 0.0359 |
	| formal_logic | 0 | none | 5 | acc | 0.1905 | ± | 0.0351 |
	| elementary_mathematics | 0 | none | 5 | acc | 0.2513 | ± | 0.0223 |
	| electrical_engineering | 0 | none | 5 | acc | 0.2759 | ± | 0.0372 |
	| econometrics | 0 | none | 5 | acc | 0.2456 | ± | 0.0405 |
	| conceptual_physics | 0 | none | 5 | acc | 0.2638 | ± | 0.0288 |
	| computer_security | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
	| college_physics | 0 | none | 5 | acc | 0.2549 | ± | 0.0434 |
	| college_medicine | 0 | none | 5 | acc | 0.2023 | ± | 0.0306 |
	| college_mathematics | 0 | none | 5 | acc | 0.2900 | ± | 0.0456 |
	| college_computer_science | 0 | none | 5 | acc | 0.2700 | ± | 0.0446 |
	| college_chemistry | 0 | none | 5 | acc | 0.2500 | ± | 0.0435 |
	| college_biology | 0 | none | 5 | acc | 0.2222 | ± | 0.0348 |
	| clinical_knowledge | 0 | none | 5 | acc | 0.2377 | ± | 0.0262 |
	| business_ethics | 0 | none | 5 | acc | 0.2100 | ± | 0.0409 |
	| astronomy | 0 | none | 5 | acc | 0.1776 | ± | 0.0311 |
	| anatomy | 0 | none | 5 | acc | 0.2593 | ± | 0.0379 |
	| abstract_algebra | 0 | none | 5 | acc | 0.2200 | ± | 0.0416 |
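The MMLU summary value at the top of this section is consistent with a plain, unweighted average of the 57 per-subject accuracies (each subject counts equally, regardless of how many questions it has). A minimal check, with the values copied from the table in row order:

```python
# Per-subject MMLU accuracies, in the same order as the table rows.
subject_acc = [
    0.2222, 0.2711, 0.3300, 0.2388, 0.2367, 0.2273, 0.2484, 0.4596,
    0.2464, 0.2021, 0.2130, 0.2219, 0.2157, 0.2380, 0.2486, 0.2516,
    0.3000, 0.2265, 0.1748, 0.3125, 0.2393, 0.2315, 0.3140, 0.2519,
    0.3049, 0.2658, 0.2451, 0.4722, 0.1963, 0.3046, 0.2773, 0.2667,
    0.2667, 0.2591, 0.2424, 0.2242, 0.2800, 0.2857, 0.3129, 0.1500,
    0.1905, 0.2513, 0.2759, 0.2456, 0.2638, 0.1800, 0.2549, 0.2023,
    0.2900, 0.2700, 0.2500, 0.2222, 0.2377, 0.2100, 0.1776, 0.2593,
    0.2200,
]

# Unweighted (macro) average over subjects.
mmlu_mean = sum(subject_acc) / len(subject_acc)

print(len(subject_acc), round(mmlu_mean, 4))
```

For reference, a uniform random guess on four-option questions scores 0.25, so the average sits close to chance.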

	#### Summary

	## Model Examination [optional]

	It's OK.

	## Environmental Impact

	<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: A6000
	- Hours used: 34.74 (GPU-hours, summed over both GPUs)
	- Cloud Provider: n/a
	- Compute Region: Iowa
	- Carbon Emitted: 4.5 kg CO2eq
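Estimates of this kind are typically the product of power draw, running time, and the carbon intensity of the local grid. The sketch below works backwards from the figures above to the grid intensity they imply. It assumes the hours are GPU-hours and that each A6000 draws its rated 300 W for the whole run; the card states neither, so the result is only a rough sanity check.

```python
GPU_HOURS = 34.74          # "Hours used" above, assumed to be GPU-hours
REPORTED_KG_CO2EQ = 4.5    # "Carbon Emitted" above
GPU_POWER_KW = 0.300       # assumption: 300 W board power at full load

# Energy drawn by the GPUs over the run.
energy_kwh = GPU_HOURS * GPU_POWER_KW

# Grid carbon intensity implied by the reported emissions (kg CO2eq per kWh).
implied_intensity = REPORTED_KG_CO2EQ / energy_kwh

print(f"energy: {energy_kwh:.2f} kWh")
print(f"implied intensity: {implied_intensity:.3f} kg CO2eq/kWh")
```

The estimate ignores CPU, memory, and cooling overhead, so real energy use would be somewhat higher than the GPU-only figure.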

	## Technical Specifications [optional]

	### Model Architecture and Objective

	Mistral architecture, trained with a causal language modeling objective.

	### Compute Infrastructure

	A single local workstation; details below.

	#### Hardware

	Lambda Vector workstation with 2x NVIDIA RTX A6000 GPUs

	#### Software

	Hugging Face Transformers, PyTorch, and a custom trainer

	## Citation [optional]

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

	BibTeX:

	[More Information Needed]

	APA:

	[More Information Needed]

	## Glossary [optional]

	<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

	[More Information Needed]

	## More Information [optional]

	[More Information Needed]

	## Model Card Authors [optional]

	[More Information Needed]

	## Model Card Contact

	[More Information Needed]