---
datasets:
- teknium/openhermes
language:
- en
---

# GZIP Embeddings with Normalized Text

It's so funny that the huggingface hub lets you do this

| model | parameters | embedding dimensions |
| --- | --- | --- |
| meta-llama/Llama-2-70b-hf | 70b | 8192 |
| crumb/gzip-openhermes | 1\* | 242,831 |

\*the huggingface pretrained model saving api requires at least one parameter, which is set to "1" in this model.

multiprocessing is suuuper weird, so make sure you don't have variables named `p` or `calculate_ncd_row` anywhere in your code.

### Usage

```python
# Requirements
%pip install -qq transformers

# Download Model
from transformers import AutoModel
model = AutoModel.from_pretrained("crumb/gzip-openhermes", trust_remote_code=True)

# Prune model
model.config.update({
    "corpus": model.config.corpus[:1024]
})
model.dimensionality() # 1024

# Inference
model(["this is a test sequence"], num_procs=16).shape # [1, 1024]

# Finetuning
from tqdm.auto import tqdm

new_data = ["i love GZIP! it is my favorite!", "i HATE transformers!"]
normalized_data = [
    model.normalize(i) for i in tqdm(new_data)
]
print(f"Input: '{new_data[0]}'\nTransformed: '{normalized_data[0]}'")

model.config.update({
    "corpus": model.config.corpus + normalized_data
})
model.dimensionality()
model.save_pretrained("my-finetuned-gzip-model")
```

config:

```
normalize = True,
normalized_corpus = True,
reduction = False,
reduced_dimension = 0,
remove_stop_words = True,
stop_words = stopwords.words('english'),
corpus = [],
# openhermes instructions + outputs
# i think having [instructions, outputs, instructions+outputs] would be better
# but it's literally 3x slower, also i don't care
```
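
For intuition, here is a minimal sketch of how a gzip embedding like this could be computed. It assumes each embedding dimension is the normalized compression distance (NCD) between the input and one corpus entry, which is what the `calculate_ncd_row` name hints at. The helper names below (`compressed_len`, `ncd`, `gzip_embed`) are made up for illustration and are not the model's actual code; text normalization, stop-word removal, and multiprocessing are left out.

```python
import gzip


def compressed_len(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 text.
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    # Standard NCD formula: (C(ab) - min(C(a), C(b))) / max(C(a), C(b)).
    # Joining with a space is an assumption made for this sketch.
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def gzip_embed(text: str, corpus: list[str]) -> list[float]:
    # One dimension per corpus entry, so the embedding size is len(corpus).
    return [ncd(text, doc) for doc in corpus]


# Toy corpus, purely illustrative.
corpus = [
    "i love gzip",
    "transformers are large neural networks",
    "compression distance test",
]

vec = gzip_embed("i love gzip", corpus)
print(len(vec))  # same as len(corpus)

# "Pruning" in this picture is just truncating the corpus.
pruned_vec = gzip_embed("i love gzip", corpus[:2])
print(len(pruned_vec))  # same as len(corpus[:2])
```

Under this reading, the 242,831 embedding dimensions in the table are simply the number of corpus entries, and appending new texts to the corpus ("finetuning") adds one dimension per text.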