---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---

### Model Description

Machine learning models like [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain. This model was trained with *dynamic sapient technology*: it is a SentencePiece unigram model trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset, and it compresses these bitarrays much better than RLE.

- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0

### Demo

Example bitarray (384-bit):

```
000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
```

Compressed (208-bit): `1ab2ed09d7a9617206894e0608` (45.83% space-saving efficiency)

The notebook: [384_bit_comp.ipynb](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb)

```py
import numpy as np
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string and map each piece to its vocabulary id,
    # shifted by -3 to skip SentencePiece's reserved special tokens.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)  # each id must fit in one byte
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Read one token id per byte, undo the -3 shift, and detokenize.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='uint8')
    return bpe_processor.decode_ids([int(id_) + 3 for id_ in u8_array])
```
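A minimal round-trip check on the demo values above, as a sketch: it assumes the `384_bit_comp.model` file from this repository is in the working directory and that the two functions above have been defined.

```py
# Decode the demo hex string back to the 384-bit example, then
# re-encode it to confirm the compression is lossless.
compressed = "1ab2ed09d7a9617206894e0608"
recovered = decode_id(compressed)

print(recovered)                           # the 384-bit example above
print(len(recovered))                      # 384
assert encode_id(recovered) == compressed  # lossless round trip
```

Since each token id is written as two hex characters, the compressed size scales with the number of pieces the tokenizer needs to cover the input; bitarrays resembling the go_emotions training distribution should compress best.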