Model Description
Machine learning models such as tensorflow-compress use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with dynamic sapient technology: it is a SentencePiece unigram model fit on the go_emotions dataset, and it compresses sparse bit strings much better than run-length encoding (RLE).
- Developed by: Ziv Arin
- Model type: Sentence similarity lossless compression
- License: CC0-1.0
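For context on the RLE comparison above, here is a minimal run-length-encoding baseline. This is an illustrative sketch, not the codec the model was actually benchmarked against: each run of identical bits is stored as a single length byte, alternating between zeros and ones.
# Minimal RLE baseline (illustrative sketch, not the author's benchmark implementation).
# Runs alternate starting with '0'; each run length is stored in one byte, capped at 255.
def rle_encode(bits: str) -> bytes:
    lengths = []
    current, run = "0", 0
    for b in bits:
        if b == current:
            if run == 255:  # cap hit: emit the run plus a zero-length run of the other symbol
                lengths.extend([255, 0])
                run = 0
            run += 1
        else:
            lengths.append(run)
            current, run = b, 1
    lengths.append(run)
    return bytes(lengths)
On the 384-bit example below (13 isolated ones), a scheme like this needs roughly 27 run-length bytes, noticeably more than the 13 bytes the tokenizer emits.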
Demo
Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (104-bit: 13 one-byte token IDs, shown as 26 hex characters): 1ab2ed09d7a9617206894e0608 — a 72.9% space saving over the 384-bit input (1 − 104/384 ≈ 0.729)
from collections import Counter

import numpy as np
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string into vocabulary pieces.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    # Shift IDs down by 3 (SentencePiece reserves 0-2 for <unk>, <s>, </s>) so each fits in one byte.
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    # Pack every one-byte ID as two hex characters.
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Undo the hex packing, widening before the +3 shift to avoid uint8 overflow.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1').astype(int) + 3
    # Map the restored IDs back to their pieces.
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens
# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)
print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)
count = Counter(encoded_tokens)
print("count:", count)
Output:
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
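Because the model type above is lossless compression, it is worth verifying the round trip. The snippet below is a small sketch that assumes the only transformation to undo is SentencePiece's '▁' word-boundary marker on the first piece:
# Rebuild the original bit string from the decoded pieces and compare (expected: True).
reconstructed = "".join(decoded_tokens).replace("▁", "")  # strip SentencePiece's '▁' marker
print("lossless?:", reconstructed == new_sentence)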
Bias, Risks, and Limitations
The model has no sentient bias, only algorithmic bias; it is not a living thing, so there is nothing to worry about on that front.
The model does not compress bit strings with fewer zeros (i.e., denser strings) as well.
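As a quick illustration of this limitation (a hypothetical check; the dense test string below is invented for the example), denser inputs split into more pieces and therefore more bytes:
# Hypothetical density check: a half-ones string should tokenize into many more pieces
# than the sparse demo string, shrinking (or eliminating) the space saving.
dense_sentence = "10" * 192  # 384 bits, 50% ones
print("sparse bytes:", len(encode_id(new_sentence)) // 2)
print("dense bytes: ", len(encode_id(dense_sentence)) // 2)  # may even trip the 0-255 assert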
Environmental Impact
- Hardware Type: Intel Core i5-9300H
- Hours used: 3