Model Description
Machine learning models such as tensorflow-compress use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with dynamic sapient technology: it is a SentencePiece unigram model fit on the go_emotions dataset, and it compresses sparse bit strings much better than run-length encoding (RLE).
- Developed by: Ziv Arin
- Model type: Sentence similarity lossless compression
- License: CC0-1.0
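For context on the RLE comparison above, here is a minimal run-length-encoding baseline. This is an illustrative sketch, not the codec the model was actually benchmarked against: each run of identical bits is stored as a single length byte, alternating between zeros and ones.
# Minimal RLE baseline (illustrative sketch, not the author's benchmark implementation).
# Runs alternate starting with '0'; each run length is stored in one byte, capped at 255.
def rle_encode(bits: str) -> bytes:
    lengths = []
    current, run = "0", 0
    for b in bits:
        if b == current:
            if run == 255:  # cap hit: emit the run plus a zero-length run of the other symbol
                lengths.extend([255, 0])
                run = 0
            run += 1
        else:
            lengths.append(run)
            current, run = b, 1
    lengths.append(run)
    return bytes(lengths)
On the 384-bit example below (13 isolated ones), a scheme like this needs roughly 27 run-length bytes, noticeably more than the 13 bytes the tokenizer emits.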
Demo
Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (104-bit: 13 one-byte token IDs, shown as 26 hex characters): 1ab2ed09d7a9617206894e0608 — a 72.9% space saving over the 384-bit input (1 − 104/384 ≈ 0.729)
from collections import Counter

import numpy as np
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string into vocabulary pieces.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    # Shift IDs down by 3 (SentencePiece reserves 0-2 for <unk>, <s>, </s>) so each fits in one byte.
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    # Pack every one-byte ID as two hex characters.
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Undo the hex packing, widening before the +3 shift to avoid uint8 overflow.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1').astype(int) + 3
    # Map the restored IDs back to their pieces.
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens
# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)
print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)
count = Counter(encoded_tokens)
print("count:", count)
Output:
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
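Because the model type above is lossless compression, it is worth verifying the round trip. The snippet below is a small sketch that assumes the only transformation to undo is SentencePiece's '▁' word-boundary marker on the first piece:
# Rebuild the original bit string from the decoded pieces and compare (expected: True).
reconstructed = "".join(decoded_tokens).replace("▁", "")  # strip SentencePiece's '▁' marker
print("lossless?:", reconstructed == new_sentence)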
Bias, Risks, and Limitations
The model has no sentient bias, only algorithmic bias; it is not a living thing, so there is nothing to worry about on that front.
The model does not compress bit strings with fewer zeros (i.e., denser strings) as well.
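As a quick illustration of this limitation (a hypothetical check; the dense test string below is invented for the example), denser inputs split into more pieces and therefore more bytes:
# Hypothetical density check: a half-ones string should tokenize into many more pieces
# than the sparse demo string, shrinking (or eliminating) the space saving.
dense_sentence = "10" * 192  # 384 bits, 50% ones
print("sparse bytes:", len(encode_id(new_sentence)) // 2)
print("dense bytes: ", len(encode_id(dense_sentence)) // 2)  # may even trip the 0-255 assert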
Environmental Impact
- Hardware Type: Intel Core i5-9300H
- Hours used: 3