Spaces: xu-song/tokenizer-arena
xu-song committed on Jul 13, 2024

Commit: 47e1616

1 Parent(s): 0d55475

add doc

Files changed (2)

compression_app.py CHANGED

@@ -43,12 +43,12 @@ Lossless tokenization preserves the exact original text, i.e. `decoded_text = in
 - **Compression Rate** <br>
 There are mainly two types of metric to represent the `input_text`:
-  - `byte-level`: the number of bytes in the given text
-  - `char-level`: the number of characters in the given text.
-To evaluate compression rate, simple metrics can be "how many bytes per token" or "how many chars per token". <br>
-In this leaderboard, we adopt more frequently used metric: "how many billion tokens per gigabytes corpus" and "how many chars
-per token", i.e. `b_tokens/g_bytes` and `char/token`.
 💬 [Discussions is Welcome](https://huggingface.co/spaces/eson/tokenizer-arena/discussions)
 """
@@ -141,7 +141,11 @@ with gr.Blocks(theme=theme) as demo:
                 "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
             )
-    gr.Markdown("## 🏆 Compression Rate Leaderboard")
     search_bar = gr.Textbox(
         placeholder="🔍 Search by tokenizer or organization (e.g., 'llama', 'openai') and press ENTER...",
         show_label=False,

 - **Compression Rate** <br>
 There are mainly two types of metric to represent the `input_text`:
+  - `char-level`: the number of characters in the given text
+  - `byte-level`: the number of bytes in the given text.
+To evaluate compression rate, simple metrics are "how many chars per token" or "how many bytes per token". <br>
+In this leaderboard, we adopt the more frequently used metrics: "how many chars per token" and
+"how many billion tokens per gigabyte of corpus", i.e. `char/token` and `b_tokens/g_bytes`.
 💬 [Discussions is Welcome](https://huggingface.co/spaces/eson/tokenizer-arena/discussions)
 """
                 "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
             )
+    gr.Markdown("## 🏆 Compression Rate Leaderboard\n"
+                "The leaderboard aims to evaluate tokenizer performance on different languages.\n"
+                "A lower `oov_ratio` means fewer out-of-vocabulary tokens.\n"
+                "A higher `char/token` means fewer words are segmented into subwords."
+                )
     search_bar = gr.Textbox(
         placeholder="🔍 Search by tokenizer or organization (e.g., 'llama', 'openai') and press ENTER...",
         show_label=False,
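
The metrics named in the doc text above can be illustrated with a small sketch. This is not the code in `compression_util.py`: the helper `compression_metrics` is hypothetical, whitespace splitting stands in for a real tokenizer, and both "billion" and "giga" are assumed to mean 10^9 (the project may define `g_bytes` differently).

```python
def compression_metrics(text: str, tokens: list[str]) -> dict[str, float]:
    """Compute char-level and byte-level compression metrics for one text."""
    n_chars = len(text)                  # char-level size of the input
    n_bytes = len(text.encode("utf-8"))  # byte-level size (UTF-8 assumed)
    n_tokens = len(tokens)
    return {
        "char/token": n_chars / n_tokens,
        "byte/token": n_bytes / n_tokens,
        # Assumption: billion = 1e9 tokens, gigabyte = 1e9 bytes.
        "b_tokens/g_bytes": (n_tokens / 1e9) / (n_bytes / 1e9),
    }


text = "héllo wörld"
tokens = text.split()  # stand-in for tokenizer output, illustration only
metrics = compression_metrics(text, tokens)
print(metrics)
```

Because "é" and "ö" each take two bytes in UTF-8, the byte count here (13) exceeds the character count (11), which is why the char-level and byte-level metrics can rank tokenizers differently on non-ASCII text.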

compression_util.py CHANGED

@@ -295,9 +295,12 @@ def get_compression_leaderboard(
     if return_type == "dataframe":
         token_number_unit, file_size_unit = unit.split("/")
         reverse_unit = f"{file_size_unit}/{token_number_unit}"
-        stats = to_dataframe(stats, [unit, reverse_unit, "char/token"])
-        stats = stats.sort_values(["oov_ratio", unit], ascending=[True, True])
-        stats = stats.rename(columns={"oov_ratio": f' ⬆️oov_ratio'}).rename(columns={unit: f' ⬆️{unit}'})  # ⬇
     return stats

     if return_type == "dataframe":
         token_number_unit, file_size_unit = unit.split("/")
         reverse_unit = f"{file_size_unit}/{token_number_unit}"
+        stats = to_dataframe(stats, ["char/token", unit, reverse_unit])
+        stats = stats.sort_values(["oov_ratio", "char/token"], ascending=[True, False])
+        # stats = stats.sort_values(["oov_ratio", unit], ascending=[True, True])
+        stats = stats.rename(columns={"oov_ratio": f' ⬆️oov_ratio'}).rename(columns={"char/token": ' ⬇️char/token'})  #
     return stats
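
The new ordering (ascending `oov_ratio`, then descending `char/token`) can be sketched in plain Python without pandas. The rows and tokenizer names below are made up for illustration and are not leaderboard results.

```python
# Hypothetical rows; values are illustrative only.
rows = [
    {"tokenizer": "tok_a", "oov_ratio": 0.0, "char/token": 3.2},
    {"tokenizer": "tok_b", "oov_ratio": 0.0, "char/token": 4.1},
    {"tokenizer": "tok_c", "oov_ratio": 0.01, "char/token": 5.0},
]

# Lowest oov_ratio first; ties broken by highest char/token
# (negating the value gives descending order for that key).
ranked = sorted(rows, key=lambda r: (r["oov_ratio"], -r["char/token"]))
order = [r["tokenizer"] for r in ranked]
print(order)
```

With this key, `tok_c` ranks last despite the best `char/token`, because `oov_ratio` is compared first.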