Model Card — J-Raposo/code-search-net-tokenizer

Model name

J-Raposo/code-search-net-tokenizer

Short description

A GPT-2–style tokenizer (byte-level BPE) retrained on the CodeSearchNet (Python) dataset to better tokenize Python source code (identifiers, punctuation, docstrings, and common code tokens). Trained following the Hugging Face LLM course (Chapter 6, Section 1 — training/retraining a GPT-2 tokenizer).


Model details

  • Author: J-Raposo (Hugging Face username: J-Raposo)
  • Model type: Tokenizer (Byte-level BPE, GPT-2 style)
  • Language(s): Python (source code), English (comments & docstrings)
  • License: [To be defined by repo owner — e.g., mit, apache-2.0]
  • Intended use: Tokenization for code modeling tasks (code search, code completion, summarization, classification, and fine-tuning code LLMs on Python).
  • Not intended for: Producing runnable or secure code without downstream model fine-tuning; this tokenizer only affects tokenization behavior, not model logic or correctness of generated code.

Summary

This tokenizer is a byte-level BPE tokenizer (GPT-2 style) retrained on the CodeSearchNet Python subset (loaded with datasets.load_dataset("code_search_net", "python")). It aims to produce more meaningful sub-token splits for Python source code by (a) preserving punctuation and operators as informative tokens, (b) reducing excessive fragmentation of common identifiers and API names, and (c) handling docstrings and comments so that natural language context is preserved for downstream models.
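
To see the effect in practice, the retrained tokenizer can be compared against the base gpt2 tokenizer on a small snippet. This is only illustrative: the exact splits depend on the merges learned during training, and the repo must already be available on the Hub.

from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
code_tok = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer")

sample = 'def parse_config(path):\n    """Load a JSON config file."""\n    return json.load(open(path))'

print(base.tokenize(sample))      # base GPT-2 typically fragments indentation and identifiers heavily
print(code_tok.tokenize(sample))  # retrained merges usually keep common code idioms together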


Training data

  • Dataset: CodeSearchNet — Python subset (loaded via datasets.load_dataset("code_search_net", "python")); a retraining sketch follows this list.
  • Preprocessing: Source files and docstrings were extracted, and common normalization steps were applied (e.g., newline normalization). Comments and docstrings were retained to preserve natural language context alongside code.
  • Notes: Tokenizer was trained only on the Python portion; tokenization quality for other languages (JavaScript, Java, C, etc.) may be lower.
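
The retraining recipe broadly follows the Hugging Face course: stream the Python functions from the dataset and call train_new_from_iterator on the base gpt2 tokenizer. A minimal sketch (the whole_func_string column name follows the course example, and the target vocabulary size matches this card):

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("code_search_net", "python")

def training_corpus():
    # Yield the training split in chunks to avoid materializing everything in memory
    train = raw_datasets["train"]
    for start in range(0, len(train), 1000):
        yield train[start : start + 1000]["whole_func_string"]

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus(), vocab_size=50257)
tokenizer.save_pretrained("code-search-net-tokenizer")
# tokenizer.push_to_hub("code-search-net-tokenizer")  # requires a logged-in huggingface_hub session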

Tokenizer details / configuration

  • Tokenizer type: Byte-level BPE (GPT-2–style / tokenizers fast API).
  • Vocabulary size: 50,257 (GPT-2 default)
  • Special tokens: standard GPT-2 special tokens (e.g., <|endoftext|>), plus any custom tokens added during training. Ensure tokenizer_config.json in the repo lists them.
  • Normalization: Byte-level normalization (works with arbitrary byte sequences / UTF-8).
  • Files included: tokenizer.json (preferred tokenizers fast format) or vocab.json + merges.txt (legacy), plus tokenizer_config.json; see the inspection snippet below.
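
A quick way to verify this configuration after loading from the Hub (the printed values are expectations from this card, not guarantees):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer")

print(len(tokenizer))                # vocabulary size; the card lists 50,257
print(tokenizer.is_fast)             # True when the fast tokenizer.json format is used
print(tokenizer.special_tokens_map)  # should list the GPT-2 special token(s), e.g. <|endoftext|>

# save_pretrained writes tokenizer.json and tokenizer_config.json (plus legacy files where applicable)
tokenizer.save_pretrained("local-tokenizer-copy")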

Uses

Direct Use

  • Tokenize Python code and docstrings for input into language models.
  • Use as a drop-in tokenizer when fine-tuning GPT-2–style or encoder-decoder models for code tasks, provided they support this tokenizer format (a minimal sketch follows this list).
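
For the drop-in case, one common pattern (similar to the Hugging Face course's from-scratch pretraining setup, and only a sketch here) is to size a fresh GPT-2 architecture to this tokenizer:

from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer")

# Reuse the gpt2 architecture but tie vocab size and special-token ids to this tokenizer
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)  # randomly initialized; must still be trained on tokenized code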

Downstream Use

  • Fine-tuning code generation or code search LLMs.
  • Preprocessing pipelines for supervised tasks on code (classification, summarization, code-to-text).
  • As a tokenizer for dataset preparation when pretraining or fine-tuning models on code corpora (see the preprocessing example below).
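
A minimal preprocessing sketch with datasets.map (the dataset slice, column name, and max_length below are assumptions; substitute your own corpus and sequence length):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer")
raw = load_dataset("code_search_net", "python", split="train[:1%]")

def tokenize_fn(batch):
    # Tokenize whole function strings, truncating overly long functions
    return tokenizer(batch["whole_func_string"], truncation=True, max_length=512)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw.column_names)
print(tokenized)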

Out-of-Scope Use

  • This tokenizer alone does not produce correct or secure code — it only affects token representation. Use caution when deploying downstream models that generate or modify code; do not rely on tokenization to ensure correctness or security.

Bias, Risks, and Limitations

  • Data bias: The tokenizer reflects distributional properties of public repositories in CodeSearchNet: common libraries, styles, and naming conventions are better represented than niche or private coding styles.
  • Technical limitations: Training on the Python subset causes suboptimal tokenization for other languages. Extremely long or adversarial identifiers may still be split into many sub-tokens.
  • Downstream risks: Tokenization decisions affect model training and generation; poor tokenization can amplify biases or lead to awkward or degraded outputs from downstream models. Tokenizers do not mitigate issues such as hallucinations, insecure code generation, or toxic outputs.

Recommendations

  • Use this tokenizer for Python-focused models or mixed pipelines where Python is dominant.
  • Evaluate tokenization quality on your downstream tasks (e.g., token length distributions, fragmentation of rare identifiers); a comparison snippet follows this list.
  • If you plan to use proprietary source code for training, do not upload proprietary content to public repos — consider training a private tokenizer or using private HF repos.
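
One rough way to compare token length distributions against the base gpt2 tokenizer (the sample snippets and the statistic used here are arbitrary choices; draw samples from your own corpus):

from statistics import mean
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
code_tok = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer")

samples = [
    "def add(a, b):\n    return a + b",
    "class Config(dict):\n    pass",
]  # replace with real snippets from your corpus

for name, tok in [("gpt2", base), ("code-search-net", code_tok)]:
    lengths = [len(tok.tokenize(s)) for s in samples]
    print(name, "mean tokens per snippet:", mean(lengths))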

How to get started (load & use)

Load the tokenizer directly from the Hugging Face Hub:

from transformers import AutoTokenizer

# Load the fast (Rust-backed) tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("J-Raposo/code-search-net-tokenizer", use_fast=True)

code = "def add(a, b):\n    return a + b"
enc = tokenizer(code, return_tensors="pt")  # return_tensors="pt" requires PyTorch
print(enc)
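
The returned enc contains input_ids and attention_mask. Because the tokenizer is byte-level, decoding should round-trip the snippet; as a quick sanity check (pass clean_up_tokenization_spaces=False if exact whitespace matters):

print(tokenizer.decode(enc["input_ids"][0], clean_up_tokenization_spaces=False))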