metadata

license: gemma
datasets:
  - flytech/python-codes-25k
widget:
  - text: write a simple python function
    example_title: Example 1
  - text: write a python program using flask
    example_title: Example 2
  - text: make a todo list using python
    example_title: Example 3
  - text: print current date and time using python
    example_title: Example 4
language:
  - en
pipeline_tag: text-generation

Gemma-2b-it-finetuned-python-codes

This model card describes a fine-tuned version of the Gemma-2b-it model. For details on the base model, see the model card of the 2B Gemma Instruct model.

Author: Dishank Shah

Description

GifPC-2b (Gemma-2b-it-finetuned-python-codes) is an LLM fine-tuned on a dataset of Python code snippets. The fine-tuning aims to improve Gemma-2b-it's handling of Python syntax, semantics, and common programming patterns, so that the model can help with writing, understanding, and debugging Python code. Users can ask it for guidance on Python-related issues, for explanations of code logic, and for help identifying potential errors in their programs.

Usage

Below is a code snippet to help you get started quickly with running the model. First run pip install -U transformers, then copy the snippet from the section that matches your use case.

Running the model on Google Colab CPU

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "shahdishank/gemma-2b-it-finetune-python-codes"
HUGGING_FACE_TOKEN = "YOUR_TOKEN"  # replace with your Hugging Face access token

# Authenticate with the token when downloading the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HUGGING_FACE_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_name, token=HUGGING_FACE_TOKEN)

# Prompt template with a placeholder for the user query
prompt_template = """\
  user:\n{query} \n\n assistant:\n
  """
prompt = prompt_template.format(query="write a simple python function")  # write your query here

# Tokenize the prompt, sample a completion, and decode it back to text
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
outputs = model.generate(**inputs, max_new_tokens=2000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
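
For decoder-only models, generate returns the prompt tokens followed by the newly generated ones, so the decoded response above starts with the prompt text. Here is a minimal sketch of one way to keep only the completion, assuming a single (unbatched) sequence. The helper name completion_ids is hypothetical and not part of transformers, and the token ids in the toy example are made up.

```python
def completion_ids(sequence, prompt_length):
    """Return only the tokens generated after the prompt.

    `sequence` is one row of the `generate` output (prompt + new tokens);
    `prompt_length` is the number of prompt tokens. Works on any sliceable
    sequence, such as a list or a 1-D tensor.
    """
    if prompt_length < 0 or prompt_length > len(sequence):
        raise ValueError("prompt_length must be between 0 and len(sequence)")
    return sequence[prompt_length:]


# Toy illustration with made-up token ids: 3 prompt tokens, 2 generated ones.
toy_output = [2, 1645, 108, 1293, 517]
print(completion_ids(toy_output, 3))  # [1293, 517]
```

With the snippet above, the helper would receive outputs[0] and the number of tokens in the tokenized prompt, and its result would then be passed to tokenizer.decode.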

Model Data

The model was fine-tuned on the python-codes-25k dataset (flytech/python-codes-25k).

Training Dataset

The model was trained on a text dataset that includes a wide variety of Python code. Each record has the following fields:

Instruction: The instructional task to be performed / User input.
Input: A very short introductory part of the AI response, or empty.
Output: Python code that accomplishes the task.
Text: All fields combined together.

Because the records cover many different tasks, this data helps the model handle a wide range of Python coding requests.
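
To make the relationship between the four fields listed above concrete, here is a minimal sketch of how a Text value could be assembled from the other three. The separator and the rule for skipping an empty Input are assumptions for illustration; the dataset's actual formatting may differ, and combine_fields is a hypothetical helper.

```python
def combine_fields(instruction, input_text, output, sep="\n"):
    """Join the non-empty fields of one record into a single string.

    The separator is an assumption; the real dataset may use another one.
    """
    parts = [instruction, input_text, output]
    return sep.join(part for part in parts if part)


# Hypothetical record in the Instruction / Input / Output layout.
record = {
    "instruction": "Write a function that adds two numbers",
    "input": "",  # the Input field may be empty
    "output": "def add(a, b):\n    return a + b",
}
text = combine_fields(record["instruction"], record["input"], record["output"])
print(text)
```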

Intended Usage

This LLM can be used for:

Code generation
Debugging
Learning and understanding various Python coding styles
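
As an illustration of the debugging use case, the sketch below wraps a failing snippet and its error message into a single query. The wording of the query, the helper name build_debug_prompt, and the exact whitespace of the template are assumptions; only the user/assistant layout follows the prompt shown earlier in this card.

```python
# Assumed template in the user/assistant layout; whitespace is illustrative.
PROMPT_TEMPLATE = "user:\n{query} \n\n assistant:\n"


def build_debug_prompt(code, error_message):
    """Build a prompt asking the model to find and fix a bug."""
    query = (
        "Find and fix the bug in this Python code.\n\n"
        f"Code:\n{code}\n\n"
        f"Error:\n{error_message}"
    )
    return PROMPT_TEMPLATE.format(query=query)


# Example: a function that references an undefined name.
buggy_code = "def mean(xs):\n    return sum(xs) / len(x)"
prompt = build_debug_prompt(buggy_code, "NameError: name 'x' is not defined")
print(prompt)
```

The resulting string can be tokenized and passed to the model in the same way as any other prompt.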