Kwaipilot OASIS-1.3B

Model Details

Model Name: OASIS (Optimized Augmentation Strategy for Improved code Search)

Introduction

OASIS is a code embedding model developed by Kwaipilot. It combines three proprietary techniques: repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function. Together these give it state-of-the-art accuracy on code search benchmarks.

Intended Use

This model is intended for developers and researchers who build or improve code retrieval systems. It is best suited to tasks that require semantic understanding and retrieval of code snippets across varied programming contexts.

Training and Performance

OASIS was trained on a synthetic dataset created through repository-level analysis, which exposes it to a broad range of coding styles and languages. It achieves state-of-the-art performance on recent code search benchmarks (see the Performance table below).

Future Directions

Kwaipilot's upcoming initiatives include:

Open sourcing improved models.
Releasing technical reports.
Releasing natural language processing models.
...

Performance

Model	Size	CoSQA	AdvTest	CSN-Py	CSN-Java	CSN-JS	CSN-PHP	CSN-Go	CSN-Ruby	Avg
Openai-Embedding-Ada-002	Unknown	0.4423	0.3808	0.6802	0.7149	0.6750	0.6062	0.8563	0.7472	0.6378
jina-embeddings-v2-base-code	161M	0.6837	0.3850	0.6634	0.6803	0.6304	0.5701	0.8595	0.7095	0.6477
CodeSage-large	1.3B	0.4753	0.5267	0.7077	0.7021	0.6950	0.6133	0.8371	0.7192	0.6595
CodeFuse-CGE-Small	3.8B	0.5619	0.4639	0.6958	0.6863	0.6564	0.6133	0.8637	0.7341	0.6594
OASIS-1.3B	1.3B	0.5532	0.4861	0.7110	0.7199	0.6727	0.6217	0.8732	0.7333	0.6713
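
The Avg column appears to be the unweighted mean of the eight benchmark columns. The sketch below recomputes it from the table values as a sanity check. The dictionary layout and the helper name `mean` are choices made here for illustration, and the small differences in the fourth decimal are presumably because the published averages were computed from unrounded scores.

```python
# Scores copied from the table above: the eight benchmark columns, left to right.
scores = {
    "Openai-Embedding-Ada-002": [0.4423, 0.3808, 0.6802, 0.7149, 0.6750, 0.6062, 0.8563, 0.7472],
    "jina-embeddings-v2-base-code": [0.6837, 0.3850, 0.6634, 0.6803, 0.6304, 0.5701, 0.8595, 0.7095],
    "CodeSage-large": [0.4753, 0.5267, 0.7077, 0.7021, 0.6950, 0.6133, 0.8371, 0.7192],
    "CodeFuse-CGE-Small": [0.5619, 0.4639, 0.6958, 0.6863, 0.6564, 0.6133, 0.8637, 0.7341],
    "OASIS-1.3B": [0.5532, 0.4861, 0.7110, 0.7199, 0.6727, 0.6217, 0.8732, 0.7333],
}

# Values from the Avg column of the same table.
reported_avg = {
    "Openai-Embedding-Ada-002": 0.6378,
    "jina-embeddings-v2-base-code": 0.6477,
    "CodeSage-large": 0.6595,
    "CodeFuse-CGE-Small": 0.6594,
    "OASIS-1.3B": 0.6713,
}


def mean(values):
    """Unweighted arithmetic mean of a list of floats."""
    return sum(values) / len(values)


recomputed = {name: mean(values) for name, values in scores.items()}

for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported_avg[name]:.4f}")
```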

Usage

Direct Usage

pip install -U torch
pip install -U transformers

Avoid torch==2.5.0 when loading the model with torch_dtype=torch.bfloat16. For best performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
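
One way to guard against the affected release is to check the version string before loading the model. The following is a minimal sketch: the helper name `is_affected_torch_version` is hypothetical, and it assumes that any build whose release segment is exactly 2.5.0 (including local tags such as `+cu121`) should be flagged.

```python
import re


def is_affected_torch_version(version: str) -> bool:
    """Return True when the release segment of `version` is exactly 2.5.0."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False  # unrecognized format: do not flag it
    return tuple(int(part) for part in match.groups()) == (2, 5, 0)


# Typical use (requires torch to be installed):
#   import torch
#   if is_affected_torch_version(torch.__version__):
#       raise RuntimeError("Use torch <= 2.4.1 or >= 2.5.1 with bfloat16")
print(is_affected_torch_version("2.5.0+cu121"))
print(is_affected_torch_version("2.5.1"))
```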

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoModel, AutoTokenizer

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

# Add query prompt
def get_query_prompt(query: str):
    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
    prompt = f'Instruct: {query_description}\nQuery: {query}'
    return prompt

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")

# Tokenize and run inference; gradients are not needed to extract embeddings
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Last token pooling
embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)
# torch.Size([3, 2048])

embeddings = F.normalize(embeddings, dim=1, p=2)
similarity = embeddings @ embeddings.T
print(similarity[0, 1:])
# tensor([0.6495, 0.8036])
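
To see which position `last_token_pool` reads for each sequence, the torch-free sketch below mirrors its index logic on plain Python lists. The helper name `last_token_indices` and the toy masks are made up here for illustration and are not part of the model's API.

```python
def last_token_indices(attention_mask):
    """Return, per row, the index of the last real (mask == 1) token.

    Mirrors the branch in last_token_pool: if every row ends in a real
    token, the batch is treated as left-padded and the final position is
    used; otherwise the real tokens are counted from the left.
    """
    left_padding = all(row[-1] == 1 for row in attention_mask)
    if left_padding:
        return [len(row) - 1 for row in attention_mask]
    return [sum(row) - 1 for row in attention_mask]


# Toy batches of two sequences, padded to length 5.
right_padded = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
left_padded = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

print(last_token_indices(right_padded))  # [2, 4]
print(last_token_indices(left_padded))   # [4, 4]
```

In both cases the selected position holds the final real token, so the pooled embedding does not depend on which padding side the tokenizer uses.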

Sentence Transformers

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
# To load in bfloat16, add `import torch` and pass model_kwargs={"torch_dtype": torch.bfloat16}
model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

# Run inference
query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])

print(code_embeddings.shape)
# (2, 2048)

# Get the similarity scores for the embeddings
print(model.similarity(query_embedding[0], code_embeddings[0]))
print(model.similarity(query_embedding[0], code_embeddings[1]))
# tensor([[0.6495]])
# tensor([[0.8036]])
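
With more than two candidates, the usual next step is to sort them by similarity to the query. The sketch below shows that ranking step in plain Python, assuming cosine similarity as the scoring function. The names `cosine` and `rank_candidates` are hypothetical, and the 3-dimensional vectors are toy values standing in for real 2048-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs, top_k=None):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy vectors only; real inputs would be the embeddings produced above.
query_vec = [1.0, 0.0, 1.0]
candidate_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_candidates(query_vec, candidate_vecs, top_k=2)
print([index for index, _ in ranking])  # [1, 2]
```

With real embeddings, the same helper could be fed `query_embedding[0].tolist()` and `code_embeddings.tolist()`, since `encode` returns NumPy arrays by default. For large candidate sets a vectorized or indexed search would be more appropriate than this loop.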

BibTeX

@misc{kwaipilotoasis,
  title = {Optimized Augmentation Strategy for Improved code Search},
  author = {Kwaipilot team},
  year = {2024},
}