WordLLama - Indic
Inspired by WordLlama, this model is trained from the word embeddings of Sarvam-1, which supports most Indic languages. We trained it on a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli.
The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.
How to use
Install the fork of WordLlama:
pip install "wordllama @ git+https://github.com/tinisoft/WordLlama.git"
Download the weights and tokenizer:
git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic
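Before running the example, it can help to confirm that the clone actually contains the three files the code loads. The sketch below is only an illustration: the helper name check_clone is hypothetical, and it assumes the weights are stored with Git LFS (typical for Hugging Face repos), in which case a clone made without git-lfs leaves small pointer files in place of the real weights.

```python
from pathlib import Path

# File names taken from the usage example in this README.
REQUIRED_FILES = ["tokenizer.json", "sarvam1_2b_128.safetensors", "sarvam1_2b.toml"]

# A Git LFS pointer file starts with this line instead of the real content.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_clone(repo_dir="."):
    """Return a list of human-readable problems found in the cloned repo.

    Hypothetical helper, not part of WordLlama.
    """
    problems = []
    for name in REQUIRED_FILES:
        path = Path(repo_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        # Read only the first bytes to detect an unresolved LFS pointer.
        with path.open("rb") as fh:
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                problems.append(f"{name}: Git LFS pointer, run `git lfs pull`")
    return problems


if __name__ == "__main__":
    for problem in check_clone("."):
        print(problem)
```

An empty result means all three files are present and none of them is an LFS pointer.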
The model can then be used like this:
from wordllama import WordLlamaInference, WordLlamaConfig, WordLlama
from safetensors import safe_open
import toml
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
# Load the 128-dimensional embedding matrix from the safetensors file
with safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("embedding.weight").numpy()
config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)
wl = WordLlamaInference(
    embedding=embedding,
    tokenizer=tokenizer,
    config=config,
    binary=False,
)
# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score) # Output: e.g., 0.0664
# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score) # Output: e.g., 0.075
# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்",
    "நான் கடைக்கு சென்றேன்",
    "நான் லாரி சென்றேன்",
    "நான் வாகனத்தில் சென்றேன்",
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్కి వెళ్లాను",
    "నేను మార్కెట్కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను",
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
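For intuition about what the similarity and rank calls compute, here is a minimal pure-Python sketch. It assumes the usual WordLlama approach of averaging token embeddings into a sentence vector and comparing vectors with cosine similarity; the fork may add steps (such as normalization or truncation) that are not shown. The function names and the toy 4-dimensional vectors are made up for illustration and are not part of the library.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, candidates):
    """Sort (label, vector) pairs by similarity to the query, best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token vectors (illustrative values, not real model embeddings)
sentence_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
sentence_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

vec_a = mean_pool(sentence_a)
vec_b = mean_pool(sentence_b)
print(round(cosine_similarity(vec_a, vec_b), 4))  # 0.5

ranking = rank_by_similarity(vec_a, [("b", vec_b), ("a", vec_a)])
print([label for label, _ in ranking])  # ['a', 'b']
```

Because scores depend only on the pooled vectors, sentences that share many tokens tend to score higher regardless of word order.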