---
base_model: Qwen/Qwen2.5-7B
library_name: peft
language:
- en
license: agpl-3.0
datasets:
- OramaSearch/nlp-to-query-small
---

# Query Translator Mini

This repository contains a fine-tuned version of the Qwen 2.5 7B model, specialized in translating natural language queries into structured Orama search queries.

The model is fine-tuned with LoRA through the PEFT library, so only a small adapter is trained and loaded on top of the base model.

## Model Details

### Model Description

The Query Translator Mini model converts natural language queries into structured JSON queries compatible with the Orama search engine.

It understands the field types and query operators of a given schema, so it can produce both full-text and filtered searches.

### Key Features

- Translates natural language to structured Orama queries
- Supports multiple field types: string, number, boolean, enum, and the array types `string[]` and `enum[]`
- Handles query operators: `gt`, `gte`, `lt`, `lte`, `eq`, `between`, `in`, `nin`, `containsAll`
- Supports nested properties with dot notation
- Works with both full-text search and filtered queries
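
To make the dot-notation feature concrete, here is a minimal sketch of how a nested schema maps to the dotted field names that appear in a `where` clause. The helper name `flatten_schema` is hypothetical and is not part of Orama or of this repository.

```python
def flatten_schema(schema, prefix=""):
    """Map a nested schema to {dotted.path: type} pairs (illustrative helper)."""
    flat = {}
    for key, value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and keep extending the dotted path.
            flat.update(flatten_schema(value, path))
        else:
            flat[path] = value
    return flat


schema = {
    "title": "string",
    "reviews": {"score": "number", "text": "string"},
}
flat = flatten_schema(schema)
print(flat)
```

A filter on the nested `score` field would then use the key `reviews.score`.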

## Usage

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM_PROMPT = """
You are a tool used to generate synthetic data of Orama queries. Orama is a full-text, vector, and hybrid search engine.

Let me show you what you need to do with some examples.

Example:
- Query: `"What are the red wines that cost less than 20 dollars?"`
- Schema: `{ "name": "string", "content": "string", "price": "number", "tags": "enum[]" }`
- Generated query: `{ "term": "", "where": { "tags": { "containsAll": ["red", "wine"] }, "price": { "lt": 20 } } }`

Another example:
- Query: `"Show me 5 prosecco wines good for aperitif"`
- Schema: `{ "name": "string", "content": "string", "price": "number", "tags": "enum[]" }`
- Generated query: `{ "term": "prosecco aperitif", "limit": 5 }`

One last example:
- Query: `"Show me some wine reviews with a score greater than 4.5 and less than 5.0."`
- Schema: `{ "title": "string", "content": "string", "reviews": { "score": "number", "text": "string" } }]`
- Generated query: `{ "term": "", "where": { "reviews.score": { "between": [4.5, 5.0] } } }`

The rules to generate the query are:

- Never use an "embedding" field in the schema.
- Every query has a "term" field that is a string. It represents the full-text search terms. Can be empty (will match all documents).
- You can use a "where" field that is an object. It represents the filters to apply to the documents. Its keys and values depend on the schema of the database:
  - If the field is a "string", you should not use operators. Example: `{ "where": { "title": "champagne" } }`.
  - If the field is a "number", you can use the following operators: "gt", "gte", "lt", "lte", "eq", "between". Example: `{ "where": { "price": { "between": [20, 100] } } }`. Another example: `{ "where": { "price": { "lt": 20 } } }`.
  - If the field is an "enum", you can use the following operators: "eq", "in", "nin". Example: `{ "where": { "tags": { "containsAll": ["red", "wine"] } } }`.
  - If the field is an "string[]", it's gonna be just like the "string" field, but you can use an array of values. Example: `{ "where": { "title": ["champagne", "montagne"] } }`.
  - If the field is a "boolean", you can use the following operators: "eq". Example: `{ "where": { "isAvailable": true } }`. Another example: `{ "where": { "isAvailable": false } }`.
  - If the field is a "enum[]", you can use the following operators: "containsAll". Example: `{ "where": { "tags": { "containsAll": ["red", "wine"] } } }`.
- Nested properties are supported. Just translate them into dot notation. Example: `{ "where": { "author.name": "John" } }`.
- Array of numbers are not supported.
- Array of booleans are not supported.

Return just a JSON object, nothing more.
"""

QUERY = "Show me some wine reviews with a score greater than 4.5 and less than 5.0."

# The schema must contain the fields the query refers to (here, reviews.score).
SCHEMA = {
    "title": "string",
    "content": "string",
    "reviews": {"score": "number", "text": "string"},
}

base_model_name = "Qwen/Qwen2.5-7B"
adapter_path = "OramaSearch/query-translator-mini"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

print("Loading base model...")
# device_map="auto" already places the weights on the available device(s),
# so no explicit model.cuda() call is needed afterwards.
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

print("Loading fine-tuned adapter...")
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

if torch.cuda.is_available():
    print(f"GPU memory after loading: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Query: {QUERY}\nSchema: {json.dumps(SCHEMA)}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
)

# Decode only the newly generated tokens, not the echoed prompt.
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
```
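
The decoded text is a string, so it still has to be parsed before it can be sent to Orama. The sketch below shows one way to pull the first JSON object out of the model's completion. It assumes the text holds only the completion and not the echoed prompt, which itself contains JSON examples. The function name `extract_query` is hypothetical.

```python
import json


def extract_query(text):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores anything that follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


sample = 'Here you go: { "term": "", "where": { "reviews.score": { "between": [4.5, 5.0] } } }'
query = extract_query(sample)
print(query)
```

Returning `None` instead of raising lets the caller decide whether to retry generation or fall back to a plain full-text search.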

## Training Details

The model was trained on an NVIDIA H100 SXM using the following configuration:

- Base Model: Qwen 2.5 7B
- Training Method: LoRA
- Quantization: 4-bit quantization using bitsandbytes
- LoRA Configuration:
  - Rank: 16
  - Alpha: 32
  - Dropout: 0.1
  - Target Modules: Attention layers and MLP
- Training Arguments:
  - Epochs: 3
  - Batch Size: 2
  - Learning Rate: 5e-5
  - Gradient Accumulation Steps: 8
  - FP16 Training: Enabled
  - Gradient Checkpointing: Enabled
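
As a quick sanity check on these numbers, the following sketch derives two quantities implied by the configuration: the effective batch size per optimizer step and the LoRA scaling factor. It assumes a single GPU (the card mentions one H100) and the standard LoRA scaling of alpha divided by rank. The variable names are illustrative and are not taken from the training script.

```python
# Values copied from the training configuration listed above.
per_device_batch_size = 2
gradient_accumulation_steps = 8
lora_rank = 16
lora_alpha = 32

# Assumption: one GPU, so each optimizer step sees batch_size * accumulation samples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Assumption: standard LoRA scaling (alpha / rank), not a rank-stabilized variant.
lora_scaling = lora_alpha / lora_rank

print(f"effective batch size: {effective_batch_size}")
print(f"LoRA scaling factor: {lora_scaling}")
```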

## Supported Query Types

The model can handle various types of queries including:

1. Simple text search:

```json
{
  "term": "prosecco aperitif",
  "limit": 5
}
```

2. Numeric range queries:

```json
{
  "term": "",
  "where": {
    "price": {
      "between": [20, 100]
    }
  }
}
```

3. Tag-based filtering:

```json
{
  "term": "",
  "where": {
    "tags": {
      "containsAll": ["red", "wine"]
    }
  }
}
```
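
Because the model's output is generated text, it can be worth checking a query against the schema before running it. The sketch below applies the operator rules stated in the system prompt from the Usage section. The names `ALLOWED_OPERATORS`, `field_type`, and `validate_query` are hypothetical, and the check is deliberately partial: it only inspects `term` and the operators used in `where`.

```python
# Operators allowed per field type, as listed in the system prompt.
ALLOWED_OPERATORS = {
    "number": {"gt", "gte", "lt", "lte", "eq", "between"},
    "enum": {"eq", "in", "nin"},
    "enum[]": {"containsAll"},
    "boolean": {"eq"},
}


def field_type(schema, dotted_path):
    """Walk a nested schema along a dotted path; return the type string or None."""
    node = schema
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def validate_query(query, schema):
    """Return a list of problems found in `query` (empty if none were found)."""
    problems = []
    if not isinstance(query.get("term"), str):
        problems.append('"term" must be a string')
    for path, condition in query.get("where", {}).items():
        ftype = field_type(schema, path)
        if ftype is None:
            problems.append(f"unknown field: {path}")
        elif isinstance(condition, dict):
            # Operator objects are only valid for types that define operators.
            allowed = ALLOWED_OPERATORS.get(ftype, set())
            for op in condition:
                if op not in allowed:
                    problems.append(f"operator {op!r} not allowed on {ftype} field {path}")
    return problems


schema = {"name": "string", "price": "number", "tags": "enum[]"}
good = {"term": "", "where": {"price": {"between": [20, 100]}}}
bad = {"term": "", "where": {"name": {"gt": 3}, "color": "red"}}
print(validate_query(good, schema))
print(validate_query(bad, schema))
```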

## Limitations

- Does not support arrays of numbers or booleans
- Maximum input length is 1024 tokens
- Embedding fields are not supported in the schema
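
A schema can be screened for these unsupported fields before it is placed in the prompt. The sketch below is one possible check, not part of this repository. It assumes that arrays are written as `number[]` and `boolean[]`, and that an embedding field is either named `embedding` or typed as `vector[...]`; both conventions are assumptions about how the schema is written. It does not cover the token limit, which would need the tokenizer.

```python
UNSUPPORTED_TYPES = {"number[]", "boolean[]"}


def unsupported_fields(schema, prefix=""):
    """Return dotted paths of fields the model is not expected to handle."""
    found = []
    for key, value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            found.extend(unsupported_fields(value, path))
        elif (
            value in UNSUPPORTED_TYPES
            or key == "embedding"  # assumed naming convention
            or str(value).startswith("vector[")  # assumed vector type syntax
        ):
            found.append(path)
    return found


schema = {
    "title": "string",
    "scores": "number[]",
    "embedding": "vector[384]",
    "meta": {"flags": "boolean[]", "tags": "enum[]"},
}
flagged = unsupported_fields(schema)
print(flagged)
```

Fields reported this way could be dropped from the schema before building the user message.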

## Citation

If you use this model in your research, please cite:

```
@misc{query-translator-mini,
  author = {OramaSearch Inc.},
  title = {Query Translator Mini: Natural Language to Orama Query Translation},
  year = {2024},
  publisher = {HuggingFace},
  journal = {HuggingFace Repository},
  howpublished = {\url{https://huggingface.co/OramaSearch/query-translator-mini}}
}
```

## License

AGPLv3

## Acknowledgments

This model builds upon the Qwen 2.5 7B model and uses techniques from the PEFT library. Special thanks to the teams behind these projects.