Building a Local Vector Database Index with Annoy and Sentence Transformers

Community Article · Published December 5, 2024

If you've been relying on cloud vector databases for managing and searching high-dimensional data, let me show you an alternative: deploying your own vector database locally. Not only does this give you full control, it also eliminates recurring cloud costs and potential data privacy concerns. Using Annoy and Sentence Transformers, you can build a fast, efficient vector database tailored to your embedding model of choice.


What Is a Vector Database Index?

A vector database index organizes high-dimensional data (e.g., embeddings) to enable efficient similarity searches. It’s an essential component for applications like semantic search, recommendation engines, and personalization systems.

In this guide, I’ll demonstrate how you can:

  • Use Annoy for approximate nearest neighbor (ANN) searches.
  • Use Sentence Transformers to generate text embeddings.

The result? A powerful local setup that can run entirely on your machine.


Prerequisites

To follow along, ensure you have:

  • Python 3 installed.
  • Basic Python programming skills.
  • The required libraries:
    pip install annoy sentence-transformers
    

Step 1: Generating Embeddings

Embeddings are the backbone of any vector database. Here, I use a Sentence Transformers model, jinaai/jina-embeddings-v2-base-en, to generate embeddings from text.

from sentence_transformers import SentenceTransformer

# Load the embeddings model
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

# Optional: Cap the maximum sequence length (this model supports up to 8192 tokens)
model.max_seq_length = 1024

def get_embedding(texts):
    """
    Generate embeddings for a list of texts.
    """
    embeddings = model.encode(texts)
    return embeddings.tolist()
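
As a quick sanity check, you can verify that the embedding dimension matches the value the Annoy index is configured with in Step 3 (768 for this model):

vectors = get_embedding(["hello world"])
print(len(vectors[0]))  # 768 for jina-embeddings-v2-base-en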

Step 2: Managing Metadata and File Paths

Each embedding is associated with metadata, such as a title and content, stored in a JSON file per database for easy management.

import os
import json

def get_file_paths(email, db_name):
    """
    Generate paths for the index and metadata files.
    """
    base_dir = os.path.join('user_data', email, db_name)
    index_path = os.path.join(base_dir, 'vector_index.ann')
    metadata_path = os.path.join(base_dir, 'metadata.json')
    return base_dir, index_path, metadata_path

def load_metadata(email, db_name):
    """
    Load metadata from a JSON file.
    """
    _, _, metadata_path = get_file_paths(email, db_name)
    if os.path.exists(metadata_path):
        with open(metadata_path, 'r') as f:
            return json.load(f)
    return {}

def save_metadata(email, db_name, metadata):
    """
    Save metadata to a JSON file.
    """
    base_dir, _, metadata_path = get_file_paths(email, db_name)
    os.makedirs(base_dir, exist_ok=True)
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
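
The pipeline in Step 5 calls an add_vector helper to register new items. Here's a minimal sketch, consistent with the metadata schema that build_index and query_vector expect: each entry is keyed by a stringified integer ID and carries 'vector', 'title', and 'content' fields.

def add_vector(email, db_name, vector, title, content):
    """
    Store a vector and its metadata under the next available integer ID.
    """
    metadata = load_metadata(email, db_name)
    item_id = str(len(metadata))  # sequential IDs, stored as strings in JSON
    metadata[item_id] = {'vector': vector, 'title': title, 'content': content}
    save_metadata(email, db_name, metadata)
    return item_id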

Step 3: Building the Index with Annoy

Annoy allows you to create an index for efficient approximate nearest neighbor searches. Note that Annoy indexes are immutable once built, so adding new items means rebuilding the index from the stored metadata. Here's how you can build and save one locally.

from annoy import AnnoyIndex

num_trees = 10        # More trees improve accuracy at the cost of build time and index size
num_dimensions = 768  # Must match the output dimension of your embedding model

def build_index(email, db_name):
    """
    Build an Annoy index from metadata and save it locally.
    """
    base_dir, index_path, _ = get_file_paths(email, db_name)
    os.makedirs(base_dir, exist_ok=True)  # Ensure the target directory exists before saving
    index = AnnoyIndex(num_dimensions, 'angular')
    metadata = load_metadata(email, db_name)

    for item_id, item in metadata.items():
        try:
            index.add_item(int(item_id), item['vector'])
        except KeyError as e:
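            # Skip metadata entries that are missing a 'vector' key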
            print(f"Error adding item {item_id} to index: {e}")

    index.build(num_trees)
    index.save(index_path)
    print('Index built and saved.')

Step 4: Querying the Index

Once your index is built, you can query it with an embedding to find the most similar items.
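
query_vector below relies on a load_index helper. Here's a minimal sketch that mirrors build_index: it reopens the saved file if one exists and otherwise returns an empty index, which the get_n_items() check below handles gracefully.

def load_index(email, db_name):
    """
    Load a saved Annoy index from disk; return an empty index if none exists.
    """
    _, index_path, _ = get_file_paths(email, db_name)
    index = AnnoyIndex(num_dimensions, 'angular')
    if os.path.exists(index_path):
        index.load(index_path)
    return index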

def query_vector(email, db_name, vector, num_neighbors=5):
    """
    Query the Annoy index with a vector and return the closest neighbors.
    """
    index = load_index(email, db_name)
    metadata = load_metadata(email, db_name)

    if index.get_n_items() == 0:
        print('Index is empty or not loaded.')
        return []

    ids, distances = index.get_nns_by_vector(vector, num_neighbors, include_distances=True)
    results = []
    for idx, distance in zip(ids, distances):
        results.append({
            'id': idx,
            'title': metadata.get(str(idx), {}).get('title', 'N/A'),
            'content': metadata.get(str(idx), {}).get('content', 'N/A'),
            'distance': distance
        })
    return results

Step 5: Running the Full Pipeline

Let me show you how all these steps fit together.

  1. Generate Embeddings:

    texts = ["Example text 1", "Another example text"]
    vectors = get_embedding(texts)
    
  2. Add Data:

    email = "user@example.com"
    db_name = "my_database"
    
    for vector, text in zip(vectors, texts):
        add_vector(email, db_name, vector, title="Title", content=text)
    
  3. Build the Index:

    build_index(email, db_name)
    
  4. Query the Index:

    results = query_vector(email, db_name, vectors[0], num_neighbors=3)
    
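Each result is a dictionary containing the neighbor's id, title, content, and distance. With the 'angular' metric, Annoy reports sqrt(2 · (1 − cosine similarity)), so distances fall between 0 and 2 and smaller means more similar. For example:

for r in results:
    print(f"{r['distance']:.3f}  {r['title']}: {r['content']}")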

Why Annoy Over Cloud-Based Solutions?

Using Annoy for local indexing has several advantages:

  • Cost Efficiency: No recurring charges for cloud vector database services.
  • Privacy: Your data never leaves your machine.
  • Performance: Optimized for fast, in-memory approximate nearest neighbor searches.

This approach works particularly well for projects where privacy and control are critical.


Extending This Guide

You can adapt this setup to fit more complex use cases:

  1. Use domain-specific models for embeddings.
  2. Integrate persistent storage like SQLite for metadata.
  3. Experiment with Annoy parameters (e.g., num_trees and search_k) for better performance, as sketched below.
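
As a rough sketch of those knobs (using random vectors as stand-in data): num_trees is set at build time, while search_k is passed per query to get_nns_by_vector and defaults to num_trees * n.

import random
from annoy import AnnoyIndex

num_dimensions = 768
index = AnnoyIndex(num_dimensions, 'angular')
for i in range(1000):
    index.add_item(i, [random.random() for _ in range(num_dimensions)])

# More trees: higher recall, but a larger index and slower builds.
index.build(50)

# A larger search_k inspects more candidate nodes: slower but more accurate queries.
query = [random.random() for _ in range(num_dimensions)]
ids, distances = index.get_nns_by_vector(query, 5, search_k=5000, include_distances=True)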

By deploying your own vector database index, you gain control over the entire pipeline—from embedding generation to efficient retrieval. This setup is perfect for anyone looking to build fast, privacy-conscious applications without relying on third-party cloud solutions. With these tools, you now have the foundation to power semantic search, recommendation engines, or any embedding-based application.