Wikipedia txtai embeddings index

This is a txtai embeddings index (5GB embeddings + 25GB documents) for the English edition of Wikipedia.

Embeddings is the engine that delivers semantic search. Data is transformed into embedding vectors, where similar concepts produce similar vectors. An embeddings index generated by txtai is a fully encapsulated index format that doesn't require a database server.
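
To make "similar concepts produce similar vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three vectors are made up for illustration and are not taken from this index; real embeddings come from a model and have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only
album = [0.9, 0.1, 0.2]
record = [0.8, 0.2, 0.3]
volcano = [0.1, 0.9, 0.4]

# Related concepts score higher than unrelated ones
print(cosine(album, record) > cosine(album, volcano))  # True
```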

This index is built from the Wikipedia October 2024 dataset. It works well as a fact-based context source for retrieval augmented generation (RAG). It also uses Wikipedia Page Views data to add a percentile field, which can be used to match only commonly visited pages.
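
As a rough sketch of the RAG use case, search results can be joined into a context block and placed in front of a question. The build_prompt helper, its prompt wording, and the max_chars limit are assumptions made for illustration, not part of txtai. The results list is a stand-in for the list of dicts with a "text" field that embeddings.search returns in the example code.

```python
def build_prompt(question, results, max_chars=2000):
    """Join search results into a context block, truncating each passage."""
    # Results hold full articles, so cap each passage at max_chars
    context = "\n\n".join(r["text"][:max_chars] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Stand-in for search output; the values are invented for illustration
results = [{"id": "Example page", "text": "Example article text.", "score": 0.9}]

prompt = build_prompt("What is the example about?", results)
print(prompt)
```

The resulting string can be passed to any language model. Truncating each passage is one simple way to keep the prompt within a model's context limit.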

txtai must be installed (for example, with pip install txtai) to use this index.

Example code

from txtai.embeddings import Embeddings
import json

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search
for x in embeddings.search("Bob Dylans second album", 1):
  print(x["text"])

# Run a search and filter on popular results (page views).
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
  print(json.dumps(x, indent=2))

Example output

The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)",
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}

Data source

https://dumps.wikimedia.org/enwiki/

https://dumps.wikimedia.org/other/pageview_complete/

https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020
