txtai-sv-wikipedia / README.md
metadata
inference: false
language: sv
license:
  - cc-by-sa-3.0
  - gfdl
library_name: txtai
tags:
  - sentence-similarity
datasets:
  - NeuML/wikipedia-20240101

Wikipedia txtai embeddings index

This is a txtai embeddings index for the Swedish edition of Wikipedia.

This index is built from the Wikipedia February 2024 dataset. Only the first two paragraphs of each article are included.

It also uses Wikipedia Page Views data to add a percentile field to each article. The percentile field can be used to match only commonly visited pages.

txtai must be installed to use this model.

Example

from txtai.embeddings import Embeddings

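# Load the index from the HF Hub
# Note: neuml/txtai-wikipedia is the English index; to load this Swedish index,
# set container to this repository's id on the Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")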

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the Top 1% of articles
embeddings.search("""
   SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
   percentile >= 0.99
""")

Source

https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json

https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-02/pageviews-202402-user.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream1.xml-p1p153415.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream2.xml-p153416p666977.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream3.xml-p666978p1690769.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p1690770p3190769.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p3190770p3794371.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p3794372p5294371.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p5294372p6319736.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p6319737p7819736.bz2

https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p7819737p8827284.bz2

Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
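
As a rough sketch of the RAG pattern described above, the snippet below turns search results into an LLM prompt. It assumes results are a list of dicts with a "text" key, which matches the columns selected in the SQL example; build_prompt, max_chars and the sample rows are illustrative stand-ins, not part of txtai or of this index.

```python
def build_prompt(question, results, max_chars=2000):
    """Join search result texts into a context block and wrap it in a prompt."""
    parts = []
    used = 0
    for row in results:
        text = row["text"].strip()
        # Stop adding passages once the context budget is exhausted
        if used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text)

    context = "\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Stand-in rows shaped like search output (assumption: dicts with id, text, score)
sample_results = [
    {"id": "Stockholm", "text": "Stockholm är Sveriges huvudstad.", "score": 0.9},
    {"id": "Göteborg", "text": "Göteborg är Sveriges näst största stad.", "score": 0.7},
]

prompt = build_prompt("Vad är Sveriges huvudstad?", sample_results)
```

In practice, sample_results would be replaced by the output of embeddings.search(...), and the prompt would be sent to whichever LLM is in use.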

See this article for additional examples on how to use this model.