---
inference: false
language: sv
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-sv-20240220
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).

This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220). Only the first two paragraphs of each article are included.

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to restrict matches to commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

## Example

```python
from txtai.embeddings import Embeddings
import json

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-sv-wikipedia")

# Run a search
for x in embeddings.search("I vilken stad ligger Liseberg?", 1):
    print(json.dumps(x, indent=2))

# Run a search and filter on popular results (page views).
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('I vilken stad ligger Liseberg?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.

See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.
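
As a rough illustration of the RAG use case, the sketch below turns a list of search hits into an LLM prompt. The helper `build_prompt`, its `max_chars` limit, the prompt wording, and the sample hit are all assumptions made for illustration; they are not part of txtai or of this index. In practice the list would come from `embeddings.search(...)`, which returns dictionaries as in the example above.

```python
def build_prompt(question, results, max_chars=2000):
    """Join search hits into a context block and wrap it in a QA prompt.

    Hypothetical helper: stops adding hits once max_chars of context is reached.
    """
    context_parts = []
    used = 0
    for hit in results:
        text = hit["text"].strip()
        if used + len(text) > max_chars:
            break
        context_parts.append(text)
        used += len(text)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative stand-in for search output, not an actual result from the index
sample_results = [
    {"id": "Liseberg", "text": "Liseberg är en nöjespark i Göteborg.", "score": 0.9},
]

prompt = build_prompt("I vilken stad ligger Liseberg?", sample_results)
print(prompt)
```

The resulting string can be sent to any LLM. Capping the context length is one simple way to stay within a model's input limit; other truncation strategies would work just as well.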

# Source

- https://dumps.wikimedia.org/svwiki/20240220/dumpstatus.json
- https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-02/pageviews-202402-user.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream1.xml-p1p153415.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream2.xml-p153416p666977.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream3.xml-p666978p1690769.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p1690770p3190769.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream4.xml-p3190770p3794371.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p3794372p5294371.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream5.xml-p5294372p6319736.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p6319737p7819736.bz2
- https://dumps.wikimedia.org/svwiki/20240220/svwiki-20240220-pages-articles-multistream6.xml-p7819737p8827284.bz2
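
The monthly page view dump listed above is the kind of data a `percentile` field can be derived from. The sketch below shows one plausible way to turn raw view counts into a 0 to 1 percentile rank. It is an assumption for illustration only and not necessarily how this index computed its values; `percentile_ranks` and the sample counts are hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(views):
    """Map each title to the fraction of pages with strictly fewer views.

    Hypothetical helper: `views` is a dict of page title -> view count.
    """
    counts = sorted(views.values())
    total = len(counts)
    return {title: bisect_left(counts, n) / total for title, n in views.items()}


# Made-up view counts, not taken from the actual dump
sample = {"A": 1000, "B": 10, "C": 100, "D": 1}
ranks = percentile_ranks(sample)
print(ranks)
```

Under this definition, a filter such as `percentile >= 0.99` would keep only pages viewed more often than 99% of all pages.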