Commit · 459d697
Parent(s): 4379b79
January 2024 data update

Browse files:
- README.md +40 -6
- config.json +6 -6
- documents +2 -2
- embeddings +2 -2
README.md
CHANGED

@@ -8,16 +8,14 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-
+- neuml/wikipedia-20240101
 ---
 
 # Wikipedia txtai embeddings index
 
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
 
-This index is built from the [
-Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index.
-This is similar to an abstract of the article.
+This index is built from the [Wikipedia January 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240101). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
 
 It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.

@@ -49,7 +47,43 @@ embeddings.search("""
 
 An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
 
-The Wikipedia index works well as a fact-based context source for
-context in which to answer questions.
+The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
 
 See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.
+
+## Build the index
+
+The following steps show how to build this index. These scripts use the latest data available as of 2024-01-01; update as appropriate.
+
+- Install required build dependencies
+```bash
+pip install txtchat mwparserfromhell datasets
+```
+
+- Download and build pageviews database
+```bash
+mkdir -p pageviews/data
+wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2023/2023-12/pageviews-202312-user.bz2
+python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
+```
+
+- Build Wikipedia dataset
+```python
+from datasets import load_dataset
+
+# Data dump date from https://dumps.wikimedia.org/enwiki/
+date = "20240101"
+
+# Build and save dataset
+ds = load_dataset("neuml/wikipedia", language="en", date=date)
+ds.save_to_disk(f"wikipedia-{date}")
+```
+
+- Build txtai-wikipedia index
+```bash
+python -m txtchat.data.wikipedia.index \
+  -d wikipedia-20240101 \
+  -o txtai-wikipedia \
+  -v pageviews/pageviews.sqlite
+```
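The build steps above use pageview counts to derive the `percentile` field. The exact computation inside `txtchat` is not shown in this commit; as an illustration only, a page's percentile can be taken as the fraction of pages with an equal or lower view count:

```python
from bisect import bisect_right

def percentiles(views):
    """Return each page's percentile: the fraction of pages with an equal
    or lower view count. Illustrative sketch only; the actual txtchat
    implementation may differ."""
    counts = sorted(views.values())
    n = len(counts)
    return {page: bisect_right(counts, count) / n for page, count in views.items()}

# Toy pageview counts (hypothetical values, not real data)
ranks = percentiles({"Python": 900, "Earth": 500, "Obscure_Topic": 3})
```

A filter such as `percentile >= 0.99` in a search query then restricts matches to the most commonly visited pages, as the README describes.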
config.json
CHANGED

@@ -14,15 +14,15 @@
   "content": true,
   "dimensions": 768,
   "backend": "faiss",
-  "offset":
+  "offset": 6172387,
   "build": {
-    "create": "
-    "python": "3.
+    "create": "2024-01-10T20:34:13Z",
+    "python": "3.8.18",
     "settings": {
-      "components": "
+      "components": "IVF2222,SQ8"
     },
     "system": "Linux (x86_64)",
-    "txtai": "
+    "txtai": "6.4.0"
   },
-  "update": "
+  "update": "2024-01-10T20:34:13Z"
 }
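For readers unfamiliar with Faiss factory strings, the `components` value above describes the index layout: `IVF2222` is an inverted-file index with 2222 coarse partitions, and `SQ8` stores vectors with 8-bit scalar quantization. A small sketch (not txtai code) of pulling those parameters out of the string:

```python
# Faiss index factory string from the config above:
#   IVF2222 -> inverted file index with 2222 partitions (coarse clusters)
#   SQ8     -> 8-bit scalar quantization of the stored vectors
components = "IVF2222,SQ8"
coarse, quantizer = components.split(",")
nlist = int(coarse.removeprefix("IVF"))   # number of IVF partitions
bits = int(quantizer.removeprefix("SQ"))  # bits per vector component
```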
documents
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:3074eee183918963965a7cdb7c6371150131bf9d99e04d141b20095cd8183b2c
+size 3237478400
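The `documents` and `embeddings` files are stored with Git LFS, so the commit only changes the pointer files: simple key/value lines recording the spec version, content hash and size in bytes. An illustrative parser (not part of the repository):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS v1 pointer file into a dict of key -> value."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

# Pointer content for the documents file, from the diff above
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3074eee183918963965a7cdb7c6371150131bf9d99e04d141b20095cd8183b2c
size 3237478400"""

info = parse_lfs_pointer(pointer)  # info["size"] -> "3237478400"
```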
embeddings
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:6f4dea2a142773ae579125431036c7adb632e597a7cd9ed04fbe5473e5f83201
+size 4796622400