model documentation

#2
by nazneen - opened
Files changed (1)
  1. README.md +174 -3
README.md CHANGED
@@ -1,4 +1,175 @@
1
- This model is converted from the original ANCE [repo](https://github.com/microsoft/ANCE) and fitted into Pyserini:
2
- > Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk. [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/pdf/2007.00808.pdf)
 
3
 
4
- For more details on how to use it, check our experiments in [Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-ance.md)
1
+ ---
2
+ language:
3
+ - en
4
 
5
+ ---
6
+ # Model Card for ance-msmarco-passage
7
+
8
+
9
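+ This model is converted from the original ANCE [repo](https://github.com/microsoft/ANCE) and fitted into Pyserini, a Python toolkit for reproducible information retrieval research with sparse and dense representations.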
10
+
11
+ # Model Details
12
+
13
+ ## Model Description
14
+
15
+ Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture.
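
To make the idea of first-stage retrieval in a multi-stage ranking architecture concrete, here is a minimal sketch in plain Python. It is not Pyserini's API: the function names (`first_stage`, `rerank`), the two-dimensional toy vectors, and the reranker scores are all hypothetical and chosen only for illustration. The first stage scores every passage by inner product with the query vector, as a dense retriever would, and a second stage reorders only the surviving candidates.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def first_stage(query_vec, passage_vecs, k):
    """Return the ids of the k passages with the highest inner product."""
    ranked = sorted(
        passage_vecs,
        key=lambda pid: dot(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]


def rerank(candidates, rerank_score):
    """Reorder first-stage candidates using a second scoring function."""
    return sorted(candidates, key=rerank_score, reverse=True)


# Hypothetical toy embeddings; a real encoder produces much larger vectors.
passage_vecs = {"p1": [0.9, 0.1], "p2": [0.2, 0.8], "p3": [0.7, 0.6]}
query_vec = [1.0, 0.0]

# Stage 1: cheap scoring over the whole (toy) collection.
candidates = first_stage(query_vec, passage_vecs, k=2)

# Stage 2: a stand-in for a more expensive reranker, applied to candidates only.
toy_reranker_scores = {"p1": 0.3, "p2": 0.9, "p3": 0.8}
final = rerank(candidates, lambda pid: toy_reranker_scores[pid])
```

Note that `p2` never reaches the second stage even though the toy reranker scores it highest, which is why first-stage recall matters in this kind of pipeline.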
16
+
17
+ - **Developed by:** Castorini
18
+ - **Shared by [Optional]:** Hugging Face
19
+ - **Model type:** Information retrieval
20
+ - **Language(s) (NLP):** en
21
+ - **License:** More information needed
22
+ - **Related Models:** More information needed
23
+ - **Parent Model:** RoBERTa
24
+ - **Resources for more information:**
25
+ - [GitHub Repo](https://github.com/castorini/pyserini)
26
+ - [Associated Paper](https://dl.acm.org/doi/pdf/10.1145/3404835.3463238)
27
+
28
+ # Uses
29
+
30
+
31
+ ## Direct Use
32
+
33
+ More information needed
34
+
35
+ ## Downstream Use [Optional]
36
+
37
+ More information needed
38
+
39
+ ## Out-of-Scope Use
40
+
41
+ More information needed
42
+
43
+ # Bias, Risks, and Limitations
44
+
45
+
46
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
47
+
48
+
49
+ ## Recommendations
50
+
51
+
52
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.
53
+
54
+
55
+ # Training Details
56
+
57
+ ## Training Data
58
+
59
+ More information needed
60
+
61
+ ## Training Procedure
62
+
63
+
64
+
65
+ ### Preprocessing
66
+
67
+ More information needed
68
+
69
+ ### Speeds, Sizes, Times
70
+
71
+ More information needed
72
+
73
+ # Evaluation
74
+
75
+
76
+
77
+ ## Testing Data, Factors & Metrics
78
+
79
+ ### Testing Data
80
+
81
+ The model creators note in the [associated paper](https://dl.acm.org/doi/pdf/10.1145/3404835.3463238) that:
82
+ > bag-of-words ranking with BM25 (the default ranking model) on the MS MARCO passage corpus (comprising 8.8M passages)
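
For readers unfamiliar with the BM25 baseline mentioned in the quote, the following is a minimal sketch of a textbook BM25 scoring function in plain Python. It is not the Anserini/Lucene implementation, which differs in details; the function name `bm25_scores`, the toy documents, and the whitespace tokenization are hypothetical, and the `k1` and `b` values are assumed here rather than taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 and b are assumed parameter values for this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "dense retrieval with learned representations".split(),
    "bm25 is a bag of words ranking function".split(),
    "sparse retrieval with bm25 ranking".split(),
]
query = "bm25 ranking".split()

scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy corpus the first document shares no terms with the query and scores zero, while the shorter of the two matching documents ranks first because of length normalization.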
83
+
84
+
85
+ ### Factors
86
+
87
+ More information needed
88
+
89
+ ### Metrics
90
+
91
+ More information needed
92
+
93
+ ## Results
94
+
95
+ More information needed
96
+
97
+ # Model Examination
98
+
99
+ More information needed
100
+
101
+ # Environmental Impact
102
+
103
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
104
+
105
+ - **Hardware Type:** More information needed
106
+ - **Hours used:** More information needed
107
+ - **Cloud Provider:** More information needed
108
+ - **Compute Region:** More information needed
109
+ - **Carbon Emitted:** More information needed
110
+
111
+ # Technical Specifications [optional]
112
+
113
+ ## Model Architecture and Objective
114
+ More information needed
115
+
116
+ ## Compute Infrastructure
117
+
118
+ More information needed
119
+
120
+ ### Hardware
121
+
122
+ More information needed
123
+
124
+ ### Software
125
+
126
+ For bag-of-words sparse retrieval, we have built in Anserini (written in Java) custom parsers and ingestion pipelines for common document formats used in IR research.
127
+
128
+
129
+ # Citation
130
+
131
+
132
+ **BibTeX:**
133
+
134
+ ```bibtex
135
+
136
+ @INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
137
+ author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
138
+ title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
139
+ booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
140
+ year = 2021,
141
+ pages = "2356--2362",
142
+ }
143
+ ```
144
+
145
+
146
+ # Glossary [optional]
147
+
148
+ More information needed
149
+
150
+ # More Information [optional]
151
+
152
+ More information needed
153
+
154
+ # Model Card Authors [optional]
155
+
156
+ Castorini in collaboration with Ezi Ozoani and the Hugging Face team.
157
+
158
+ # Model Card Contact
159
+
160
+ More information needed
161
+
162
+ # How to Get Started with the Model
163
+
164
+ Use the code below to get started with the model.
165
+ <details>
166
+ <summary> Click to expand </summary>
167
+
168
+ ```python
169
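+ from transformers import AutoTokenizer
+ # AnceEncoder is not part of transformers; it is provided by Pyserini
+ # (import path assumed here: pyserini.encode).
+ from pyserini.encode import AnceEncoder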
170
+
171
+ tokenizer = AutoTokenizer.from_pretrained("castorini/ance-msmarco-passage")
172
+
173
+ model = AnceEncoder.from_pretrained("castorini/ance-msmarco-passage")
174
+ ```
175
+ </details>