jaspercatapang committed on
Commit
79d923d
1 Parent(s): 6d6070e

Upload initial files

Files changed (8)
  1. README.md +68 -0
  2. config.json +32 -0
  3. logo.png +0 -0
  4. model.safetensors +3 -0
  5. special_tokens_map.json +37 -0
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +65 -0
  8. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ tags:
+ - retrieval
+ - retriever
+ - rag
+ inference: false
+ ---
+
+ <img src="logo.png" width=25%>
+
+ # Model Description
+ RoBERTa ReRanker for Retrieved Results, or **R*** (pronounced R-star), is a cross-encoder model that reranks retrieved passages to improve the relevance and accuracy of search results. Placed between a retriever and a generative model, it scores each candidate passage against the query so that the most relevant passages come first. Based on the RoBERTa tiny architecture ([RoBERTa tiny](https://huggingface.co/haisongzhang/roberta-tiny-cased)), **R*** is specialized in distinguishing relevant from irrelevant query-passage pairs, thereby refining the output of LLMs in retrieval and generative tasks.
+
+ ## Training Data
+ R* was trained on a dataset derived from the MS MARCO passage ranking dataset, consisting of 2.5 million query-positive passage pairs and an equal number of query-negative passage pairs, totaling 5 million query-passage pairs. This ensures a balanced training approach, exposing R* to both relevant and irrelevant examples equally.
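Review note: the README does not say how the positive and negative pairs were assembled. Below is a minimal sketch of one way to get the 1:1 balance described above, assuming the source data arrives as (query, positive passage, negative passage) triples. The function name `build_balanced_pairs` and the sample triples are hypothetical and are not the author's pipeline.

```python
import random

def build_balanced_pairs(triples, seed=0):
    """Expand (query, positive, negative) triples into labeled pairs.

    Each triple yields one relevant pair (label 1.0) and one irrelevant
    pair (label 0.0), so the result is balanced by construction.
    """
    pairs = []
    for query, positive, negative in triples:
        pairs.append((query, positive, 1.0))
        pairs.append((query, negative, 0.0))
    # Shuffle with a fixed seed so positives and negatives are interleaved
    # reproducibly instead of strictly alternating.
    random.Random(seed).shuffle(pairs)
    return pairs

# Made-up sample data, only to exercise the function.
triples = [
    ("what is a reranker",
     "A reranker reorders retrieved passages by relevance.",
     "Bananas are rich in potassium."),
    ("capital of france",
     "Paris is the capital of France.",
     "The Nile flows through Egypt."),
]
pairs = build_balanced_pairs(triples)
positives = sum(1 for _, _, label in pairs if label == 1.0)
print(len(pairs), positives)
```

Under this assumption, 2.5 million triples would expand to the 5 million pairs the README reports.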
+
+ ## Training Procedure
+ Training focused on binary classification, aiming to assign a continuous relevance score ranging from 0 (irrelevant) to 1 (relevant) for each query-passage pair. The model was trained for 7 epochs with a batch size of 2048 on a Colab Pro instance with a V100 GPU (16 GB VRAM) and 51 GB RAM, which took approximately 16 hours.
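Review note: the README names neither the loss function nor the number of optimizer steps. The sketch below assumes binary cross-entropy on a single raw logit, which is a common choice for this kind of 0-to-1 relevance target but is not confirmed by the source. It also derives step counts from the figures quoted above, assuming every one of the 5 million pairs is seen once per epoch. `bce_with_logits` and `mean_loss` are illustrative names.

```python
import math

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit.

    label is 0.0 (irrelevant) or 1.0 (relevant). The formula is
    max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def mean_loss(logits, labels):
    """Average the per-pair loss over a batch."""
    return sum(bce_with_logits(z, y) for z, y in zip(logits, labels)) / len(logits)

# Figures quoted in the README: 5M pairs, batch size 2048, 7 epochs, ~16 hours.
num_pairs, batch_size, epochs, hours = 5_000_000, 2048, 7, 16
steps_per_epoch = math.ceil(num_pairs / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
seconds_per_step = hours * 3600 / total_steps

print(mean_loss([2.0, -1.0], [1.0, 0.0]))
print(steps_per_epoch, total_steps, round(seconds_per_step, 2))
```

Under these assumptions the run works out to 2,442 steps per epoch and 17,094 steps overall, or roughly 3.4 seconds per step on the V100.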
+
+ ## Evaluation and Performance
+ Coming soon.
+
+ ## Use Cases
+ R* is particularly suitable for applications that demand high precision in information retrieval, such as RAG reranking, search engine results, document searching in legal or academic databases, recommendation systems, and beyond.
+
+ ## How to Use
+ ### With Transformers
+ For usage with the Transformers library, you can follow this generic example:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model = AutoModelForSequenceClassification.from_pretrained('NLPinas/R-star')
+ tokenizer = AutoTokenizer.from_pretrained('NLPinas/R-star')
+
+ # The first list holds the queries and the second the passages; element i of each list forms one pair.
+ features = tokenizer(['Your query here', 'Your query here'], ['First relevant passage for first query', 'Second relevant passage for second query'], padding=True, truncation=True, max_length=512, return_tensors="pt")
+
+ model.eval()
+ with torch.no_grad():
+     scores = model(**features).logits
+     print(scores)
+ ```
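Review note: the Transformers example prints raw logits, while the Training Procedure section describes scores between 0 and 1. Since `config.json` in this commit lists a single label, the logits should have shape `(batch, 1)`, and a sigmoid maps them to that range. The sketch below uses made-up logit values and plain Python so it runs without the model. `sigmoid` and `rank_by_logit` are illustrative helpers, not part of any library.

```python
import math

def sigmoid(z):
    """Map a raw logit to (0, 1) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rank_by_logit(passages, logits):
    """Pair each passage with its sigmoid score, highest score first."""
    scored = [(passage, sigmoid(z)) for passage, z in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up values shaped like model(**features).logits.tolist()
logits = [[-1.2], [2.3]]
flat = [row[0] for row in logits]  # drop the single-label dimension
ranked = rank_by_logit(["First passage", "Second passage"], flat)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```

As far as I know, the SentenceTransformers `CrossEncoder.predict` path applies a sigmoid by default for single-label models, so this conversion should only be needed on the raw Transformers path. That is worth verifying against the installed version.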
+
+ ### With SentenceTransformers
+ ```python
+ from sentence_transformers import CrossEncoder
+ model = CrossEncoder('NLPinas/R-star', max_length=512)
+ # Each tuple is one (query, passage) pair; predict returns one score per pair.
+ scores = model.predict([('Your query here', 'First relevant passage for first query'), ('Your query here', 'Second relevant passage for second query')])
+ ```
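Review note: both usage examples stop at the scores, whereas the Use Cases section is about reranking. Below is a minimal sketch of that last step, assuming a scoring callable that takes a list of (query, passage) pairs and returns one score per pair, which matches how `model.predict` is called above. `rerank` is an illustrative name, and `overlap_scorer` is a hypothetical word-overlap stand-in so the example runs without downloading the model.

```python
def rerank(query, passages, score_fn, top_k=3):
    """Score every (query, passage) pair and return the top_k passages."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], float(scores[i])) for i in order[:top_k]]

def overlap_scorer(pairs):
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores

candidates = [
    "Paris is the capital of France.",
    "The Nile flows through Egypt.",
    "France borders Spain.",
]
top = rerank("capital of France", candidates, overlap_scorer, top_k=2)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```

With the real model, passing `model.predict` from the snippet above in place of `overlap_scorer` should work without other changes.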
+
+ ## Limitations
+ Based on our evaluation, R* tends to score longer passages higher, which can bias rankings; many cross-encoder models share this tendency. We advise preprocessing text to normalize passage lengths so that scores are compared fairly. Note that R* is optimized for passage-level comparisons and may not perform well on word- or phrase-level similarity tasks.
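Review note: the README recommends normalizing passage lengths but does not say how. One possible approach, sketched below under that caveat, is to split long passages into overlapping windows of a fixed word count and score each window separately. The window size of 100 words, the stride of 50, and the name `split_into_windows` are arbitrary illustrative choices.

```python
def split_into_windows(text, max_words=100, stride=50):
    """Split text into overlapping windows of at most max_words words.

    Short texts come back as a single window; long texts are cut every
    `stride` words so neighbouring windows share some context.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    windows = []
    for start in range(0, len(words), stride):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return windows

# A synthetic 250-word passage, only to exercise the function.
long_passage = " ".join(f"w{i}" for i in range(250))
windows = split_into_windows(long_passage)
print(len(windows), [len(w.split()) for w in windows])
```

Each window can then be scored against the query, taking for example the maximum window score as the passage score. Whether that reduces the length bias for R* would need to be measured.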
+
+ ## Ethical Considerations
+ The use of R* introduces several ethical considerations, including potential biases in the training data, privacy concerns, and the implications of automating decision-making processes. Users are encouraged to critically evaluate the model's fairness and transparency, ensuring its equitable use across diverse demographics.
+
+ ## Contact Details
+ For additional information or inquiries about R*, please contact the developer via jasperkylecatapang@gmail.com
+
+ ## Disclaimer
+ R* is an AI language model from the online community, NLPinas. It is provided "as is" without warranty of any kind, express or implied. The model developers and NLPinas shall not be liable for any direct or indirect damages arising from the use of this model.
+
+ ## Acknowledgments
+ Thank you to Microsoft for the MS MARCO dataset. We would also like to extend our gratitude to [Haisong Zhang](https://huggingface.co/haisongzhang) for the base model.
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "NLPinas/R-star-epoch-5",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 512,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 2048,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 4,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.38.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 28996
+ }
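Review note: the config declares `BertForSequenceClassification`, so the parameter count can be estimated from the values above and compared with the `model.safetensors` size in this commit. The sketch below assumes the usual BERT layout (embeddings, per-layer attention and feed-forward blocks, a pooler, and a classification head), one output label as suggested by the single `id2label` entry, and 4 bytes per float32 weight. `count_bert_classifier_params` is an illustrative helper.

```python
def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

def count_bert_classifier_params(cfg, num_labels=1):
    h = cfg["hidden_size"]
    inter = cfg["intermediate_size"]
    embeddings = (
        cfg["vocab_size"] * h                 # word embeddings
        + cfg["max_position_embeddings"] * h  # position embeddings
        + cfg["type_vocab_size"] * h          # segment embeddings
        + layer_norm(h)
    )
    per_layer = (
        4 * linear(h, h)      # query, key, value, attention output
        + layer_norm(h)
        + linear(h, inter)    # feed-forward up-projection
        + linear(inter, h)    # feed-forward down-projection
        + layer_norm(h)
    )
    pooler = linear(h, h)
    classifier = linear(h, num_labels)
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler + classifier

config = {
    "hidden_size": 512,
    "intermediate_size": 2048,
    "num_hidden_layers": 4,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "vocab_size": 28996,
}
total_params = count_bert_classifier_params(config)
weight_bytes = total_params * 4      # float32
file_bytes = 111939740               # size recorded for model.safetensors
print(total_params, weight_bytes, file_bytes - weight_bytes)
```

This gives 27,982,849 parameters, or 111,931,396 bytes of weights. The remaining 8,344 bytes of the file are presumably the safetensors header, which suggests the layout assumption is close to what was actually saved.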
logo.png ADDED
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:794c5c56e37d5fb294ab7170cd0a9422b2f8c1b98b832c444fd642ca4b25ae5c
+ size 111939740
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
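Review note: `model_max_length` above equals `int(1e30)`, which Transformers uses as a "no limit set" sentinel, while `config.json` allows only 512 positions. As far as I understand, `truncation=True` without an explicit `max_length` falls back to `model_max_length`, so callers should pass `max_length=512` themselves. The sketch below shows one way to derive a safe limit from the two files. `effective_max_length` is a hypothetical helper, and the dictionaries copy only the relevant keys.

```python
VERY_LARGE = int(1e30)  # sentinel value seen in tokenizer_config.json

def effective_max_length(tokenizer_cfg, model_cfg):
    """Return the largest sequence length that is safe to feed the model.

    Uses the tokenizer's declared limit unless it is missing or larger
    than the number of position embeddings the model actually has.
    """
    declared = tokenizer_cfg.get("model_max_length", VERY_LARGE)
    return min(declared, model_cfg["max_position_embeddings"])

tokenizer_cfg = {"model_max_length": 1000000000000000019884624838656}
model_cfg = {"max_position_embeddings": 512}

limit = effective_max_length(tokenizer_cfg, model_cfg)
print(limit, tokenizer_cfg["model_max_length"] == VERY_LARGE)
```

The resulting value can be passed as `max_length` to the tokenizer call, matching the `max_length=512` already used in the SentenceTransformers example.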
vocab.txt ADDED
The diff for this file is too large to render. See raw diff