alanztymarqo committed on
Commit 27ceae9
1 Parent(s): 03396fd

Update README.md

Files changed (1)
  1. README.md +5 -124
README.md CHANGED
@@ -5430,95 +5430,13 @@ license: mit
 
 
 
-# Updates
-
-New open-source models and ToDoList will be listed on https://github.com/DunZhang/Stella/blob/main/news_and_todo.md.
-
-You can also find these models on my [homepage](https://huggingface.co/infgrad).
-
-# Introduction
-
-The models are trained based on `Alibaba-NLP/gte-large-en-v1.5` and `Alibaba-NLP/gte-Qwen2-1.5B-instruct`. Thanks for
-their contributions!
-
-**We simplify usage of prompts, providing two prompts for most general tasks, one is for s2p, another one is for s2s.**
-
-Prompt of s2p task(e.g. retrieve task):
-
-```text
-Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {query}
-```
-
-Prompt of s2s task(e.g. semantic textual similarity task):
-
-```text
-Instruct: Retrieve semantically similar text.\nQuery: {query}
-```
-
-The models are finally trained by [MRL](https://arxiv.org/abs/2205.13147), so they have multiple dimensions: 512, 768,
-1024, 2048, 4096, 6144 and 8192.
-
-The higher the dimension, the better the performance.
-**Generally speaking, 1024d is good enough.** The MTEB score of 1024d is only 0.001 lower than 8192d.
-
-# Model directory structure
-
-The model directory structure is very simple, it is a standard SentenceTransformer directory **with a series
-of `2_Dense_{dims}`
-folders**, where `dims` represents the final vector dimension.
-
-For example, the `2_Dense_256` folder stores Linear weights that convert vector dimensions to 256 dimensions.
-Please refer to the following chapters for specific instructions on how to use them.
-
-# Usage
-
-You can use `SentenceTransformers` or `transformers` library to encode text.
-
-## Sentence Transformers
-
-```python
-from sentence_transformers import SentenceTransformer
-
-# This model supports two prompts: "s2p_query" and "s2s_query" for sentence-to-passage and sentence-to-sentence tasks, respectively.
-# They are defined in `config_sentence_transformers.json`
-query_prompt_name = "s2p_query"
-queries = [
-    "What are some ways to reduce stress?",
-    "What are the benefits of drinking green tea?",
-]
-# docs do not need any prompts
-docs = [
-    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
-    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
-]
-
-# !The default dimension is 1024, if you need other dimensions, please clone the model and modify `modules.json` to replace `2_Dense_1024` with another dimension, e.g. `2_Dense_256` or `2_Dense_8192` !
-# on gpu
-model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True).cuda()
-# you can also use this model without the features of `use_memory_efficient_attention` and `unpad_inputs`. It can be worked in CPU.
-# model = SentenceTransformer(
-#     "dunzhang/stella_en_400M_v5",
-#     trust_remote_code=True,
-#     device="cpu",
-#     config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}
-# )
-query_embeddings = model.encode(queries, prompt_name=query_prompt_name)
-doc_embeddings = model.encode(docs)
-print(query_embeddings.shape, doc_embeddings.shape)
-# (2, 1024) (2, 1024)
-
-similarities = model.similarity(query_embeddings, doc_embeddings)
-print(similarities)
-# tensor([[0.8398, 0.2990],
-#         [0.3282, 0.8095]])
-```
 
 ## Transformers
 
 ```python
 import os
 import torch
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModel, AutoTokenizer, AutoConfig
 from sklearn.preprocessing import normalize
 
 query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
@@ -5534,23 +5452,11 @@ docs = [
 ]
 
 # The path of your model after cloning it
-model_dir = "{Your MODEL_PATH}"
-
-vector_dim = 1024
-vector_linear_directory = f"2_Dense_{vector_dim}"
+model_dir = "Marqo/dunzhang-stella_en_400M_v5"
 model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
-# you can also use this model without the features of `use_memory_efficient_attention` and `unpad_inputs`. It can be worked in CPU.
-# model = AutoModel.from_pretrained(model_dir, trust_remote_code=True,use_memory_efficient_attention=False,unpad_inputs=False).cuda().eval()
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
-vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
-vector_linear_dict = {
-    k.replace("linear.", ""): v for k, v in
-    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin")).items()
-}
-vector_linear.load_state_dict(vector_linear_dict)
-vector_linear.cuda()
 
-# Embed the queries
+
 with torch.no_grad():
     input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
     input_data = {k: v.cuda() for k, v in input_data.items()}
@@ -5558,7 +5464,7 @@ with torch.no_grad():
     last_hidden_state = model(**input_data)[0]
     last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
     query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
-    query_vectors = normalize(vector_linear(query_vectors).cpu().numpy())
+    query_vectors = normalize(query_vectors.cpu().numpy())
 
 # Embed the documents
 with torch.no_grad():
@@ -5568,7 +5474,7 @@ with torch.no_grad():
     last_hidden_state = model(**input_data)[0]
     last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
     docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
-    docs_vectors = normalize(vector_linear(docs_vectors).cpu().numpy())
+    docs_vectors = normalize(docs_vectors.cpu().numpy())
 
 print(query_vectors.shape, docs_vectors.shape)
 # (2, 1024) (2, 1024)
@@ -5579,28 +5485,3 @@ print(similarities)
 # [0.32818374 0.80954516]]
 ```
 
-# FAQ
-
-Q: The details of training?
-
-A: The training method and datasets will be released in the future. (specific time unknown, may be provided in a paper)
-
-Q: How to choose a suitable prompt for my own task?
-
-A: In most cases, please use the s2p and s2s prompts. These two prompts account for the vast majority of the training
-data.
-
-Q: How to reproduce MTEB results?
-
-A: Please use evaluation scripts in `Alibaba-NLP/gte-Qwen2-1.5B-instruct` or `intfloat/e5-mistral-7b-instruct`
-
-Q: Why each dimension has a linear weight?
-
-A: MRL has multiple training methods, we choose this method which has the best performance.
-
-Q: What is the sequence length of models?
-
-A: 512 is recommended, in our experiments, almost all models perform poorly on specialized long text retrieval datasets. Besides, the
-model is trained on datasets of 512 length. This may be an optimization term.
-
-If you have any questions, please start a discussion on community.
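
For reference, below is a minimal sketch of how the `## Transformers` example reads after this commit, assembled from the context and `+` lines in the diff above and folded into a small `embed` helper for brevity. The document strings are shortened stand-ins, and the lines the hunks do not show (prepending the instruction prefix to the queries, the `attention_mask` lookup, and the final dot-product similarity) are reconstructed here, so details may differ slightly from the full README. With the `2_Dense_{dims}` projection dropped, the embeddings are simply the mean-pooled, L2-normalized hidden states, whose width equals the model's hidden size (1024 for this 400M model, consistent with the `(2, 1024)` shapes kept in the diff).

```python
import torch
from sklearn.preprocessing import normalize
from transformers import AutoModel, AutoTokenizer  # the commit also adds AutoConfig; it is not needed in this sketch

# s2p instruction prefix for queries; documents are embedded without a prompt.
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + q for q in queries]
docs = [
    "Deep breathing, meditation, and physical activity are common ways to reduce stress.",  # shortened stand-in
    "Green tea contains antioxidants and has been associated with improved heart health.",  # shortened stand-in
]

model_dir = "Marqo/dunzhang-stella_en_400M_v5"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)


def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens, then L2-normalize the rows."""
    with torch.no_grad():
        input_data = tokenizer(texts, padding="longest", truncation=True, max_length=512, return_tensors="pt")
        input_data = {k: v.cuda() for k, v in input_data.items()}
        attention_mask = input_data["attention_mask"]
        last_hidden_state = model(**input_data)[0]
        last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
        vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    return normalize(vectors.cpu().numpy())


query_vectors = embed(queries)
docs_vectors = embed(docs)
print(query_vectors.shape, docs_vectors.shape)  # (2, 1024) per the shapes printed in the README

# Because both sides are L2-normalized, the dot product is the cosine similarity.
similarities = query_vectors @ docs_vectors.T
print(similarities)
```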