File size: 6,749 Bytes
2b850ea b7071bf 260e63d b7071bf 831cafb b7071bf |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 |
---
license: mit
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---
# *LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models*
LUSIFER is framework for bridging the gap between multilingual understanding and task-specific text embeddings without relying on explicit multilingual supervision. It does this by combining a multilingual encoder (providing a universal language foundation) with an LLM-based embedding model (optimized for embedding tasks), connected through a minimal set of trainable parameters. LUSIFER also introduces two stages of training process: 1) Alignment Training and 2) Representation Fine-tuning to optimize the model for zero-shot multilingual embeddings.
## Installation
To use LUSFIER, install evironment from ```environment.yaml``` (optional)
```bash
conda env create -f environment.yaml
```
After that, you can install our package from source by
```bash
git clone https://github.com/hieum98/lusifer.git
cd lusifer
pip install -e .
```
You also need to install the Flash-Attention before running the code because we use the Flash-Attention as the attention implementation in our model. You can install the Flash-Attention by running the following command:
```bash
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation
```
## Getting started
LUSIFER provides a thorough set of tools for training, evaluating, and using the model. The following sections provide a brief overview of how to use the model for training, evaluation, and inference.
### Preparing the model
LUSIFER model can be easily loaded using the `from_pretrained` method. The model can be loaded from the Hugging Face model hub by providing the model name or path to the model weights. The following code snippet demonstrates how to load the model from the Hugging Face model hub.
```python
from lusifer.models.lusifer import Lusifer
model = Lusifer.from_pretrained("Hieuman/LUSIFER")
```
### Inference
This model now returns the text embedding for any input in the form of `str` or `List[str]`. The model also can receive instruction alongside the sentence.
```python
import torch
from lusifer.models.lusifer import Lusifer
model = Lusifer.from_pretrained("Hieuman/LUSIFER")
model = model.to("cuda")
# Encoding queries using instructions
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
queries = [
"how much protein should a female eat",
"summit define",
]
q_reps = model.encode(sentences=queries)
# Encoding documents. Instruction are not required for documents
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = model.encode(sentences=documents)
# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(torch.from_numpy(q_reps), p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(torch.from_numpy(d_reps), p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)
```
## Training
### Alignment Training
To train the model in the alignment stage, run the following command:
```bash
python -m src.main \
--config_file scripts/configs/aligment_training_reconstruction_and_completion.yaml \
--nodes 1 \
--devices 4
```
It will run the alignment training on 4 GPUs with both reconstruction and completion tasks with the configuration in the `scripts/configs/aligment_training_reconstruction_and_completion.yaml` file. For more details about the configuration file, please refer to the `scripts/configs/aligment_training_reconstruction_and_completion.yaml` file and the arguments in the `lusifer/args.py` file.
We also provide the configuration file for the alignment training with the reconstruction task only in the `scripts/configs/alignment_training_reconstruction.yaml` file. We suggest using the reconstruction task only first to stabilize the training process before adding the completion task.
### Representation Fine-tuning
To train the model in the representation fine-tuning stage, run the following command:
```bash
python -m src.main \
--config_file scripts/configs/representation_fintuning_retrieval_data_only.yaml \
--nodes 1 \
--devices 4
```
We also provide the configuration file for the representation fine-tuning with both retrieval and non-retrieval data in the `scripts/configs/representation_finetuning_all.yaml` file. We suggest using the retrieval data only first to stabilize the training process before adding the non-retrieval data.
To be concise, we suggest the following training process: reconstruction task only -> reconstruction + completion task -> retrieval data only -> retrieval + non-retrieval data.
## Evaluation
We propose a new benchmark for evaluating the model on the multilingual text embedding task. The benchmark includes 5 primary embedding tasks: Classification, Clustering, Reranking, Retrieval, and Semantic Textual Similarity (STS) across 123 diverse datasets spanning 14 languages
We support to evaluate model on various datasets by intergrating [`mteb`](https://github.com/embeddings-benchmark/mteb) library. To evaluate the model, run the following command:
```bash
python -m lusifer.eval.eval \
--model_name_or_path Hieuman/LUSIFER \
--is_lusifer \
```
## Results
We provide the results of LUSIFER on the multilingual text embedding benchmark in the following table. The results are reported in terms of the average main metric across all tasks and datasets. Please refer the paper for the full results
## Citation
If you use LUSIFER in your research, please cite the following paper:
```bibtex
@misc{man2025lusiferlanguageuniversalspace,
title={LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models},
author={Hieu Man and Nghia Trung Ngo and Viet Dac Lai and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
year={2025},
eprint={2501.00874},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.00874},
}
```
## Bugs or questions?
If you have any questions about the code, feel free to open an issue on the GitHub repository or send me an email at hieum@uoregon.edu.
|