File size: 7,981 Bytes
ac67bb3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
## PRAM: Place Recognition Anywhere Model for Efficient Visual Localization

<p align="center">
  <img src="assets/overview.png" width="960">
</p>

Humans localize themselves efficiently in known environments by first recognizing landmarks defined on certain objects
and their spatial relationships, and then verifying the location by aligning detailed structures of recognized objects
with those in the memory. Inspired by this, we propose the place recognition anywhere model (PRAM) to perform visual
localization as efficiently as humans do. PRAM consists of two main components - recognition and registration. In
detail, first of all, a self-supervised map-centric landmark definition strategy is adopted, making places in either
indoor or outdoor scenes act as unique landmarks. Then, sparse keypoints extracted from images, are utilized as the
input to a transformer-based deep neural network for landmark recognition; these keypoints enable PRAM to recognize
hundreds of landmarks with high time and memory efficiency. Keypoints along with recognized landmark labels are further
used for registration between query images and the 3D landmark map. Different from previous hierarchical methods, PRAM
discards global and local descriptors, and reduces over 90% storage. Since PRAM utilizes recognition and landmark-wise
verification to replace global reference search and exhaustive matching respectively, it runs 2.4 times faster than
prior state-of-the-art approaches. Moreover, PRAM opens new directions for visual localization including multi-modality
localization, map-centric feature learning, and hierarchical scene coordinate regression.

* Full paper
  PDF: [Place Recognition Anywhere Model for Efficient Visual Localization](https://arxiv.org/pdf/2404.07785.pdf).

* Authors: *Fei Xue, Ignas Budvytis, Roberto Cipolla*

* Website: [PRAM](https://feixue94.github.io/pram-project) for videos, slides, recent updates, and datasets.

## Key Features

### 1. Self-supervised landmark definition on 3D space

- No need of segmentations on images
- No inconsistent semantic results from multi-view images
- No limitation to labels of only known objects
- Work in any places with known or unknown objects
- Landmark-wise 3D map sparsification

<p align="center">
  <img src="assets/map_sparsification.gif" width="640">
</p>

### 2. Efficient landmark-wise coarse and fine localization

- Recognize landmarks as opposed to do global retrieval
- Local landmark-wise matching as opposed to exhaustive matching
- No global descriptors (e.g. NetVLAD)
- No reference images and their heavy repetative 2D keypoints and descriptors
- Automatic inlier/outlier idetification

<p align="center">
  <img src="assets/pipeline1.png" width="640">
</p>

### 4. Sparse recognition

- Sparse SFD2 keypoints as tokens
- No uncertainties of points at boundaries
- Flexible to accept multi-modality inputs

### 5. Relocalization and temporal localization

- Per frame reclocalization from scratch
- Tracking previous frames for higher efficiency

### 6. One model one dataset

- All 7 subscenes in 7Scenes dataset share a model
- All 12 subscenes in 12Scenes dataset share a model
- All 5 subscenes in CambridgeLandmarks share a model

### 7. Robust to long-term changes

<p align="center">
  <img src="assets/pram_demo.gif" width="640">
</p>

## Open problems

- Adaptive number landmarks determination
- Using SAM + open vocabulary to generate semantic map
- Multi-modality localization with other tokenized signals (e.g. text, language, GPS, Magonemeter)
- More effective solutions to 3D sparsification

## Preparation

1. Download the 7Scenes, 12Scenes, CambridgeLandmarks, and Aachen datasets (remove redundant depth images otherwise they
   will be found in the sfm process)
2. Environments

2.1 Create a virtual environment

```
conda env create -f environment.yml
(do not activate pram before pangolin is installed)
```

2.2 Compile Pangolin for the installed python

```
git clone --recursive https://github.com/stevenlovegrove/Pangolin.git
cd Pangolin
git checkout v0.8

# Install dependencies
./scripts/install_prerequisites.sh recommended

# Compile with your python
cmake -DPython_EXECUTABLE=/your path to/anaconda3/envs/pram/bin/python3  -B build
cmake --build build -t pypangolin_pip_install

conda activate pram
```

## Run the localization with online visualization

1. Download the [3D-models](https://drive.google.com/drive/folders/1DUB073KxAjsc8lxhMpFuxPRf0ZBQS6NS?usp=drive_link),
   pretrained [models](https://drive.google.com/drive/folders/1E2QvujCevqnyg_CM9FGAa0AxKkt4KbLD?usp=drive_link) ,
   and [landmarks](https://drive.google.com/drive/folders/1r9src9bz7k3WYGfaPmKJ9gqxuvdfxZU0?usp=sharing)
2. Put pretrained models in ```weights``` directory
3. Run the demo (e.g. 7Scenes)

```
python3 inference.py  --config configs/config_train_7scenes_sfd2.yaml --rec_weight_path weights/7scenes_nc113_birch_segnetvit.199.pth  --landmark_path /your path to/landmarks --online
```

## Train the recognition model (e.g. for 7Scenes)

### 1. Do SfM with SFD2 including feature extraction (modify the dataset_dir, ref_sfm_dir, output_dir)

```
./sfm_scripts/reconstruct_7scenes.sh
```

This step will produce the SfM results together with the extracted keypoints

### 2. Generate 3D landmarks

```
python3 -m recognition.recmap --dataset 7Scenes --dataset_dir /your path to/7Scenes --sfm_dir /sfm_path/7Scenes --save_dir /save_path/landmakrs
```

This step will generate 3D landmarks, create virtual reference frame, and sparsify the 3D points for each landmark for
all scenes in 7Scenes

### 3. Train the sparse recognition model (one model one dataset)

```
python3 train.py   --config configs/config_train_7scenes_sfd2.yaml
```

Remember to modify the paths in 'config_train_7scenes_sfd2.yaml'

## Your own dataset

1. Run colmap or hloc to obtain the SfM results
2. Do reconstruction with SFD2 keypoints with the sfm from step as refernece sfm
3. Do 3D landmark generation, VRF, map sparsification etc (Add DatasetName.yaml to configs/datasets)
4. Train the recognition model
5. Do evaluation

## Previous works can be found here

1. [Efficient large-scale localization by landmark recognition, CVPR 2022](https://github.com/feixue94/lbr)
2. [IMP: Iterative Matching and Pose Estimation with Adaptive Pooling, CVPR 2023](https://github.com/feixue94/imp-release)
3. [SFD2: Semantic-guided Feature Detection and Description, CVPR 2023](https://github.com/feixue94/sfd2)
4. [VRS-NeRF: Visual Relocalization with Sparse Neural Radiance Field, under review](https://github.com/feixue94/vrs-nerf)

## BibTeX Citation

If you use any ideas from the paper or code in this repo, please consider citing:

```
 @article{xue2024pram,
          author    = {Fei Xue and Ignas Budvytis and Roberto Cipolla},
          title     = {PRAM: Place Recognition Anywhere Model for Efficient Visual Localization},
          journal   = {arXiv preprint arXiv:2404.07785},
          year      = {2024}
 }

@inproceedings{xue2023sfd2,
  author    = {Fei Xue and Ignas Budvytis and Roberto Cipolla},
  title     = {SFD2: Semantic-guided Feature Detection and Description},
  booktitle = {CVPR},
  year      = {2023}
}

@inproceedings{xue2022imp,
  author    = {Fei Xue and Ignas Budvytis and Roberto Cipolla},
  title     = {IMP: Iterative Matching and Pose Estimation with Adaptive Pooling},
  booktitle = {CVPR},
  year      = {2023}
}

@inproceedings{xue2022efficient,
  author    = {Fei Xue and Ignas Budvytis and Daniel Olmeda Reino and Roberto Cipolla},
  title     = {Efficient Large-scale Localization by Global Instance Recognition},
  booktitle = {CVPR},
  year      = {2022}
}
```

## Acknowledgements

Part of the code is from previous excellent works
including , [SuperGlue](https://github.com/magicleap/SuperGluePretrainedNetwork)
and [hloc](https://github.com/cvg/Hierarchical-Localization). You can find more details from their released
repositories if you are interested in their works.