Spaces:
Running
Running
title: Search Mathlib | |
emoji: π₯ | |
colorFrom: yellow | |
colorTo: pink | |
sdk: docker | |
pinned: false | |
license: mit | |
Hi. I made a webpage that searches for Mathlib theorems. | |
https://huggingface.co/spaces/dx2102/search-mathlib | |
You can find the code and data I used here. I think the code is quite short. There is only one Lean file (about 100 lines) and two Python files (about 100 lines each). | |
https://huggingface.co/spaces/dx2102/search-mathlib/tree/main | |
Lean is a difficult language. I used this project as an exercise for my learning of Lean. | |
I originally only knew that Lean had a searcher based on symbol matching from Loogle (https://loogle.lean-lang.org). It was not until I finished making this webpage that I realized there were similar natural language-based searchers like https://moogle.ai and https://leansearch.net/ that was also announced today. | |
I think the principle of my webpage should be similar to moogle and leansearch. I used the text embedding model in natural language processing. These neural networks convert any text into a vector (e.g. a 1024-dimensional vector, represented by 1024 floating point numbers). This high-dimensional vector represents the "embedding" of the text in R^1024. Any two semantically close texts should correspond to similar vectors, as this is one of the training objectives of the embedding model. In other words, the angle or distance between the vectors should be small, where the angle is calculated by dot product. Therefore, this technology can now be used to help search (and other tasks, such as text classification, entity recognition, and helping chatbots). | |
I found a text embedding API service provided by VoyageAI. They are ranked high on the MTEB leaderboard, and I can use their server GPU instead of preparing one myself. After embedding Mathlib's theorem, I still have 14 million tokens (about 56 million English letters) of free credits left. I think this is enough to run this web page for now. | |
The text seen by the embedded model is like this: | |
` | |
-- import Mathlib.Tactic.Use | |
def Mathlib.Tactic.applyTheConstructor | |
: Lean.MVarId β Lean.MetaM (List Lean.MVarId Γ List Lean.MVarId Γ List Lean.MVarId) | |
/- When the goal `mvarId` is an inductive datatype with a single constructor, this applies that constructor, then returns metavariables for the non-parameter explicit arguments along with metavariables for the parameters and implicit arguments. The first list of returned metavariables correspond to the arguments that β¨x,y,...γ notation uses. The second list corresponds to everything else: the parameters and implicit arguments. -/ | |
` | |
Finally, I will briefly explain my 3 code files. Hopefully this will help others who want to do something similar. | |
extract.lean: This is a code I modified from lean-training-data (https://github.com/semorrison/lean-training-data). I also recommend lean-training-data to others who want to extract information from Lean. lean-dojo seems to be another option. The main content of the code is to import Mathlib through `CoreM.withImportModules #[\`Mathlib] do ...`. Then through `let env β getEnv`, view and print the information in the `Lean.Environment` data structure, including theorem name, type, document string, etc. This information is stored in theorems.txt. (Exporting takes about 15 minutes, and the progress will be displayed). | |
embed.ipynb: This code will call VoyageAI's API to convert each item in theorems.txt into a 1024-dimensional vector and store it in embed_arr.npy. | |
server.py: This code will mount a simple web page in the huggingface space server. It will also forward the user's search terms to VoyageAI, convert the search terms into vectors, and then match similar ones from the vectors defined by the existing theorem. |