Henry65/RepoSim4Py


Henry65 committed on Aug 3, 2023

Commit cdbc2dc • 1 Parent(s): 038eba5

Update README.md

Files changed (1)

README.md +11 -7

@@ -3,22 +3,26 @@ license: mit
 tags:
 - code-understanding
 - unixcoder
+pipeline_tag: feature-extraction
 ---
-# RepoSim
-An approach to compare semantic similarities between Python repositories.
+# RepoSim4Py
+An embedding-approach-based tool for comparing semantic similarities between different Python repositories by using different information from repositories.
 ## Model Details
-**RepoSim** is a pipeline used to create embeddings for specified Python repositories on GitHub. For each repository, it extracts and encodes all functions' source code and docstrings into embeddings, then average them to get the mean code embeddings and the mean docstring embeddings, which can be used to perform various tasks such as cosine similarity comparison.
+**RepoSim4Py** is a pipeline based on the HuggingFace platform for generating embeddings according to specified Github Python repositories.
+For each Python repository, it generates embeddings at different levels based on the source code, code documentation, requirements, and README files within the repository.
+By taking the mean of these embeddings, a repository-level mean embedding is generated.
+These embeddings can be used to compute semantic similarities at different levels, for example, using cosine similarity to get comparison.
 ### Model Description
-The model used by **RepoSim** is **UniXcoder** fine-tuned on [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search), using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.
-- **Pipeline developed by:** [Lazyhope](https://huggingface.co/Lazyhope)
-- **Repository:** [RepoSim](https://github.com/RepoAnalysis/RepoSim)
+The model used by **RepoSim4Py** is **UniXcoder** fine-tuned on [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search), using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.
+- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65)
+- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
 - **Model type:** **code understanding**
 - **Language(s):** **Python**
 - **License:** **MIT**
@@ -82,4 +86,4 @@ Many thanks to authors of the UniXcoder model and the AdvTest dataset, as well a
 ## Authors
 - **Zihao Li** (https://github.com/lazyhope)
-- **Rosa Filgueira** (https://www.rosafilgueira.com)
+- **Rosa Filgueira** (https://www.rosafilgueira.com)
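
The README text added in this commit describes two steps: taking the mean of several embeddings to get a repository-level embedding, and comparing repositories with cosine similarity. The following is a minimal sketch of that arithmetic only, using toy vectors in place of real UniXcoder embeddings. The function names `mean_embedding` and `cosine_similarity` and the sample values are hypothetical and are not taken from the RepoSim4Py code.

```python
import math
from typing import List, Sequence


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty collection of equal-length vectors element-wise."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    # zip(*embeddings) iterates over the vectors column by column.
    return [sum(column) / count for column in zip(*embeddings)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for per-level embeddings (e.g. code, docs, README) of two repos.
repo_a_levels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
repo_b_levels = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

repo_a = mean_embedding(repo_a_levels)
repo_b = mean_embedding(repo_b_levels)
similarity = cosine_similarity(repo_a, repo_b)
```

In this toy case the two mean vectors point in the same direction, so `similarity` comes out at (approximately) 1. Real embeddings would have several hundred dimensions and would come from the model, not from hand-written lists.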