---
license: mit
tags:
- code-understanding
- unixcoder
pipeline_tag: feature-extraction
---

# RepoSim4Py

An embedding-based tool for comparing the semantic similarity of Python repositories, using several kinds of information drawn from each repository.

## Model Details

**RepoSim4Py** is a pipeline hosted on the HuggingFace platform that generates embeddings for specified GitHub Python repositories.
For each Python repository, it generates embeddings at different levels based on the source code, code documentation, requirements, and README files within the repository.
Averaging the embeddings at each level gives a per-level mean embedding, and these are combined into a repository-level embedding.
These embeddings can be used to compute semantic similarity at different levels, for example by comparing two repositories with cosine similarity.
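
As a minimal sketch of the cosine similarity comparison mentioned above, the snippet below compares two vectors with NumPy. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the pipeline; real inputs would be two embeddings of the same size produced by RepoSim4Py.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings of the same size.

    Hypothetical helper for illustration; it is not part of RepoSim4Py.
    """
    a = np.asarray(a, dtype=np.float64).ravel()  # accepts shape (1, d) or (d,)
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # An all-zero embedding has no direction; returning 0.0 here is a
        # convention chosen for this sketch.
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy vectors standing in for real embeddings
u = np.array([[1.0, 0.0, 1.0]])
v = np.array([[1.0, 1.0, 0.0]])
score = cosine_similarity(u, v)
print(round(score, 3))
```

A score close to 1 indicates embeddings pointing in nearly the same direction, while a score near 0 indicates little similarity.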

### Model Description

The model used by **RepoSim4Py** is **UniXcoder** fine-tuned on the [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search), using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.

- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65)
- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
- **Model type:** **code understanding**
- **Language(s):** **Python**
- **License:** **MIT**

### Model Sources

- **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf)

## Uses

Below is an example of how to use the RepoSim4Py pipeline to easily generate embeddings for GitHub Python repositories.

First, initialise the pipeline:
```python
from transformers import pipeline

model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True)
```
Then pass one repository (or several repositories in a tuple) as input; the result is a list of dictionaries, one per repository:
```python
repo_infos = model("lazyhope/python-hello-world")
print(repo_infos)
```
Output (long NumPy arrays are truncated):
```python
[{'name': 'lazyhope/python-hello-world',
  'topics': [],
  'license': 'MIT',
  'stars': 0,
  'code_embeddings': array([[-2.07551336e+00,  2.81387949e+00,  2.35216689e+00, ...]], dtype=float32),
  'mean_code_embedding': array([[-2.07551336e+00,  2.81387949e+00,  2.35216689e+00, ...]], dtype=float32),
  'doc_embeddings': array([[-2.37494540e+00,  5.40957630e-01,  2.29580235e+00, ...]], dtype=float32),
  'mean_doc_embedding': array([[-2.37494540e+00,  5.40957630e-01,  2.29580235e+00, ...]], dtype=float32),
  'requirement_embeddings': array([[0., 0., 0., ...]], dtype=float32),
  'mean_requirement_embedding': array([[0., 0., 0., ...]], dtype=float32),
  'readme_embeddings': array([[-2.1671042 ,  2.8404987 ,  1.4761417 , ...]], dtype=float32),
  'mean_readme_embedding': array([[-1.91171765e+00,  1.65386486e+00,  9.49612021e-01, ...]], dtype=float32),
  'mean_repo_embedding': array([[-2.0755134,  2.8138795,  2.352167 , ...]], dtype=float32),
  'code_embeddings_shape': (1, 768),
  'mean_code_embedding_shape': (1, 768),
  'doc_embeddings_shape': (1, 768),
  'mean_doc_embedding_shape': (1, 768),
  'requirement_embeddings_shape': (1, 768),
  'mean_requirement_embedding_shape': (1, 768),
  'readme_embeddings_shape': (3, 768),
  'mean_readme_embedding_shape': (1, 768),
  'mean_repo_embedding_shape': (1, 3072)
}]
```
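
The shapes in the output above are consistent with each per-level mean being an average over rows (for example, `(3, 768)` README embeddings giving a `(1, 768)` mean) and with the 3072-dimensional repository embedding being the four 768-dimensional means placed side by side. The sketch below reproduces those shapes with random stand-in data. The concatenation and its ordering are assumptions inferred from the shapes, not a confirmed description of the pipeline's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768

# Random stand-ins with the same shapes as the example output
readme_embeddings = rng.normal(size=(3, dim)).astype(np.float32)
mean_code = rng.normal(size=(1, dim)).astype(np.float32)
mean_doc = rng.normal(size=(1, dim)).astype(np.float32)
mean_requirement = np.zeros((1, dim), dtype=np.float32)

# Average over rows, keeping a leading axis of length 1
mean_readme = readme_embeddings.mean(axis=0, keepdims=True)

# Assumption: the repository-level vector is the four means side by side
mean_repo = np.concatenate(
    [mean_code, mean_doc, mean_requirement, mean_readme], axis=1
)

print(mean_readme.shape)  # (1, 768)
print(mean_repo.shape)    # (1, 3072)
```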
For more details, please refer to [Example.py](https://github.com/RepoMining/RepoSim4Py/blob/main/Script/Example.py). Note that "github_token" is not required.
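
When several repositories are passed in, their embeddings can be compared pairwise. The following is a rough sketch, assuming a list of result dictionaries shaped like the output above. `pairwise_cosine` is a hypothetical helper, and `fake_results` uses random vectors and made-up repository names in place of real pipeline output so that the snippet runs offline.

```python
import numpy as np


def pairwise_cosine(results, key="mean_repo_embedding"):
    """Return repository names and an (n, n) cosine-similarity matrix.

    `results` is expected to look like the pipeline output: a list of
    dicts, each with a 'name' and an embedding stored under `key`.
    """
    names = [r["name"] for r in results]
    mat = np.vstack(
        [np.asarray(r[key], dtype=np.float64).reshape(1, -1) for r in results]
    )
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows yield similarity 0, not NaN
    unit = mat / norms
    return names, unit @ unit.T


# Random stand-ins for real pipeline output (hypothetical repository names)
rng = np.random.default_rng(42)
fake_results = [
    {"name": f"owner/repo-{i}", "mean_repo_embedding": rng.normal(size=(1, 3072))}
    for i in range(3)
]

names, sim = pairwise_cosine(fake_results)
print(names)
print(np.round(sim, 3))
```

Passing a different `key`, such as `"mean_code_embedding"` or `"mean_readme_embedding"`, compares repositories at that level instead. The zero-norm guard matters because a level with no content (the requirement embeddings in the example output are all zeros) would otherwise cause a division by zero.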

## Training Details

Please refer to the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details on fine-tuning the model on the code search task.

## Evaluation

We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 400 Python repositories categorized by topic, to label similar repositories.
The evaluation metrics and results can be found in the RepoSim4Py repository, under the [Embedding](https://github.com/RepoMining/RepoSim4Py/tree/main/Embedding) folder. 
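
One way to turn a topic-categorized list into similarity labels is to mark two repositories as similar when they share a topic. The sketch below illustrates that idea only. The `label_pairs` helper, the repository names, and the same-topic rule are assumptions for illustration, and the actual labelling procedure and metrics are those documented in the Embedding folder.

```python
from itertools import combinations


def label_pairs(repo_topics):
    """Label every unordered repository pair: 1 if same topic, else 0.

    `repo_topics` maps a repository name to a single topic string.
    Hypothetical helper; not taken from the RepoSim4Py code base.
    """
    pairs = []
    for (repo_a, topic_a), (repo_b, topic_b) in combinations(
        sorted(repo_topics.items()), 2
    ):
        pairs.append((repo_a, repo_b, int(topic_a == topic_b)))
    return pairs


# Made-up repositories and topics standing in for the awesome-python list
toy_topics = {
    "owner/web-a": "Web Frameworks",
    "owner/web-b": "Web Frameworks",
    "owner/orm-a": "ORM",
}

pairs = label_pairs(toy_topics)
for repo_a, repo_b, label in pairs:
    print(repo_a, repo_b, label)
```

Labels of this kind can then be compared against embedding similarity scores for the same pairs.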

## Acknowledgements 
Many thanks to the authors of the UniXcoder model and the AdvTest dataset, as well as the awesome-python list, for providing a useful baseline.
- **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv)
- **awesome-python** (https://github.com/vinta/awesome-python)

## Authors
- **Honglin Zhang** (https://github.com/liaomu0926)
- **Rosa Filgueira** (https://www.rosafilgueira.com)