# CoCoSoDa: Effective Contrastive Learning for Code Search

Our approach adopts a pre-trained model as the base code/query encoder and optimizes it with multimodal contrastive learning and soft data augmentation.

![1](Figure/CoCoSoDa.png)

CoCoSoDa consists of the following four components:
* **Pre-trained code/query encoder** captures the semantic information of a code snippet or a natural language query and maps it into a high-dimensional embedding space. A pre-trained model serves as the code/query encoder.
* **Momentum code/query encoder** encodes the samples (code snippets or queries) of the current and previous mini-batches to enrich the negative samples.

* **Soft data augmentation** dynamically masks or replaces some tokens in a sample (code/query) to generate a similar sample as a form of data augmentation (a sketch follows this list).

* **Multimodal contrastive learning loss function** is used as the optimization objective. It consists of inter-modal and intra-modal contrastive learning losses, which pull the representations of similar samples together and push apart those of dissimilar samples in the embedding space (also sketched below).
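
As an illustration of the soft data augmentation step, the sketch below dynamically masks or replaces tokens. The function name, probabilities, and token-selection rules here are illustrative assumptions, not the repository's exact implementation:

```python
import random
import torch

def soft_augment(input_ids, mask_token_id, vocab_size, special_ids,
                 p_mask=0.10, p_replace=0.05):
    """Hypothetical sketch: dynamically mask or replace tokens so the
    augmented sequence stays semantically close to the original."""
    augmented = input_ids.clone()
    for i, tok in enumerate(input_ids.tolist()):
        if tok in special_ids:               # keep [CLS]/[SEP]/padding intact
            continue
        r = random.random()
        if r < p_mask:                       # dynamic masking
            augmented[i] = mask_token_id
        elif r < p_mask + p_replace:         # dynamic replacement
            augmented[i] = random.randrange(vocab_size)
    return augmented
```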



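Similarly, the momentum update and the multimodal contrastive objective can be sketched as follows, assuming L2-normalized embeddings and a temperature hyperparameter `tau`. The queue of negatives from previous mini-batches is omitted for brevity, and the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.999):
    """MoCo-style exponential moving average of the encoder weights."""
    for p, mp in zip(encoder.parameters(), momentum_encoder.parameters()):
        mp.data = m * mp.data + (1.0 - m) * p.data

def info_nce(queries, keys, tau=0.05):
    """InfoNCE loss where the i-th query matches the i-th key."""
    logits = queries @ keys.t() / tau        # (B, B) similarity matrix
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)

def multimodal_loss(code, query, code_aug, query_aug, tau=0.05):
    """Inter-modal (code <-> query) plus intra-modal (original <-> augmented)."""
    code, query, code_aug, query_aug = (
        F.normalize(t, dim=-1) for t in (code, query, code_aug, query_aug))
    inter = info_nce(query, code, tau) + info_nce(code, query, tau)
    intra = info_nce(code, code_aug, tau) + info_nce(query, query_aug, tau)
    return inter + intra
```
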
## Source code 
### Environment
```
conda create -n CoCoSoDa python=3.6 -y
conda activate CoCoSoDa
pip install torch==1.10 transformers==4.12.5 seaborn==0.11.2 fast-histogram nltk==3.6.5 networkx==2.5.1 tree_sitter tqdm prettytable gdown more-itertools tensorboardX scikit-learn
```
### Data

```
cd dataset
bash get_data.sh 
```

Data statistics are shown in the table below.

| PL         | Training | Validation  |  Test  | Candidate Codes|
| :--------- | :------: | :----: | :----: |:----: |
| Ruby       |  24,927  | 1,400  | 1,261  |4,360|
| JavaScript |  58,025  | 3,885  | 3,291  |13,981|
| Java       | 164,923  | 5,183  | 10,955 |40,347|
| Go         | 167,288  | 7,325  | 8,122  |28,120|
| PHP        | 241,241  | 12,982 | 14,014 |52,660|
| Python     | 251,820  | 13,914 | 14,918 |43,827|

Downloading and preprocessing the data takes about 10 minutes.

### Training and Evaluation

We have uploaded the pre-trained model to [Hugging Face](https://huggingface.co/). You can directly download [DeepSoftwareAnalytics/CoCoSoDa](https://huggingface.co/DeepSoftwareAnalytics/CoCoSoDa) and fine-tune it.
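For example, a minimal loading sketch (that the checkpoint works with the generic `transformers` auto classes is an assumption; the provided scripts handle loading for you):

```python
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained CoCoSoDa checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa")
model = AutoModel.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa")
```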
#### Pre-training (Optional)

```
lang=ruby
bash run_cocosoda.sh $lang 
```
The optimized model is saved in `./saved_models/cocosoda/`. You can upload it to [Hugging Face](https://huggingface.co/).

Pre-training takes about 3 days.

#### Fine-tuning


```
lang=java
bash run_fine_tune.sh $lang 
```
#### Zero-shot evaluation

```
lang=python
bash run_zero-shot.sh $lang 
```
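
Conceptually, zero-shot code search embeds the query and every candidate code snippet with the pre-trained model and ranks candidates by similarity. The sketch below uses masked mean pooling and cosine similarity as illustrative assumptions; the script above implements the exact procedure:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa")
model = AutoModel.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa").eval()

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)        # (B, L, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)          # masked mean pooling
    return F.normalize(emb, dim=-1)

query = embed(["read a file line by line"])
codes = embed(["def read_lines(p):\n    return open(p).readlines()",
               "def add(a, b):\n    return a + b"])
scores = (query @ codes.t()).squeeze(0)                 # cosine similarities
print(scores.argsort(descending=True))                  # ranked candidates
```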


### Results

#### Model Performance Evaluated with MRR

| Model          |   Ruby    | JavaScript |    Go     |  Python   |   Java    |    PHP    |   Avg.    |
| -------------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| CoCoSoDa | **0.818**| **0.764**| **0.921** |**0.757**| **0.763**| **0.703** |**0.788**|
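
For reference, MRR (Mean Reciprocal Rank) averages the reciprocal of the rank at which the correct code snippet is retrieved for each query, so ranking the correct snippet 2nd contributes 0.5. A minimal computation:

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based rank of the correct snippet for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 = 0.583...
```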

## Appendix

The description of baselines, additional experimental results, and further discussion are provided in `Appendix/Appendix.pdf`.


## Contact
Feel free to contact Ensheng Shi (enshengshi@qq.com) if you have any further questions, or if a GitHub issue receives no response for more than a day.