zjunlp
/

llama-molinst-protein-7b

+---
+license: apache-2.0
+tags:
+- chemistry
+- biology
+- protein
+- instructions
+---
+This repo contains a fully fine-tuned [LLaMA-7b](https://huggingface.co/decapoda-research/llama-7b-hf), trained on the 🧬 **protein-oriented instructions** from the 🧪 [Mol-Instructions](https://huggingface.co/datasets/zjunlp/Mol-Instructions) dataset.
+Instructions for running it can be found at https://github.com/zjunlp/Mol-Instructions.
+> Please refer to our [paper](https://arxiv.org/abs/2306.08018) for more details.
+![image.png](logo.png)
+<h3> 🧬 Tasks</h3>
+<details>
+  <summary><b>Protein design</b></summary>
+- *Design a protein sequence that exhibits the desired activity and specificity.*
+1. The presence of Mg(2+) is necessary for the protein to function in the desired environment.
+2. The AMP, (6S)-NADPHX binding site should be located in a region of the protein that is accessible to the ligand.
+3. The designed protein should have ATP binding, NADPHX epimerase activity, metal ion binding, ADP-dependent NAD(P)H-hydrate dehydratase activity to facilitate nicotinamide nucleotide metabolic process.
+4. For general function, the protein need meet that Catalyzes the epimerization of the S- and R-forms of NAD(P)HX, a damaged form of NAD(P)H that is a result of enzymatic or heat-dependent hydration
+  ```
+  MSNELVLSREQVRRVDQRAIEAYGVPGIVLMENAGRGAAEIIRAACPSAQRVLIACGPGNNGGDGFVIARHLANAGWMVELLLACPADRITGDAQGNHEIIRRMNLPCAVMADARDLEAANDRFATADVIVDALLGTGASGPPREPIASLIRAINEAHRRVSAQPAPSVFAVDIPSGLDCDTGEAANPTVRADHTITFVARKIGFRNPAARDLLGRVHVVDIGAPRAAIQDALTGKSG
+  ```
+</details>
+<details>
+  <summary><b>Catalytic activity prediction</b></summary>
+- *Please evaluate the following protein sequence and provide an explanation of the enzyme's catalytic activity, including the chemical reaction it facilitates:* MDKVAVAGFLPEELCASLSLSPSFRGNQIFQWIGKGVDSFDAMTNLSAELRASLAEKAILRSTRVSDVLKADDGTVKLQIQTEDDLAVETVLLTDKAARKTACVSCQAGCAMGCAFCKTGTLGLARNLSAAEIVEQFLYLEKHAGALDNIVFMGMGEPLLNLDALRKAIAVLTDKRGRNLSSRRITVSTVGIVSGIYDLANNGPDVRLAVSLTTADETLRRELMPASLTNPLSDLRQAISYYIEKTGKRVTLEAVLLSGKNTSEKNADSLIAFAKGLDVHVNLIPWNPVEGLSFVTPDPEETAQFVSRLEKGGLNVTLRMHRGKSISGACGQLGKTNPYA
+  ```
+  Based on the provided protein sequence, the enzyme appears to facilitate the chemical reaction: adenosine(37) in tRNA + 2 reduced [2Fe-2S]-[ferredoxin] + 2 S- adenosyl-L-methionine = 2-methyladenosine(37) in tRNA + 5'-   deoxyadenosine + L-methionine + 2 oxidized [2Fe-2S]-[ferredoxin] + S- adenosyl-L-homocysteine.
+  ```
+</details>
+<details>
+  <summary><b>Protein function prediction</b></summary>
+- *Analyze the following amino acid sequence, and determine the function of the resulting protein, its subcellular localization, and any biological processes it may be part of:*
+MNGTVNASAPSKMSEVAVERLSNDKALKVIFVLGGPGSGKGTQCAKIAKHFGFTHLSVGDLLRAEINSGSKNGTMIESMINEGKIVRSEVTIKLLQRAMHESGNDKFLIDGFPRNEENRAAFENLEKIEPEFVLFFDCPMEEMERRILNRNQGRDDDKMETIRKRFKVFIESTLPVIEFYNLKGKLYKIDACKPADEVFEDVKAIFSRFRAKEDSSQQTNICTAKRFELVMCLIKRLFREIKRMWSSFFCKAL
+  ```
+  The protein characterized by the amino acid sequence demonstrates ATP binding, cytidylate kinase activity, uridylate kinase activity and is implicated in the 'de novo' pyrimidine nucleobase biosynthetic process, phosphorylation, pyrimidine nucleotide biosynthetic process. Its subcellular localization is primarily within the cytoplasm, nucleus.
+  ```
+</details>
+<details>
+  <summary><b>Functional description generation</b></summary>
+- *Examine the given protein sequence and share a brief overview of its attributes:*
+MKIVLASNNQGKLAELKAMLAPLGVQLLRQAELGIPEAAEPFRTFVENALAKARHASALSGLPALADDAGLCVEAFGGLPGVDTAFYAVQFGYAKGDANNVKALLAQMAGITDRRAALVSTLVAVRSAEDPEPLIACGRVAGEVALEPMGSNGFGFDPVMFIPEFGQTFAQLPVEVKNANSHRGKATRQMMALMRERWIT
+  ```
+  A concise description of the protein with the specified amino acid sequence includes: Pyrophosphatase that catalyzes the hydrolysis of nucleoside triphosphates to their monophosphate derivatives, with a high preference for the non-canonical purine nucleotides XTP (xanthosine triphosphate), dITP (deoxyinosine triphosphate) and ITP. Seems to function as a house-cleaning enzyme that removes non-canonical purine nucleotides from the nucleotide pool, thus preventing their incorporation into DNA/RNA and avoiding chromosomal lesions.
+  ```
+</details>
+<details>
+  <summary><b>Domain/Motif prediction</b></summary>
+- *Given this protein sequence, can you identify any common protein motifs or domains that it may contain?*
+MANTKYIFITGGVVSSLGKGIAAASIGALLESRGLSVSLIKVDPYINVDPGTMSPFQHGEVFVTEDGTETDLDLGHYERFVRFKASKKNNFTAGKVYETVIRNERKGNYLGGTVQVIPHITNEIKKRIKKGGQNKDIAIVEVGGTVGDIESQPFVEALRQMALELPNSSWAFVHLTLVPFINASGELKTKPTQHSVKELRSLGISPDVLVCRSEQELPKDEKNKIALFCSVPAKSVISMHDVDTVYSIPILLNKQKVDDTILKKLNLKIKKPNLNDWKRVVKAKLLPEKEVNVSFVGKYTELKDSYKSINEALEHAGIQNKAKVNINFVEAEQITSQNVRKVLKKSDAILVPGGFGERGIEGMILACKYARENNVPYLGICLGMQIAIIEYARNVLKLKSANSTEFDSSTKFPVIGLITEWSDISGKKEKRTKNSDLGGTMRLGGQVCKLKKKSNSYKMYKKSEIIERHRHRYEVNPNYKDKMIEQGLDVVGTSIDGKLVEMIELPSHKWFLACQFHPEFTSNPRDGHPIFNSYIKSTITK
+  ```
+  Our predictive analysis of the given protein sequence reveals possible domains or motifs. These include: Glutamine amidotransferase, CTP synthase N-terminal domains.
+  ```
+</details>
+<h3> 📝 Demo</h3>
+As illustrated in [our repository](https://github.com/zjunlp/Mol-Instructions/tree/main/demo), we provide an example to perform generation.
+For model fine-tuned on **protein-oriented** instructions, you can conveniently recover the model weights we trained through the following command.
+Please download [llama-7b-hf](https://huggingface.co/decapoda-research/llama-7b-hf/tree/main) to obtain the pre-training weights of LLaMA-7B, refine the `--base_model` to point towards the location where the model weights are saved.
+Then replace `$DIFF_WEIGHT_PATH` with the path of our provided [diff weights](https://huggingface.co/zjunlp/llama-molinst-protein-7b), and replace `$RECOVER_WEIGHT_PATH` with the desired path to save the recovered weights. If the directory of recovered weights lacks required files (e.g., tokenizer configuration files), you can copy from `$DIFF_WEIGHT_PATH`.
+```shell
+python weight_diff.py recover \
+  --path_raw $BASE_MODEL_PATH \
+  --path_diff $DIFF_WEIGHT_PATH \
+  --path_tuned $RECOVER_WEIGHT_PATH
+```
+After that, you can execute the following command to generate outputs with the fine-tuned LLaMA model.
+```shell
+>> python generate.py \
+    --CLI True \
+    --protein True \
+    --base_model $RECOVER_WEIGHT_PATH \
+```
+<h3> 🚨 Limitations</h3>
+The current state of the model, obtained via instruction tuning, is a preliminary demonstration. Its capacity to handle real-world, production-grade tasks remains limited.
+<h3> 📚 References</h3>
+If you use our repository, please cite the following related paper:
+```
+@article{molinst,
+  title={Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models},
+  author={Fang, Yin and Liang, Xiaozhuan and Zhang, Ningyu and Liu, Kangwei and Huang, Rui and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
+  journal={arXiv preprint arXiv:2306.08018},
+  year={2023}
+}
+```
+<h3> 🫱🏻‍🫲 Acknowledgements</h3>
+We appreciate [LLaMA](https://github.com/facebookresearch/llama), [Huggingface Transformers Llama](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama), [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [Alpaca-LoRA](https://github.com/tloen/alpaca-lora), [Chatbot Service](https://github.com/deep-diver/LLM-As-Chatbot) and many other related works for their open-source contributions.