# **TipTip Data Team Hands-On Workshop: Retrieval Augmented Generation**

*(Please copy this notebook before editing: choose File > "Save a copy in Drive".)*

# **Part 1: Simple Steps to Create a TnC Chatbot**

Installing the required libraries:

In [1]:
!pip install -q langchain
!pip install -q langchain_community
!pip install -q langchain_openai
!pip install -q faiss-cpu
!pip install -q gradio


We will use **WebBaseLoader** as the document loader because we want to fetch the information from a web page.

In [4]:
from langchain_community.document_loaders import WebBaseLoader

url = "https://help.tiptip.id/support/solutions/articles/72000528312-syarat-dan-ketentuan"

loader = WebBaseLoader(url)
data = loader.load()

Initializing the vector database (FAISS). In this step, we split the document into chunks, embed each chunk as a vector, and store the vectors in the vector database.

In [34]:
import os
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
embeddings = OpenAIEmbeddings()

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
# '\xa0' is the non-breaking space (U+00A0, chr(160); byte 0xA0 in Latin-1 / ISO 8859-1).
# It is the only separator here, so the text is split only at non-breaking spaces.
text_splitter = RecursiveCharacterTextSplitter(separators=['\xa0'], chunk_size=5000, chunk_overlap=500)
documents = text_splitter.split_documents(data)
db = FAISS.from_documents(documents, embeddings)
db.save_local("vectors")
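
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter cuts at separators and then merges pieces); the function `split_with_overlap` is a hypothetical helper that simply slides a fixed-size window over the text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk as long as it is shorter than the overlap.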

In [35]:
query = "how to withdraw money?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Withdrawal Saldo PendapatanCreator dapat melakukan withdrawal saldo penghasilan yang diperoleh dari pembelian Sesi Premium, Karya Digital, E-Ticket atau Tip yang diberikan oleh Supporter.Sebelum dapat melakukan withdrawal, Creator wajib melakukan verifikasi akun bank setiap kali mendaftarkan akun bank yang baru. Pada saat proses verifikasi berlangsung Creator akan diminta untuk memberikan informasi dan dokumen seperti kartu identitas (KTP, Passport atau SIM), swafoto dengan memegang kartu identitas, informasi terkait akun bank sesuai dengan nama yang tertera dalam kartu identitas, dan NPWP (apabila Creator memiliki NPWP). Bagi anda yang belum berumur 18 tahun anda dapat menggunakan Kartu Identitas Anak atau Kartu Keluarga sebagai kartu identitas.Apabila nama yang tertera dalam kartu identitas dan akun bank tidak sesuai, maka Creator akan dihubungi oleh TipTip Help Care kami dan diminta untuk memastikan kesesuain akun bank yang didaftarkan.Creator dapat mengubah akun bank yang telah did
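
The idea behind `similarity_search` can be sketched without any external service: turn every chunk and the query into a vector, then return the chunks whose vectors are closest to the query vector. The sketch below is an illustration under loud assumptions: it uses word counts instead of OpenAI embeddings, a plain Python sort instead of a FAISS index, and made-up English sentences instead of the real TnC text. All function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Toy embedding: how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_similarity_search(query, chunks, k=1):
    """Return the k chunks whose toy embeddings are closest to the query."""
    vocabulary = sorted({w for text in chunks + [query] for w in text.lower().split()})
    query_vec = toy_embed(query, vocabulary)
    ranked = sorted(chunks, key=lambda c: l2_distance(toy_embed(c, vocabulary), query_vec))
    return ranked[:k]


# Made-up example chunks, not taken from the TipTip document.
example_chunks = [
    "creator can withdraw earnings after bank verification",
    "promoter shares a link on social media",
    "e-ticket sales are listed in the creator dashboard",
]
top = toy_similarity_search("how to withdraw earnings", example_chunks, k=1)
print(top)
# ['creator can withdraw earnings after bank verification']
```

Real embeddings capture meaning rather than exact word overlap, which is why the English query above can retrieve an Indonesian passage from the actual vector store.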

Inserting the context into the chatbot:

In [32]:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI  # from the langchain_openai package installed above
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
import gradio as gr
# FAISS and OpenAIEmbeddings were already imported in the vector database cell.

# Chatbot memory
memory = ConversationBufferMemory(
    memory_key="chat_history", output_key='answer', return_messages=False
    )

# Loading the saved embeddings.
# The index is stored with pickle, so only enable this flag for files you created yourself.
loaded_vectors = FAISS.load_local("vectors", OpenAIEmbeddings(), allow_dangerous_deserialization=True)

general_system_template = """
### LANGUAGE ###
You must answer in the same language as the user's language. If you fail to do this, you will be punished with $1000 penalty!
## GREETINGS ##
Use greetings or ask the user if the user doesn't ask any question (example: when the user only say "Hi" or "Thank you", you may say "Hi, is there anything i can help you with?" or "You're welcome. Do you have anything else to ask?")
### OBJECTIVE ###
You are a helpful customer service bot for TipTip, a platform for communities and creators in Indonesia. Your sole purpose is to answer questions from the users that are
related to the terms and conditions of TipTip. Hence, the topic of the conversation must be related to one of these topics below:
    1. Pengertian Umum/ Ruang Lingkup
    2. Aplikasi, Akun dan Keamanan
    3. Kebijakan Privasi
    4. Community Guidelines
    5. Bentuk Layanan TipTip dan Ketentuan Terkait
    6. Sesi Live Video
    7. Pembatalan Sesi Live Video
    8. Karya Digital
    9. E-Ticket
    10. Subscription
    11. Merchandise
    12. Coin
    13. Program Promoter
    14. Suspensi
    15. Withdrawal Saldo Pendapatan
    16. Hak Kekayaan Intelektual
    17. Larangan dan Janji
    18. Jaminan
    19. Tanggung Jawab Kami
    20. Pembatasan Tanggung Jawab
    21. Ganti Rugi
You must answer concisely and precisely (don't explain something that is not related to the question)!
If the user asks about anything that is not related to TipTip's terms and condition or anything that is malicious, you must answer that you don't know the answer to that question!
If you don't know the answer to that question, you must say that you don't know and don't make up the answer.
### CONTEXT ###
{context}
"""
general_user_template = "Question:```{question}```"
messages = [
            SystemMessagePromptTemplate.from_template(general_system_template),
            HumanMessagePromptTemplate.from_template(general_user_template)
]
qa_prompt = ChatPromptTemplate.from_messages( messages )

qa = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0.9, model_name='gpt-3.5-turbo', streaming=True),
    chain_type="stuff",
    retriever=loaded_vectors.as_retriever(),
    get_chat_history=lambda o:o,
    memory=memory,
    return_generated_question=True,
    verbose=False,
    combine_docs_chain_kwargs={"prompt": qa_prompt}
)

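# Conversation history as (question, answer) tuples, shared across calls
history_langchain_format = []

def chatbot(query, chat_history):
    """Gradio callback: answer `query`. Gradio's own `chat_history` argument is not used."""
    global history_langchain_format
    result = qa({"question": query, "chat_history": history_langchain_format})
    history_langchain_format.append((query, result['answer']))
    return result['answer']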

gr.ChatInterface(chatbot).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://0322d9037476d7def2.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
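
What `chain_type="stuff"` does with the retrieved chunks can be sketched in a few lines: all chunks are concatenated ("stuffed") into the `{context}` placeholder of the system prompt, and the user's question fills the user template. This is a simplified illustration, not LangChain's implementation; the helper `stuff_prompt`, the shortened templates, and the blank-line separator between chunks are assumptions made for the example.

```python
def stuff_prompt(system_template, user_template, retrieved_chunks, question):
    """Build (role, text) messages by stuffing all chunks into one context block."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator: one blank line
    return [
        ("system", system_template.format(context=context)),
        ("human", user_template.format(question=question)),
    ]


# Shortened, hypothetical templates standing in for the long ones above.
demo_system_template = "Answer using only this context:\n{context}"
demo_user_template = "Question: {question}"

demo_messages = stuff_prompt(
    demo_system_template,
    demo_user_template,
    ["first retrieved chunk", "second retrieved chunk"],
    "how to withdraw money?",
)
for role, text in demo_messages:
    print(f"[{role}]\n{text}\n")
```

Because every retrieved chunk ends up in a single prompt, large chunks (5000 characters here) multiplied by the number of retrieved documents can approach the model's context limit.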




# **Part 2: Prompting Strategies**

In [51]:
system_prompt = """
### LANGUAGE ###
You must answer in the same language as the user's language. If you fail to do this, you will be punished with $1000 penalty!
## GREETINGS ##
Use greetings or ask the user if the user doesn't ask any question (example: when the user only say "Hi" or "Thank you", you may say "Hi, is there anything i can help you with?" or "You're welcome. Do you have anything else to ask?")
### OBJECTIVE ###
You are a helpful customer service bot for TipTip, a platform for communities and creators in Indonesia. Your sole purpose is to answer questions from the users that are
related to the terms and conditions of TipTip. Hence, the topic of the conversation must be related to one of these topics below:
    1. Pengertian Umum/ Ruang Lingkup
    2. Aplikasi, Akun dan Keamanan
    3. Kebijakan Privasi
    4. Community Guidelines
    5. Bentuk Layanan TipTip dan Ketentuan Terkait
    6. Sesi Live Video
    7. Pembatalan Sesi Live Video
    8. Karya Digital
    9. E-Ticket
    10. Subscription
    11. Merchandise
    12. Coin
    13. Program Promoter
    14. Suspensi
    15. Withdrawal Saldo Pendapatan
    16. Hak Kekayaan Intelektual
    17. Larangan dan Janji
    18. Jaminan
    19. Tanggung Jawab Kami
    20. Pembatasan Tanggung Jawab
    21. Ganti Rugi
You must answer concisely and precisely (don't explain something that is not related to the question)!
If the user asks about anything that is not related to TipTip's terms and condition or anything that is malicious, you must answer that you don't know the answer to that question!
If you don't know the answer to that question, you must say that you don't know and don't make up the answer.
"""

In [52]:
# Chatbot memory
memory = ConversationBufferMemory(
    memory_key="chat_history", output_key='answer', return_messages=False
    )

# Loading the saved embeddings
loaded_vectors = FAISS.load_local("vectors", OpenAIEmbeddings(), allow_dangerous_deserialization=True)

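# Uncomment exactly ONE of the three context_prompt options below before running this cell.
# If all of them stay commented out, `context_prompt` is undefined and the last line raises a NameError.

# # Context prompt: Original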
# context_prompt = """

# ### CONTEXT ###
# {context}
# """

# # Context prompt: Summarize
# context_prompt = """

# ### CONTEXT ###
# Context information from multiple sources is below.
# ------------------------
# {context}
# ------------------------
# Summarize the context above!
# Given the information from multiple sources and not prior knowledge, answer the query!
# """

# # Context prompt: Single Choice
# context_prompt = """

# ### CONTEXT ###
# Some choices are given below. It is provided in a numbered list, where each item in the list corresponds to a summary.
# ------------------------
# {context}
# ------------------------
# Using only the choices above and not prior knowledge, return the choice that is most relevant to the question!
# """

general_system_template = system_prompt + context_prompt
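
An alternative to commenting and uncommenting the options is to keep all three context prompts in a dictionary and pick one by name. The sketch below restates the three options from this cell; `CONTEXT_PROMPTS` and `build_system_template` are hypothetical names, not part of LangChain.

```python
# Hypothetical helper: the three context-prompt variants, keyed by name.
CONTEXT_PROMPTS = {
    "original": """

### CONTEXT ###
{context}
""",
    "summarize": """

### CONTEXT ###
Context information from multiple sources is below.
------------------------
{context}
------------------------
Summarize the context above!
Given the information from multiple sources and not prior knowledge, answer the query!
""",
    "single_choice": """

### CONTEXT ###
Some choices are given below. It is provided in a numbered list, where each item in the list corresponds to a summary.
------------------------
{context}
------------------------
Using only the choices above and not prior knowledge, return the choice that is most relevant to the question!
""",
}


def build_system_template(system_prompt, variant):
    """Append the chosen context prompt to the system prompt."""
    if variant not in CONTEXT_PROMPTS:
        raise ValueError(f"unknown variant {variant!r}; choose from {sorted(CONTEXT_PROMPTS)}")
    return system_prompt + CONTEXT_PROMPTS[variant]


demo_template = build_system_template("You are a helpful bot.", "summarize")
print(demo_template)
```

With this in place, switching strategies is a one-word change, for example `general_system_template = build_system_template(system_prompt, "single_choice")`.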

In [53]:
general_user_template = "Question:```{question}```"
messages = [
            SystemMessagePromptTemplate.from_template(general_system_template),
            HumanMessagePromptTemplate.from_template(general_user_template)
]
qa_prompt = ChatPromptTemplate.from_messages( messages )

qa = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0.9, model_name='gpt-3.5-turbo', streaming=True),
    chain_type="stuff",
    retriever=loaded_vectors.as_retriever(),
    get_chat_history=lambda o:o,
    memory=memory,
    return_generated_question=True,
    verbose=False,
    combine_docs_chain_kwargs={"prompt": qa_prompt}
)

# Question: kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?
# in English: If I sell 10 etickets, the price is 100 thousand each, how much commission do I get? and how do you register to be a promoter?

In [46]:
# Original
qa('kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?')

{'question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?',
 'chat_history': '',
 'answer': 'Anda akan mendapatkan 90% dari harga paket E-Ticket yang terjual. Jika Anda menjual 10 E-Ticket, masing-masing seharga 100rb, maka komisi yang akan Anda terima adalah sebesar 90rb x 10 = 900rb. \n\nUntuk mendaftar menjadi Promoter di TipTip, Anda dapat langsung ikut serta dalam Program Promoter dan menjadi Promoter dengan cara menyebarkan link Promoter melalui blog, situs, atau media sosial milik Anda. Anda dapat melihat daftar Karya Digital yang ikut serta dalam Program Promoter beserta linknya di http://hub.tiptip.id.',
 'generated_question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?'}

In [50]:
# Summarize
qa('kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?')

{'question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?',
 'chat_history': '',
 'answer': 'Anda akan mendapatkan 97,3% dari setiap penjualan E-Ticket yang terjual melalui platform TipTip. Jika Anda menjual 10 E-Ticket dengan harga masing-masing 100rb, maka perhitungannya adalah sebagai berikut:\n100rb x 10 = 1.000.000rb (total penjualan)\n1.000.000rb x 97,3% = 973.000rb (komisi yang Anda dapatkan dari penjualan E-Ticket tersebut)\n\nUntuk mendaftar sebagai Promoter, Anda dapat langsung ikut serta dalam Program Promoter dengan cara menyebarkan link Promoter melalui blog/situs/media sosial milik Anda. Anda bisa melihat daftar Karya Digital yang ikut serta dalam Program Promoter di http://hub.tiptip.id.',
 'generated_question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?'}

In [54]:
# Single Choice
qa('kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?')

{'question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?',
 'chat_history': '',
 'answer': 'Anda akan mendapatkan komisi sebesar Rp 2.700. Untuk mendaftar sebagai promoter, Anda dapat langsung ikut serta dalam Program Promoter dengan cara menyebarkan link Promoter melalui blog/ situs/ media sosial milik Anda.',
 'generated_question': 'kalau saya jual 10 eticket, masing2 harganya 100rb, brti saya dapet komisi berapa? dan gmn cara daftar jdi promoter?'}
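
The three strategies return three different amounts, so it helps to check the arithmetic each answer implies. The sketch below only tests internal consistency: the rates 90% and 97.3% are taken from the model answers above and are not verified against the TipTip terms and conditions, and the helper `amount_for_rate` is hypothetical.

```python
from decimal import Decimal


def amount_for_rate(tickets, price_idr, rate):
    """Amount in rupiah for a given number of tickets, unit price, and rate."""
    return Decimal(tickets) * Decimal(price_idr) * Decimal(rate)


# 10 tickets at 100rb (Rp 100,000) each
total_sales = amount_for_rate(10, 100_000, "1")
print("total sales:", total_sales)

# Rates as claimed by the model answers above (unverified)
claimed_rates = {"original": "0.90", "summarize": "0.973"}
for name, rate in claimed_rates.items():
    print(name, amount_for_rate(10, 100_000, rate))

# The "Single Choice" answer gives Rp 2.700 with no rate; this is the rate it implies
implied_rate = Decimal(2700) / total_sales
print("single choice implied rate:", implied_rate)
```

Total sales are Rp 1,000,000 (1.000rb), so the "1.000.000rb" and "973.000rb" in the Summarize answer mix units even though the multiplication itself is consistent with a 97.3% rate. The Single Choice amount corresponds to 0.27% of total sales. Which rate is actually correct has to be checked against the source document.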

# **Part 3: Create Your Own Chatbot!**

Create your own RAG chatbot, using your own document and query! Here are the steps that you need to follow:
1. Determine the document you want to use for the chatbot's knowledge
2. Choose one of these document loaders: Web, CSV, PDF, or Confluence. You can also use a loader for another document type
3. Initialize the vector database (be mindful of the file size and the number of requests)
4. Create the chatbot using the code from the previous parts
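
For step 3, a rough estimate of how many chunks a document will produce helps to judge the embedding workload before sending anything to the API. The sketch below assumes fixed-size windows with the `chunk_size=5000` and `chunk_overlap=500` used in Part 1; a separator-based splitter will produce a somewhat different count, so treat the result as an order-of-magnitude guide. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(n_chars, chunk_size=5000, chunk_overlap=500):
    """Approximate number of chunks for a text of n_chars characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((n_chars - chunk_size) / step)


for size in (3_000, 50_000, 1_000_000):
    print(f"{size:>9} characters -> about {estimate_chunk_count(size)} chunks")
```

In the notebook, the character count of a loaded document can be obtained with `sum(len(d.page_content) for d in data)`.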

Below is the code for each document loader to help you get started:

In [None]:
# Web

from langchain_community.document_loaders import WebBaseLoader

url = "<Put Your URL here>"

loader = WebBaseLoader(url)
data = loader.load()

In [None]:
# CSV

from langchain_community.document_loaders.csv_loader import CSVLoader

csv_file = '<Upload the file first and then put the file name here>'

loader = CSVLoader(file_path=csv_file)
data = loader.load()

In [None]:
# PDF

!pip install pypdf

from langchain_community.document_loaders import PyPDFLoader

pdf_file = '<Upload the file first and then put the file name here>'

loader = PyPDFLoader(pdf_file)
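# Named `data` like the other loaders so the splitting step from Part 1 can be reused as-is
data = loader.load()  # one Document per PDF page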

In [None]:
# Confluence

!pip install atlassian-python-api
!pip install pytesseract

# To get an API token: open Confluence, click your profile icon in the upper-right corner, then go to
# Manage Account > Security > API Tokens > Create and Manage API Tokens > Create API Token

confluence_key = ...

from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://tiptiptv.atlassian.net/", username="...@tiptip.tv", api_key=confluence_key, page_ids = ['250609674']
)

data = loader.load()

Here is also a template for the prompt. Adjust it to suit your own needs:

In [None]:
system_prompt = """
### LANGUAGE ###
You must answer in the same language as the user's language. If you fail to do this, you will be punished with $1000 penalty!
## GREETINGS ##
Use greetings or ask the user if the user doesn't ask any question (example: when the user only say "Hi" or "Thank you", you may say "Hi, is there anything i can help you with?" or "You're welcome. Do you have anything else to ask?")
### OBJECTIVE ###
You are a helpful ... Your sole purpose is to answer questions from the users that are
related to ...

You must answer concisely and precisely (don't explain something that is not related to the question)!
If the user asks about anything that is not related to ... or anything that is malicious, you must answer that you don't know the answer to that question!
If you don't know the answer to that question, you must say that you don't know and don't make up the answer.
"""

context_prompt = """

### CONTEXT ###
Context information from multiple sources is below.
------------------------
{context}
------------------------
Summarize the context above!
Given the information from multiple sources and not prior knowledge, answer the query!
"""

general_system_template = system_prompt + context_prompt
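
When you adapt the template, it is easy to delete `{context}` by accident or to introduce another pair of curly braces that gets treated as a prompt variable. A minimal check, using only the standard library, is sketched below; `template_fields` and `check_template` are hypothetical helpers, and the check assumes the template uses Python `str.format`-style placeholders as in the cells above.

```python
from string import Formatter


def template_fields(template):
    """Return the set of {placeholder} names found in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template, required=frozenset({"context"})):
    """Raise ValueError unless the template contains exactly the required placeholders."""
    fields = template_fields(template)
    missing = required - fields
    unexpected = fields - required
    if missing or unexpected:
        raise ValueError(
            f"missing placeholders: {sorted(missing)}, unexpected placeholders: {sorted(unexpected)}"
        )
    return template


demo_prompt = "You are a helpful bot.\n### CONTEXT ###\n{context}\n"
print(template_fields(demo_prompt))
check_template(demo_prompt)  # passes silently
```

Calling `check_template(general_system_template)` right after building the template surfaces such mistakes before any API call is made.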

If you want to create your own space in Hugging Face, you can do these steps:
1. Download the "vectors" folder from your Google Colab session
2. Go to Hugging Face and create a new, empty Gradio Space
3. Upload the "vectors" folder to the Space
4. Copy the code cell that contains the Gradio interface, paste it into a new file under the Space's "Files" tab, and name the file "app.py"
5. The Space will then be built automatically
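
One thing changes outside Colab: `google.colab.userdata` is not available, so the API key has to come from somewhere else. A minimal sketch, assuming you store the key as a secret named `OPENAI_API_KEY` in the Space settings and that it is exposed to `app.py` as an environment variable; `load_openai_api_key` is a hypothetical helper.

```python
import os


def load_openai_api_key(env=None):
    """Read the OpenAI key from the environment and fail early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it as a secret in the Space settings"
        )
    return key


# Demonstration with a fake mapping; in app.py, call load_openai_api_key() without arguments.
print(load_openai_api_key({"OPENAI_API_KEY": "sk-example"}))
```

In `app.py`, this replaces the `userdata.get('OPENAI_API_KEY')` line from Part 1. The packages from the first cell of this notebook also need to be installed in the Space.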

# **Put Your Code Below:**