Ervinoreo commited on
Commit
dab98a9
·
0 Parent(s):

first commit

  1. README.md +259 -0
README.md ADDED
@@ -0,0 +1,259 @@
+ # PanSea University Search
+
+ An AI-powered RAG (Retrieval-Augmented Generation) system for searching ASEAN university admission requirements, designed to help prospective students find accurate and up-to-date information about study opportunities across Southeast Asia.
+
+ ## 🎯 Problem & Solution
+
+ **Problem:** Prospective students worldwide who want to study abroad struggle to find accurate, up-to-date university admission requirements. Information is scattered across PDFs, brochures, and outdated agency websites. Many waste time applying to programs they do not qualify for because key criteria are hard to find, and many pay high agent fees.
+
+ **Solution:** A RAG-based platform built on **SEA-LION multilingual models** that ingests official admissions documents from ASEAN universities. Students can query in any supported language and receive ranked program matches with fees, entry requirements, deadlines, application windows, and source citations.
+
+ ## 🌟 Features
+
+ - 📄 **PDF Document Ingestion**: Upload official university admission documents
+ - 🔍 **Intelligent Search**: Natural language queries in multiple ASEAN languages
+ - 🎯 **Accurate Responses**: AI-powered answers with source citations
+ - 🔗 **Shareable Results**: Generate links to share query results
+ - 🌏 **Multi-language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
+ - 💰 **Advanced Filtering**: Budget range, study level, country preferences
+
+ ## 🚀 Quick Start
+
+ ### Prerequisites
+
+ - Python 3.11+
+ - SEA-LION API Key
+ - OpenAI API Key (optional, for fallback embeddings)
+
+ ### Installation
+
+ 1. **Navigate to the project directory (after cloning the repository):**
+ ```bash
+ cd pansea
+ ```
+
+ 2. **Create (if needed) and activate the virtual environment:**
+ ```bash
+ python -m venv .venv        # skip if .venv already exists
+ source .venv/bin/activate   # On Windows: .venv\Scripts\activate
+ ```
+
+ 3. **Install dependencies:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 4. **Set up environment variables:**
+ ```bash
+ cp .env.example .env
+ # Edit .env and add your SEA-LION API key (OpenAI key optional for fallback)
+ ```
+
+ 5. **Run the application:**
+ ```bash
+ streamlit run app.py
+ ```
+
+ 6. **Open your browser to:** `http://localhost:8501`
+
+ ### Usage
+
+ #### 1. Upload Documents
+ - Go to the "Upload Documents" page
+ - Enter university name and country
+ - Select document type (admission requirements, tuition fees, etc.)
+ - Upload PDF files containing university information
+ - Click "Process Documents"
+
+ #### 2. Search Universities
+ - Go to the "Search Universities" page
+ - Choose your response language
+ - Enter questions like:
+   - "Show me universities in Malaysia for master's degrees with tuition under 40,000 RMB per year"
+   - "专科毕业,无雅思,想在马来西亚读硕士,学费不超过4万人民币/年"
+   - "What are the English proficiency requirements for Singapore universities?"
+ - Apply optional filters (budget, study level, countries)
+ - Get AI-powered responses with source citations
+
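The optional filters above (budget, study level, countries) could be applied to candidate programs roughly as follows. This is a minimal sketch, not the project's code: the `Program` record, its field names, and `apply_filters` are hypothetical, and tuition is assumed to be already converted to one common currency.

```python
from dataclasses import dataclass


@dataclass
class Program:
    """Hypothetical record for one program extracted from a document."""
    university: str
    country: str
    level: str               # e.g. "bachelor", "master", "phd"
    tuition_per_year: float  # assumed to be in a single common currency


def apply_filters(programs, max_budget=None, level=None, countries=None):
    """Keep only programs that satisfy every filter that was supplied."""
    allowed = {c.lower() for c in countries} if countries else None
    matches = []
    for program in programs:
        if max_budget is not None and program.tuition_per_year > max_budget:
            continue  # over budget
        if level is not None and program.level.lower() != level.lower():
            continue  # wrong study level
        if allowed is not None and program.country.lower() not in allowed:
            continue  # country not selected
        matches.append(program)
    return matches
```

A filter left as `None` is simply skipped, so `apply_filters(programs)` returns every program unchanged.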
+ #### 3. Share Results
+ - Each query generates a unique shareable link
+ - Share results with friends, family, or education consultants
+ - Access shared results without needing to upload documents again
+
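One way such shareable links might work is to store each result as a JSON file under `query_results/` and put a random ID in the URL. The sketch below is an assumption about the mechanism, not the actual implementation: the function names, the JSON layout, and the `share` query parameter are all hypothetical.

```python
import json
import uuid
from pathlib import Path

RESULTS_DIR = Path("query_results")  # folder name taken from the project structure


def save_shared_result(query, answer, sources, results_dir=RESULTS_DIR):
    """Persist one query result as JSON and return its unique ID."""
    result_id = uuid.uuid4().hex  # 32 lowercase hex characters
    results_dir.mkdir(parents=True, exist_ok=True)
    payload = {"query": query, "answer": answer, "sources": sources}
    (results_dir / f"{result_id}.json").write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result_id


def load_shared_result(result_id, results_dir=RESULTS_DIR):
    """Return the stored result, or None if the ID is malformed or unknown."""
    is_hex = len(result_id) == 32 and all(c in "0123456789abcdef" for c in result_id)
    if not is_hex:
        return None  # reject anything that could be a path such as "../x"
    path = results_dir / f"{result_id}.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def share_url(result_id, base_url="http://localhost:8501"):
    """Build a link carrying the ID in a hypothetical `share` query parameter."""
    return f"{base_url}/?share={result_id}"
```

Validating the ID before touching the filesystem keeps a crafted link from reading files outside the results folder.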
+ ## 📁 Project Structure
+
+ ```
+ pansea/
+ ├── app.py              # Main Streamlit application
+ ├── rag_system.py       # RAG system implementation
+ ├── requirements.txt    # Python dependencies
+ ├── .env                # Environment variables
+ ├── .venv/              # Virtual environment
+ ├── chroma_db/          # Vector database storage
+ ├── documents/          # Uploaded documents storage
+ ├── query_results/      # Shared query results
+ └── README.md           # This file
+ ```
+
+ ## 🛠️ Core Components
+
+ ### DocumentIngestion Class
+ - Handles PDF text extraction using PyPDF2
+ - Creates document chunks with metadata
+ - Builds and persists ChromaDB vector store
+ - Manages document preprocessing and storage
+
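To illustrate the "chunks with metadata" step, here is a simplified fixed-size splitter with overlap. It is a sketch only: the project lists LangChain's `RecursiveCharacterTextSplitter`, which also respects separators such as paragraphs, and the `chunk_text` name, default sizes, and metadata keys here are assumptions.

```python
def chunk_text(text, chunk_size=1000, overlap=200, metadata=None):
    """Split text into overlapping character windows, each with its own metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    base = dict(metadata or {})
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            # Copy the shared metadata and record where the chunk came from.
            "metadata": {**base, "chunk_index": index, "start": start},
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, which tends to help retrieval.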
+ ### RAGSystem Class
+ - Implements retrieval-augmented generation
+ - Uses BGE-small-en-v1.5 embeddings for semantic search (with OpenAI fallback)
+ - Leverages SEA-LION models for response generation:
+   - **SEA-LION v3.5 Reasoning Model** for complex university queries
+   - **SEA-LION v3 Instruct Model** for translation and simple questions
+ - Provides multilingual query support with automatic model selection
+
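Automatic model selection could be as simple as a keyword heuristic that routes multi-criteria questions to the reasoning model and everything else to the instruct model. The sketch below is hypothetical: the hint list, the threshold, and the `"reasoning"`/`"instruct"` labels are placeholders rather than the project's actual logic or real SEA-LION model identifiers.

```python
# Hypothetical hints suggesting a multi-criteria university search.
COMPLEX_HINTS = (
    "compare", "tuition", "budget", "requirements", "scholarship",
    "universities", "programs", "学费",
)

# Placeholder labels, not real model identifiers.
MODEL_FOR = {"complex": "reasoning", "simple": "instruct"}


def classify_query(query):
    """Label a query 'complex' or 'simple' using keyword and digit hints."""
    text = query.lower()
    hits = sum(1 for hint in COMPLEX_HINTS if hint in text)
    has_number = any(ch.isdigit() for ch in text)
    # Two hints, or one hint plus a figure such as a fee cap, count as complex.
    if hits >= 2 or (hits >= 1 and has_number):
        return "complex"
    return "simple"


def select_model(query):
    """Map a query to the label of the model that should answer it."""
    return MODEL_FOR[classify_query(query)]
```

A keyword list like this only covers the languages it was written for, so a production classifier would need broader multilingual coverage.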
+ ### Streamlit UI
+ - Clean, intuitive interface
+ - Multi-page navigation
+ - File upload with progress tracking
+ - Advanced search filters
+ - Shareable query results
+
+ ## 🌏 Supported Languages
+
+ The system supports queries and responses in:
+ - **English** - Primary language
+ - **中文 (Chinese)** - For Chinese-speaking students
+ - **Bahasa Malaysia** - For Malaysian context
+ - **ไทย (Thai)** - For Thai students
+ - **Bahasa Indonesia** - For Indonesian students
+ - **Tiếng Việt (Vietnamese)** - For Vietnamese students
+ - **Filipino** - For Philippines context
+
+ ## 🎯 Target ASEAN Countries
+
+ - 🇸🇬 Singapore
+ - 🇲🇾 Malaysia
+ - 🇹🇭 Thailand
+ - 🇮🇩 Indonesia
+ - 🇵🇭 Philippines
+ - 🇻🇳 Vietnam
+ - 🇧🇳 Brunei
+ - 🇰🇭 Cambodia
+ - 🇱🇦 Laos
+ - 🇲🇲 Myanmar
+
+ ## 🔧 Configuration
+
+ ### Environment Variables (.env)
+ ```bash
+ # SEA-LION API Configuration
+ SEA_LION_API_KEY=your_sea_lion_api_key_here
+ SEA_LION_BASE_URL=https://api.sea-lion.ai/v1
+
+ # OpenAI API Configuration (optional, for fallback embeddings)
+ OPENAI_API_KEY=your_openai_api_key_here
+
+ # Application Configuration
+ APP_NAME=PanSea University Search
+ APP_VERSION=1.0.0
+ CHROMA_PERSIST_DIRECTORY=./chroma_db
+ UPLOAD_FOLDER=./documents
+ MAX_FILE_SIZE_MB=50
+ ```
+
+ ### Customization Options
+ - **Chunk Size**: Adjust text splitting in `rag_system.py`
+ - **Retrieval Count**: Modify number of retrieved documents (default: 5)
+ - **Model Selection**: Configure SEA-LION model selection logic
+ - **UI Themes**: Modify CSS in `app.py`
+ - **Query Classification**: Adjust complex vs simple query detection
+
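The retrieval count is the `k` in a top-k similarity search. In this project ChromaDB performs that search internally; the plain-Python sketch below only illustrates the idea with cosine similarity, and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]
```

A larger `k` gives the model more context to draw on but also more loosely related chunks, so the default of 5 is a trade-off worth tuning per document set.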
+ ## 📊 Example Queries
+
+ Try these sample queries to test the system and see which model handles each:
+
+ ### Complex Queries (Handled by the SEA-LION Reasoning Model)
+
+ 1. **Multi-criteria Search**: "Show me universities in Thailand and Malaysia for engineering master's programs with tuition under $15,000 per year"
+
+ 2. **Chinese Query**: "专科毕业,无雅思,想在马来西亚读硕士,学费不超过4万人民币/年"
+
+ 3. **Comparative Analysis**: "Compare MBA programs in Singapore and Indonesia with GMAT requirements and scholarship opportunities"
+
+ ### Simple Queries (Handled by the SEA-LION Instruct Model)
+
+ 4. **Translation**: "How do you say 'application deadline' in Thai and Indonesian?"
+
+ 5. **Definition**: "What is the difference between IELTS and TOEFL?"
+
+ 6. **Basic Information**: "What does GPA stand for and how is it calculated?"
+
+ ## 🔍 Technical Stack
+
+ - **Backend**: Python 3.11, LangChain
+ - **LLMs**:
+   - SEA-LION v3.5 8B Reasoning (complex queries)
+   - SEA-LION v3 9B Instruct (simple queries & translation)
+ - **Embeddings**: BGE-small-en-v1.5 (with OpenAI ada-002 fallback)
+ - **Vector Database**: ChromaDB with persistence
+ - **Frontend**: Streamlit with custom CSS
+ - **Document Processing**: PyPDF2, PyCryptodome (for encrypted PDFs), RecursiveCharacterTextSplitter
+
+ ## 📈 Roadmap
+
+ - [ ] Support for additional document formats (Word, Excel)
+ - [x] Integration with SEA-LION multilingual models
+ - [ ] Real-time web scraping of university websites
+ - [ ] Mobile-responsive design
+ - [ ] User authentication and query history
+ - [ ] Advanced analytics and insights
+ - [ ] Integration with university application systems
+ - [ ] Fine-tuning SEA-LION models on university-specific data
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
+ 3. Commit your changes (`git commit -m 'Add amazing feature'`)
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
+ 5. Open a Pull Request
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## 💡 Tips for Best Results
+
+ 1. **Upload Quality Documents**: Use official admission guides and requirements documents
+ 2. **Be Specific**: Include specific criteria in your queries (budget, location, program type)
+ 3. **Use Natural Language**: Ask questions as you would to a human counselor
+ 4. **Try Multiple Languages**: The system works well with mixed-language queries
+ 5. **Check Sources**: Always review the source documents cited in responses
+
+ ## 🆘 Troubleshooting
+
+ ### Common Issues
+
+ **"No documents found"**: Upload PDF documents on the "Upload Documents" page first
+
+ **"API Key not found"**: Add your SEA-LION API key to the `.env` file
+
+ **"No embeddings available"**: BGE embeddings are used by default. If they fail to load, add your OpenAI API key to enable the fallback embeddings
+
+ **"Import errors"**: Install dependencies using `pip install -r requirements.txt`
+
+ **"ChromaDB errors"**: Delete the `chroma_db` folder and restart the application. This removes the stored vectors, so documents need to be processed again
+
+ **"PyCryptodome is required for AES algorithm"**: This error occurs with encrypted PDFs. PyCryptodome is included in `requirements.txt`, so reinstalling dependencies should resolve it
+
+ **"Could not extract text from PDF"**: This can happen with:
+ - Password-protected PDFs (provide unprotected versions)
+ - Scanned PDFs or image-based documents (consider OCR tools)
+ - Heavily encrypted or corrupted PDF files
+
+ ## 📞 Support
+
+ For support, please create an issue on GitHub or contact the development team.
+
+ ---
+
+ **Made with ❤️ for students seeking education opportunities in ASEAN** 🎓