Om Surve committed on
Commit dc3eeb1 · unverified · 1 Parent(s): ce5b6dd

Research docs

Files changed (1)
  1. docs.md +80 -0
docs.md ADDED
@@ -0,0 +1,80 @@
+ ### 1. Data Input:
+
+ - **Input Data:** Collect a diverse dataset of academic papers, articles, or textual content from various sources.
+ - **Format:** Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library.
+
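A minimal sketch of the input step, assuming the papers have already been exported as UTF-8 `.txt` files in one folder. The helper name `load_corpus` and the folder layout are illustrative assumptions, not part of the original notes.

```python
from pathlib import Path


def load_corpus(folder: str, pattern: str = "*.txt") -> dict[str, str]:
    """Read every matching file under `folder` into a {filename: text} mapping."""
    corpus = {}
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" keeps loading even if a file has stray non-UTF-8 bytes.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```

Keeping the filename as the key makes it easy to trace a similarity score back to its source document later on.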
+ ### 2. Data Cleaning:
+
+ - **Text Cleaning:**
+   - Remove metadata, formatting, and irrelevant details.
+   - Strip or normalize special characters and punctuation, and remove stopwords.
+
+ - **Normalization:**
+   - Convert text to lowercase to ensure uniformity.
+
+ - **Tokenization:**
+   - Tokenize the text into words or subword tokens.
+   - **Libraries:**
+     - For Python, you can use NLTK or spaCy for tokenization.
+
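The cleaning steps above can be sketched with the standard library alone. This is a stand-in for NLTK or spaCy, not a replacement: the regex tokenizer and the tiny `STOPWORDS` set are assumptions made for illustration, and the pattern only keeps ASCII letters and digits.

```python
import re

# Tiny illustrative stopword list (an assumption); NLTK and spaCy ship fuller ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def clean_and_tokenize(text: str) -> list[str]:
    """Lowercase, replace non-alphanumeric characters, tokenize, drop stopwords."""
    text = text.lower()                       # normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and special characters become spaces
    tokens = text.split()                     # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `clean_and_tokenize("The Quick, brown fox: an Example!")` yields `["quick", "brown", "fox", "example"]`.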
+ ### 3. Embedding Generation:
+
+ - **Word Level Embeddings:**
+   - Utilize pre-trained word embeddings like Word2Vec or GloVe.
+   - **Libraries:**
+     - For Word2Vec: Gensim.
+     - For GloVe: spaCy or Gensim.
+
+ - **Paragraph Level Embeddings:**
+   - Aggregate word embeddings by averaging them, or train Doc2Vec to embed paragraphs directly.
+   - **Libraries:**
+     - Gensim for Doc2Vec.
+
+ - **Document Level Embeddings:**
+   - Consider using the average of paragraph embeddings or more advanced models.
+   - **Libraries:**
+     - spaCy or transformers library for more advanced models.
+
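The averaging route from word to paragraph to document embeddings can be sketched as follows. The `word_vectors` argument is assumed to be any mapping from token to vector; a plain dict is used here, and pre-trained vectors loaded through Gensim would normally take its place. All function names are hypothetical.

```python
def average_vectors(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_tokens(tokens, word_vectors):
    """Paragraph embedding: mean of the vectors of all known tokens, or None."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return average_vectors(known) if known else None


def embed_document(paragraphs, word_vectors):
    """Document embedding: mean of the paragraph embeddings, or None."""
    para_vecs = []
    for tokens in paragraphs:
        vec = embed_tokens(tokens, word_vectors)
        if vec is not None:  # skip paragraphs with no known tokens
            para_vecs.append(vec)
    return average_vectors(para_vecs) if para_vecs else None
```

Averaging discards word order, which is one reason the notes also list Doc2Vec and transformer-based models as alternatives.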
+ ### 4. Pairwise Comparison:
+
+ - **Similarity Measures:**
+   - Calculate cosine similarity between embeddings, Jaccard similarity between token sets, or other relevant measures.
+   - **Libraries:**
+     - scikit-learn for cosine similarity.
+
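Both measures are short enough to write out by hand, which helps make clear what each one compares. This is a plain-Python sketch for a single pair; scikit-learn's `sklearn.metrics.pairwise.cosine_similarity` is the usual choice for a whole matrix of documents.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

Cosine similarity works on embeddings and ignores vector length, while Jaccard works directly on tokens and needs no embedding model at all.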
+ ### 5. Clustering:
+
+ - **K-Means Clustering:**
+   - Partition documents into K clusters.
+   - **Libraries:**
+     - scikit-learn for K-Means.
+
+ - **Hierarchical Clustering:**
+   - Build a hierarchy of clusters.
+   - **Libraries:**
+     - scipy.cluster.hierarchy for hierarchical clustering.
+
+ - **DBSCAN:**
+   - Density-based clustering; the number of clusters is not fixed in advance, and outliers are labelled as noise.
+   - **Libraries:**
+     - scikit-learn for DBSCAN.
+
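A small side-by-side sketch of the three clustering options, assuming NumPy, SciPy, and scikit-learn are installed. The four 2-D points are made-up stand-ins for document embeddings, and the parameter values (`n_clusters=2`, `eps=0.3`, `min_samples=2`) are chosen only to suit this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN, KMeans

# Toy "document embeddings" (invented for illustration): two obvious groups.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# K-Means: the number of clusters K must be chosen up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: build the full merge tree, then cut it into at most 2 clusters.
Z = linkage(X, method="average", metric="euclidean")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

# DBSCAN: clusters are dense regions; points in no dense region get label -1.
dbscan_labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
```

The actual label numbers are arbitrary in all three methods; what matters is which documents share a label.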
+ ### 6. Scoring System:
+
+ - **Threshold Setting:**
+   - Establish a threshold for similarity scores to classify documents.
+   - Determine the threshold through experimentation.
+
+ - **Scoring Logic:**
+   - Develop a scoring system based on the results of pairwise comparison and clustering.
+   - Decide on the scoring weights for each component.
+
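One possible shape for the scoring logic is a weighted sum of the pairwise similarity and a cluster co-membership flag, compared against a threshold. The weights (0.7 and 0.3) and the threshold (0.6) below are placeholder assumptions; the notes leave both to experimentation.

```python
def combined_score(similarity: float, same_cluster: bool,
                   w_similarity: float = 0.7, w_cluster: float = 0.3) -> float:
    """Weighted sum of a pairwise similarity and a same-cluster indicator."""
    cluster_term = 1.0 if same_cluster else 0.0
    return w_similarity * similarity + w_cluster * cluster_term


def exceeds_threshold(score: float, threshold: float = 0.6) -> bool:
    """True when a pair's combined score reaches the chosen threshold."""
    return score >= threshold
```

With weights that sum to 1 and a similarity in the range 0 to 1, the combined score also stays between 0 and 1, which keeps the threshold easy to interpret.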
+ ### 7. Hybrid Approach:
+
+ - **Traditional Models:**
+   - Use traditional similarity measures for efficiency.
+   - Implement efficient algorithms for quick pairwise comparisons.
+
+ - **Large Language Models:**
+   - Fine-tune or use pre-trained models for enhanced context understanding.
+   - Hugging Face Transformers library for accessing pre-trained models.
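
One way to read the hybrid approach is as a two-stage pipeline: a cheap measure filters candidate pairs, and only the survivors go to the expensive model. The sketch below assumes Jaccard as the cheap filter and takes the expensive scorer as a callable, `expensive_score`, which is a hypothetical placeholder for whatever model-based comparison is used (for example one built on Hugging Face Transformers). The default `prefilter_threshold` is likewise an assumption.

```python
from typing import Callable


def jaccard(a: set[str], b: set[str]) -> float:
    """Cheap token-overlap measure used as the first-stage filter."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_compare(
    docs: dict[str, list[str]],
    expensive_score: Callable[[str, str], float],
    prefilter_threshold: float = 0.2,
) -> dict[tuple[str, str], float]:
    """Score only the pairs that pass the cheap filter with the expensive scorer."""
    names = sorted(docs)
    token_sets = {name: set(docs[name]) for name in names}
    results = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Stage 1: skip pairs with almost no shared vocabulary.
            if jaccard(token_sets[a], token_sets[b]) >= prefilter_threshold:
                # Stage 2: spend the expensive call only on plausible pairs.
                results[(a, b)] = expensive_score(a, b)
    return results
```

A filter based on exact token overlap can miss paraphrased pairs, so the threshold is best kept low and checked against a sample scored without filtering.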