	---
	license: mit
	language:
	- en
	---

	# SkimLit: NLP Model for Medical Abstracts
	SkimLit is a natural language processing (NLP) project that aims to make medical abstracts easier to read. It replicates the methodology outlined in the paper "PubMed 200K RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts," using TensorFlow and various deep learning techniques.

	# Project Overview

	# `Section 1`

	## Data Collection
	- The PubMed 200K RCT dataset is obtained from the author's GitHub repository using the following commands:
	```
	git clone https://github.com/Franck-Dernoncourt/pubmed-rct
	cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
	```

	## Data Preprocessing
	- Sentences are extracted from the dataset, and numeric labels are assigned for machine learning models.
	- Three baseline models are established to set the foundation for more complex models.
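The extraction step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes the layout used by the files in the cloned repository, where each abstract starts with a `###` ID line, each sentence is stored as `LABEL<TAB>text`, and a blank line closes the abstract. The function names, the record keys other than `line_number` and `total_lines`, and the sample sentences are made up for the example, and counting `total_lines` as the number of sentences in the abstract is an assumption.

```python
from typing import Dict, List, Tuple


def parse_abstracts(lines: List[str]) -> List[Dict]:
    """Turn raw dataset lines into one record per sentence."""
    samples: List[Dict] = []
    current: List[str] = []  # sentences of the abstract being read

    def flush() -> None:
        # Emit one record per sentence of the finished abstract.
        for number, entry in enumerate(current):
            label, text = entry.split("\t", 1)
            samples.append({
                "target": label,
                "text": text,
                "line_number": number,          # zero-based position
                "total_lines": len(current),    # assumption: sentence count
            })
        current.clear()

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("###") or not line.strip():
            flush()  # an ID line or a blank line ends the previous abstract
        else:
            current.append(line)
    flush()  # handle a file that does not end with a blank line
    return samples


def encode_labels(samples: List[Dict]) -> Tuple[List[int], Dict[str, int]]:
    """Map each label string to an integer id (alphabetical order)."""
    classes = sorted({s["target"] for s in samples})
    label_to_id = {name: i for i, name in enumerate(classes)}
    return [label_to_id[s["target"]] for s in samples], label_to_id


# Made-up lines in the dataset's layout, for illustration only.
raw = [
    "###00000001\n",
    "OBJECTIVE\tTo assess the effect of @ weeks of treatment .\n",
    "METHODS\tA total of @ patients were randomized .\n",
    "RESULTS\tPain scores improved in the treatment group .\n",
    "\n",
]
samples = parse_abstracts(raw)
label_ids, label_to_id = encode_labels(samples)
```

In a real run, `raw` would come from reading a file such as `train.txt` with `readlines()`.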

	## Baseline Model (Model 0)
	- A TF-IDF Multinomial Naive Bayes classifier is implemented.
	- Classification evaluation metrics such as accuracy, precision, recall, and F1-score are employed.
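To make the four metrics concrete, here is a library-free sketch of how they can be computed from true and predicted label ids. The function name is hypothetical, and support-weighted averaging across classes is an assumption, since the README does not say which averaging is used. In practice, scikit-learn's `accuracy_score` and `precision_recall_fscore_support` would normally do this work.

```python
from collections import Counter
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall and F1."""
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration.
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

With weighted averaging, recall always equals accuracy, which is a quick sanity check on the output.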

	## Deep Sequence Models
	### Model 1: Conv1D with Token Embeddings
	- A custom `TextVectorization` layer and a token embedding layer are created.
	- Data loading is optimized for efficiency using the TensorFlow `tf.data` API.

	### Model 2: Pretrained Token Embeddings
	- The Universal Sentence Encoder (USE) from TensorFlow Hub is used for feature extraction.

	### Model 3: Conv1D with Character Embeddings
	- A character-level tokenizer and embedding layer are implemented.
	- A Conv1D model is constructed using character embeddings.
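Character-level tokenization can be illustrated without TensorFlow. The sketch below is an assumption about the general approach rather than the project's code: it splits a sentence into characters, builds a character vocabulary, and encodes a sentence as fixed-length integer ids that an embedding layer could consume. All function names are hypothetical, as is the choice to drop spaces and to reserve id 0 for padding and id 1 for unknown characters.

```python
from typing import Dict, Iterable, List


def split_chars(text: str) -> str:
    """Insert spaces between characters so a whitespace tokenizer sees each one."""
    return " ".join(list(text))


def build_char_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Assign an id to every non-space character seen in the texts."""
    chars = sorted({c for t in texts for c in t if c != " "})
    # Assumption: 0 is reserved for padding, 1 for unknown characters.
    return {c: i + 2 for i, c in enumerate(chars)}


def encode_chars(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map characters to ids, then truncate or zero-pad to max_len."""
    ids = [vocab.get(c, 1) for c in text if c != " "][:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = build_char_vocab(["abc ab"])
encoded = encode_chars("cab z", vocab, max_len=6)
spaced = split_chars("rct")
```

In the TensorFlow version, a `TextVectorization` layer applied to the output of a helper like `split_chars` would typically play the role of `build_char_vocab` and `encode_chars`.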

	### Model 4: Hybrid Embedding Layer
	- Token and character-level embeddings are combined using `layers.Concatenate`.
	- A model is developed to process both types of embeddings and output label probabilities.

	### Model 5: Transfer Learning with Positional Embeddings
	- Positional embeddings are introduced to enhance the model's understanding of the sequence.
	- A tribrid embedding model is created, combining token embeddings, character embeddings, and positional features (`line_number` and `total_lines`).
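The positional features can be turned into model inputs by one-hot encoding them. The following is a plain-Python sketch under stated assumptions: the depths of 15 and 20 are hypothetical values chosen for illustration, the function names are made up, and positions beyond the depth map to an all-zero vector, which mirrors how `tf.one_hot` treats out-of-range indices.

```python
from typing import Dict, List, Tuple


def one_hot(index: int, depth: int) -> List[float]:
    """Return a one-hot vector; out-of-range indices give all zeros."""
    vec = [0.0] * depth
    if 0 <= index < depth:
        vec[index] = 1.0
    return vec


def positional_features(
    sample: Dict[str, int],
    line_depth: int = 15,    # hypothetical cap on encoded line numbers
    total_depth: int = 20,   # hypothetical cap on encoded abstract lengths
) -> Tuple[List[float], List[float]]:
    """Encode a sentence's position and its abstract's length."""
    return (
        one_hot(sample["line_number"], line_depth),
        one_hot(sample["total_lines"], total_depth),
    )


line_vec, total_vec = positional_features({"line_number": 3, "total_lines": 11})
```

Choosing each depth to cover most abstracts in the training set keeps the vectors short while losing position information only for unusually long abstracts.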

	## Model Evaluation and Comparison
	- Models are evaluated on various datasets to compare their performance.

	## Save and Load Models
	- Models are saved and loaded for future use.

	## Model Loading and Evaluation
	- Pre-trained models are loaded and evaluated on validation datasets.

	## Test Dataset Processing and Prediction
	- A test dataset is created, preprocessed, and used for making predictions with the loaded model.

	## Enriching Test Dataframe with Predictions
	- Predictions and additional columns are added to the test dataframe for analysis.

	## Finding Top Wrong Predictions
	- The top 100 most inaccurately predicted samples are identified.
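One way to read "most inaccurately predicted" is: the samples the model got wrong while being most confident. The sketch below follows that interpretation, which is an assumption, and uses hypothetical record keys (`target`, `prediction`, `pred_prob`) and made-up rows in place of the real test dataframe.

```python
from typing import Dict, List


def top_wrong_predictions(records: List[Dict], n: int = 100) -> List[Dict]:
    """Return the n misclassified records with the highest predicted probability."""
    wrong = [r for r in records if r["target"] != r["prediction"]]
    return sorted(wrong, key=lambda r: r["pred_prob"], reverse=True)[:n]


# Made-up rows for illustration.
records = [
    {"text": "s1", "target": "METHODS", "prediction": "METHODS", "pred_prob": 0.99},
    {"text": "s2", "target": "RESULTS", "prediction": "METHODS", "pred_prob": 0.91},
    {"text": "s3", "target": "BACKGROUND", "prediction": "OBJECTIVE", "pred_prob": 0.97},
    {"text": "s4", "target": "CONCLUSIONS", "prediction": "RESULTS", "pred_prob": 0.55},
]
worst = top_wrong_predictions(records, n=2)
```

With a pandas dataframe, the same idea is usually expressed as a boolean filter on a correctness column followed by `sort_values("pred_prob", ascending=False)`.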

	## Investigating Top Wrong Predictions
	- Detailed information on the top 10 wrong predictions is displayed.

	# `Section 2`
	## Example Abstracts
	- Example abstracts are downloaded from a GitHub repository.

	## Processing Example Abstracts with spaCy
	- spaCy is used to parse sentences from example abstracts.

	## One-Hot Encoding and Prediction on Example Abstracts
	- Line numbers and total lines are one-hot encoded, and predictions are made using the loaded model.

	## Visualizing Predictions on Example Abstracts
	- Predicted sequence labels for each line in the abstract are displayed.
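A minimal sketch of the display step: given the sentences of an abstract and one probability vector per sentence, pick the most likely class and prefix each sentence with its label. The function name and the sample inputs are hypothetical, and the alphabetical class order is an assumption about how the labels were encoded.

```python
from typing import List, Sequence

# The five section labels of the dataset, in assumed alphabetical order.
CLASS_NAMES = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def label_lines(
    sentences: Sequence[str],
    probabilities: Sequence[Sequence[float]],
    class_names: Sequence[str] = CLASS_NAMES,
) -> List[str]:
    """Prefix each sentence with the name of its highest-probability class."""
    labelled = []
    for sentence, probs in zip(sentences, probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        labelled.append(f"{class_names[best]}: {sentence}")
    return labelled


# Made-up sentences and probabilities for illustration.
lines = label_lines(
    ["Patients were randomized into two groups.", "Symptoms improved after treatment."],
    [[0.05, 0.05, 0.80, 0.05, 0.05], [0.02, 0.08, 0.10, 0.00, 0.80]],
)
for line in lines:
    print(line)
```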

	# Conclusion
	- SkimLit provides a comprehensive exploration of NLP techniques for medical abstracts, from baseline models to sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insights into their strengths and limitations.

	- Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.