{"cells":[{"cell_type":"markdown","id":"c3d0765e","metadata":{"id":"c3d0765e"},"source":["### Fake News Classification using machine learning\n","**Project Workflow**\n","0. Installing required libraries from requirements.txt for project \n","1. Problem Statement\n","2. Data Collection\n","3. Importing required libraries and installations\n","4. Importing data as csv file\n","5. Exploratory Data Analysis\n","6. Data Preparation and Preprocessing\n","   - Tokenization\n","   - Lower case conversion\n","   - Stopwords\n","   - Lemmatization/Stemming\n","7. Vectorization with TFIDF\n","8. Data splitting into train and test\n","9. Model Building and Evaluation\n","   - 9.1 Cross validation\n","   - 9.2 Training data with best model from cross validation\n","   - 9.3 Evaluating the performance of trained model on test data\n","   - 9.4 Saving and loading model\n","10. Prediction pipeline\n"]},{"cell_type":"markdown","source":["#### 0. Installing required libraries from requirements.txt for project "],"metadata":{"id":"L5r8W5UkBt1x"},"id":"L5r8W5UkBt1x"},{"cell_type":"code","source":["#pip freeze records the currently installed packages in requirements.txt;\n","#to install from an existing file instead, run: pip install -r requirements.txt\n","!pip freeze > requirements.txt"],"metadata":{"id":"0OD5ZBRGHsoR","executionInfo":{"status":"ok","timestamp":1683411294453,"user_tz":-60,"elapsed":3121,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"0OD5ZBRGHsoR","execution_count":7,"outputs":[]},{"cell_type":"markdown","id":"bc956f4a","metadata":{"id":"bc956f4a"},"source":["#### 1. Problem Statement\n","The prevalence of fake news on online media platforms is increasingly recognized as a critical concern for society, as it has the potential to manipulate public opinion, incite social unrest, and erode trust in credible news sources. To mitigate the negative impact of fake news, the development of an effective classification system capable of accurately identifying and filtering false news articles from genuine sources is essential. 
However, building such a system presents numerous challenges, including the multifaceted and intricate nature of news content, the rapid dissemination of fake news through social media channels, and the possibility of biased or incomplete datasets. As such, it is imperative to develop an accurate and reliable fake news classification system leveraging advanced natural language processing (NLP) techniques. By achieving this, we can safeguard the integrity of news and promote public trust in media outlets.\n"]},{"cell_type":"markdown","id":"c76d48fc","metadata":{"id":"c76d48fc"},"source":["#### 2. Data collection\n","The dataset is available on Kaggle at the link below:\n","https://www.kaggle.com/c/fake-news/data?select=train.csv\n"]},{"cell_type":"markdown","id":"89f5a900","metadata":{"id":"89f5a900"},"source":["#### 3. Importing required libraries and installations"]},{"cell_type":"code","source":["!pip install catboost"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"2Pv7sgrZeIi1","executionInfo":{"status":"ok","timestamp":1683400206764,"user_tz":-60,"elapsed":5288,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"f685c0c3-5440-475a-f63a-e624c844ea0a"},"id":"2Pv7sgrZeIi1","execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: catboost in /usr/local/lib/python3.10/dist-packages (1.2)\n","Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.10/dist-packages (from catboost) (1.22.4)\n","Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from catboost) (3.7.1)\n","Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-packages (from catboost) (0.20.1)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from catboost) (1.10.1)\n","Requirement already 
satisfied: six in /usr/local/lib/python3.10/dist-packages (from catboost) (1.16.0)\n","Requirement already satisfied: pandas>=0.24 in /usr/local/lib/python3.10/dist-packages (from catboost) (1.5.3)\n","Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (from catboost) (5.13.1)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2022.7.1)\n","Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2.8.2)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (23.1)\n","Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (8.4.0)\n","Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (3.0.9)\n","Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.0.7)\n","Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.4.4)\n","Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (4.39.3)\n","Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (0.11.0)\n","Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly->catboost) (8.2.2)\n"]}]},{"cell_type":"code","execution_count":45,"id":"9f08e579","metadata":{"id":"9f08e579","executionInfo":{"status":"ok","timestamp":1683405864858,"user_tz":-60,"elapsed":300,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[],"source":["import numpy as np\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import seaborn as 
sns\n","from typing import Dict\n","import re\n","import os\n","import nltk\n","import joblib\n","from nltk.corpus import stopwords\n","from nltk.stem import PorterStemmer, WordNetLemmatizer\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.model_selection import cross_val_score\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report, confusion_matrix\n","import xgboost\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.naive_bayes import GaussianNB\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","import catboost"]},{"cell_type":"markdown","id":"81cb7d2f","metadata":{"id":"81cb7d2f"},"source":["#### 4. Importing data as csv file"]},{"cell_type":"code","source":["#connect to Google Drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"eDnbf_8bldN6","executionInfo":{"status":"ok","timestamp":1683396535999,"user_tz":-60,"elapsed":2857,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"ccda467b-b182-484a-c994-25de624ef806"},"id":"eDnbf_8bldN6","execution_count":2,"outputs":[{"output_type":"stream","name":"stdout","text":["Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"]}]},{"cell_type":"code","source":["#unzip data file \n","!unzip -q \"/content/drive/MyDrive/Upwork/Fake_news/Data/fake_news_train.csv.zip\""],"metadata":{"id":"mrZlCYl7mJox","executionInfo":{"status":"ok","timestamp":1683394309448,"user_tz":-60,"elapsed":3374,"user":{"displayName":"Owusu 
Samuel","userId":"05986356755319312051"}}},"id":"mrZlCYl7mJox","execution_count":3,"outputs":[]},{"cell_type":"code","execution_count":3,"id":"5fe790df","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"5fe790df","executionInfo":{"status":"ok","timestamp":1683400220100,"user_tz":-60,"elapsed":2113,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"4e71c45c-cd41-4972-de31-8da244bc8a3e"},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" id title author \\\n","0 0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus \n","1 1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn \n","2 2 Why the Truth Might Get You Fired Consortiumnews.com \n","3 3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss \n","4 4 Iranian woman jailed for fictional unpublished... Howard Portnoy \n","\n"," text label \n","0 House Dem Aide: We Didn’t Even See Comey’s Let... 1 \n","1 Ever get the feeling your life circles the rou... 0 \n","2 Why the Truth Might Get You Fired October 29, ... 1 \n","3 Videos 15 Civilians Killed In Single US Airstr... 1 \n","4 Print \\nAn Iranian woman has been sentenced to... 1 "]},"metadata":{},"execution_count":3}],"source":["#Define path to data\n","data_path = \"/content/train.csv\"\n","#read data as dataframe\n","df = pd.read_csv(data_path)\n","df.head()"]},{"cell_type":"code","source":["# set the random state\n","random_state = 42\n","np.random.seed(random_state)"],"metadata":{"id":"LUA9nKiE8Kf9","executionInfo":{"status":"ok","timestamp":1683400264229,"user_tz":-60,"elapsed":331,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"LUA9nKiE8Kf9","execution_count":11,"outputs":[]},{"cell_type":"markdown","id":"537579d9","metadata":{"id":"537579d9"},"source":["#### 5. Exploratory Data Analysis"]},{"cell_type":"code","execution_count":null,"id":"840ee31f","metadata":{"id":"840ee31f","outputId":"f3dab38b-415f-48a3-e3d3-b2e53f8b9eef","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683244184790,"user_tz":-60,"elapsed":11,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(20800, 5)"]},"metadata":{},"execution_count":6}],"source":["#check data shape\n","df.shape"]},{"cell_type":"code","execution_count":null,"id":"3445582c","metadata":{"id":"3445582c","outputId":"646cefe7-d8ca-40db-fe63-1147c31a3240","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683244184791,"user_tz":-60,"elapsed":10,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["\n","RangeIndex: 20800 entries, 0 to 20799\n","Data columns (total 5 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 id 20800 non-null int64 \n"," 1 title 20242 non-null object\n"," 2 author 18843 non-null object\n"," 3 text 20761 non-null object\n"," 4 label 20800 non-null int64 \n","dtypes: int64(2), object(3)\n","memory usage: 812.6+ KB\n"]}],"source":["#check data types of 
features\n","df.info()"]},{"cell_type":"code","execution_count":7,"id":"81e47852","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"81e47852","executionInfo":{"status":"ok","timestamp":1683396570480,"user_tz":-60,"elapsed":437,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"593d348b-4433-44f0-d6c9-4a65934e72d3"},"outputs":[{"output_type":"stream","name":"stdout","text":["Columns are Index(['title', 'author', 'text', 'label'], dtype='object')\n"]}],"source":["#drop id column\n","df = df.drop(\"id\",axis= 1)\n","print(f\"Columns are {df.columns}\")"]},{"cell_type":"code","execution_count":null,"id":"61fcfed9","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"61fcfed9","executionInfo":{"status":"ok","timestamp":1683244185271,"user_tz":-60,"elapsed":3,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"e5bfbd3f-4800-429c-ee62-02d8896c3b80"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["title 558\n","author 1957\n","text 39\n","label 0\n","dtype: int64"]},"metadata":{},"execution_count":9}],"source":["#check missing values\n","df.isnull().sum()"]},{"cell_type":"code","execution_count":null,"id":"1613f7d3","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"1613f7d3","executionInfo":{"status":"ok","timestamp":1683241692941,"user_tz":-60,"elapsed":8,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"3b87e7c9-f340-43f8-dd76-0960101ce3ef"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["1 10413\n","0 10387\n","Name: label, dtype: int64"]},"metadata":{},"execution_count":8}],"source":["#Check number of classes in 
label\n","df[\"label\"].value_counts()"]},{"cell_type":"code","execution_count":null,"id":"8149beb7","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":420},"id":"8149beb7","executionInfo":{"status":"ok","timestamp":1683237748981,"user_tz":-60,"elapsed":11,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"091d7f96-b5fe-4fce-8f76-445b49c56e8c"},"outputs":[{"output_type":"display_data","data":{"text/plain":[""],"image/png":"\n"},"metadata":{}}],"source":["#Draw distribution of classes in label to check for imbalance\n","#sort by label value so the counts line up with the labels below (assumes 0 = real, 1 = fake)\n","sizes = df[\"label\"].value_counts().sort_index().values\n","labels = [\"Real news\",\"Fake news\"]\n","\n","fig, ax = plt.subplots()\n","ax.pie(sizes, labels=labels, autopct='%1.1f%%',\n","       pctdistance=1.25, labeldistance=.6)\n","plt.show()\n"]},{"cell_type":"markdown","id":"4ceb6c5e","metadata":{"id":"4ceb6c5e"},"source":["Comment: The data is balanced (10,413 vs. 10,387 articles)."]},{"cell_type":"code","execution_count":null,"id":"c1b14704","metadata":{"id":"c1b14704","outputId":"9986f91f-4a67-4580-bdc2-843f9b98ab05"},"outputs":[{"data":{"text/plain":["'Jackie Mason: Hollywood Would Love Trump if He Bombed North Korea over Lack of Trans Bathrooms (Exclusive Video) - Breitbart'"]},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":["#Check different titles\n","df[\"title\"][5]"]},{"cell_type":"code","execution_count":null,"id":"60cf9fb5","metadata":{"id":"60cf9fb5","outputId":"baf7b1b1-4837-4491-83a4-7e96a2e12bfe"},"outputs":[{"data":{"text/plain":["'Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination? [Hillary Clinton remains the big woman on campus in leafy, liberal Wellesley, Massachusetts. Everywhere else votes her most likely to don her inauguration dress for the remainder of her days the way Miss Havisham forever wore that wedding dress. Speaking of Great Expectations, Hillary Rodham overflowed with them 48 years ago when she first addressed a Wellesley graduating class. The president of the college informed those gathered in 1969 that the students needed “no debate so far as I could ascertain as to who their spokesman was to be” (kind of the like the Democratic primaries in 2016 minus the terms unknown then even at a Seven Sisters school). “I am very glad that Miss Adams made it clear that what I am speaking for today is all of us — the 400 of us,” Miss Rodham told her classmates. 
After appointing herself Edger Bergen to the Charlie McCarthys and Mortimer Snerds in attendance, the bespectacled in granny glasses (awarding her matronly wisdom — or at least John Lennon wisdom) took issue with the previous speaker. Despite becoming the first to win election to a seat in the U. S. Senate since Reconstruction, Edward Brooke came in for criticism for calling for “empathy” for the goals of protestors as he criticized tactics. Though Clinton in her senior thesis on Saul Alinsky lamented “Black Power demagogues” and “elitist arrogance and repressive intolerance” within the New Left, similar words coming out of a Republican necessitated a brief rebuttal. “Trust,” Rodham ironically observed in 1969, “this is one word that when I asked the class at our rehearsal what it was they wanted me to say for them, everyone came up to me and said ‘Talk about trust, talk about the lack of trust both for us and the way we feel about others. Talk about the trust bust.’ What can you say about it? What can you say about a feeling that permeates a generation and that perhaps is not even understood by those who are distrusted?” The “trust bust” certainly busted Clinton’s 2016 plans. She certainly did not even understand that people distrusted her. After Whitewater, Travelgate, the vast conspiracy, Benghazi, and the missing emails, Clinton found herself the distrusted voice on Friday. There was a load of compromising on the road to the broadening of her political horizons. And distrust from the American people — Trump edged her 48 percent to 38 percent on the question immediately prior to November’s election — stood as a major reason for the closing of those horizons. Clinton described her vanquisher and his supporters as embracing a “lie,” a “con,” “alternative facts,” and “a assault on truth and reason. ” She failed to explain why the American people chose his lies over her truth. 
“As the history majors among you here today know all too well, when people in power invent their own facts and attack those who question them, it can mark the beginning of the end of a free society,” she offered. “That is not hyperbole. ” Like so many people to emerge from the 1960s, Hillary Clinton embarked upon a long, strange trip. From high school Goldwater Girl and Wellesley College Republican president to Democratic politician, Clinton drank in the times and the place that gave her a degree. More significantly, she went from idealist to cynic, as a comparison of her two Wellesley commencement addresses show. Way back when, she lamented that “for too long our leaders have viewed politics as the art of the possible, and the challenge now is to practice politics as the art of making what appears to be impossible possible. ” Now, as the big woman on campus but the odd woman out of the White House, she wonders how her current station is even possible. “Why aren’t I 50 points ahead?” she asked in September. In May she asks why she isn’t president. The woman famously dubbed a “congenital liar” by Bill Safire concludes that lies did her in — theirs, mind you, not hers. Getting stood up on Election Day, like finding yourself the jilted bride on your wedding day, inspires dangerous delusions.'"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["#Check the text body of a news article\n","df[\"text\"][1]"]},{"cell_type":"markdown","id":"ab04641f","metadata":{"id":"ab04641f"},"source":["Comment: Only the title of each article will be used to train the model. The full article texts are long, so using titles reduces preprocessing time and model complexity."]},{"cell_type":"markdown","id":"006c2d2d","metadata":{"id":"006c2d2d"},"source":["#### 6. 
Data Preparation and preprocessing"]},{"cell_type":"code","execution_count":4,"id":"6bc90c8b","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"6bc90c8b","executionInfo":{"status":"ok","timestamp":1683400236563,"user_tz":-60,"elapsed":318,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"582443c8-8d02-4d7d-dfb2-690b4021a92c"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(18285, 5)"]},"metadata":{},"execution_count":4}],"source":["#Dropping missing values\n","df = df.dropna().reset_index(drop=True)\n","df.shape"]},{"cell_type":"code","source":["df.head()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"gySCz1GWnhAY","executionInfo":{"status":"ok","timestamp":1683244193531,"user_tz":-60,"elapsed":6,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"93efd184-5eb4-4f4a-e59e-8a20b40d5da3"},"id":"gySCz1GWnhAY","execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" title author \\\n","0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus \n","1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn \n","2 Why the Truth Might Get You Fired Consortiumnews.com \n","3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss \n","4 Iranian woman jailed for fictional unpublished... Howard Portnoy \n","\n"," text label \n","0 House Dem Aide: We Didn’t Even See Comey’s Let... 1 \n","1 Ever get the feeling your life circles the rou... 0 \n","2 Why the Truth Might Get You Fired October 29, ... 1 \n","3 Videos 15 Civilians Killed In Single US Airstr... 1 \n","4 Print \\nAn Iranian woman has been sentenced to... 1 "]},"metadata":{},"execution_count":11}]},{"cell_type":"code","execution_count":5,"id":"f0b3ac50","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"f0b3ac50","executionInfo":{"status":"ok","timestamp":1683400239526,"user_tz":-60,"elapsed":531,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"81a62e7d-4ff5-4b98-dc29-7c3783409880"},"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package wordnet to /root/nltk_data...\n","[nltk_data] Package wordnet is already up-to-date!\n","[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Package stopwords is already up-to-date!\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":5}],"source":["#Download the WordNet and stopwords resources required by nltk\n","nltk.download('wordnet')\n","nltk.download('stopwords')"]},{"cell_type":"code","execution_count":6,"id":"72962f94","metadata":{"id":"72962f94","executionInfo":{"status":"ok","timestamp":1683400243237,"user_tz":-60,"elapsed":3253,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[],"source":["#initialise lemmatization object\n","lm = WordNetLemmatizer()\n","stop_words = set(stopwords.words(\"english\"))\n","def preprocess_text(df,feature):\n","    #initialise corpus to store the preprocessed texts\n","    corpus = []\n","    for i in range(len(df)):\n","        review = re.sub(\"[^a-zA-Z0-9]\",\" \",df[feature][i]) #replace every non-alphanumeric character with a space\n","        review = review.lower() #convert to lower case\n","        review = review.split() #Tokenize text\n","        review = [lm.lemmatize(x) for x in review if x not in stop_words] #lemmatize and remove stopwords\n","        review = \" \".join(review) #join as text\n","        corpus.append(review)\n","\n","    return corpus\n","#preprocess text and get desired document \n","corpus = preprocess_text(df = df,feature = \"title\") "]},{"cell_type":"code","source":["#indexing on 
corpus\n","corpus[0]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":36},"id":"G38ail2Knwn_","executionInfo":{"status":"ok","timestamp":1683394334150,"user_tz":-60,"elapsed":31,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"e6aca2ec-35ae-4752-9924-74278e16f888"},"id":"G38ail2Knwn_","execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'house dem aide: didn’t even see comey’s letter jason chaffetz tweeted'"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","id":"f1bf195d","metadata":{"id":"f1bf195d"},"source":["#### 7. Vectorization with TFIDF"]},{"cell_type":"code","execution_count":49,"id":"6ac22f36","metadata":{"id":"6ac22f36","executionInfo":{"status":"ok","timestamp":1683405979852,"user_tz":-60,"elapsed":2223,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[],"source":["#Convert texts to array using tfidf vectorizer\n","def vectorize(corpus):\n","    #TfidfVectorizer is deterministic and does not accept a random_state argument\n","    tf = TfidfVectorizer()\n","    x = tf.fit_transform(corpus).toarray()\n","    return x,tf\n","x,tf = vectorize(corpus)"]},{"cell_type":"code","source":["#Save the fitted TF-IDF preprocessor using joblib\n","os.chdir(\"/content/drive/MyDrive/Upwork/Fake_news/Artifacts\")\n","joblib.dump(tf, 'tfidf_preprocessor.pkl')\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"v2TM3l2BzpqN","executionInfo":{"status":"ok","timestamp":1683405981843,"user_tz":-60,"elapsed":478,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"c85cca51-4845-421e-d5bd-95c282c1c873"},"id":"v2TM3l2BzpqN","execution_count":50,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['tfidf_preprocessor.pkl']"]},"metadata":{},"execution_count":50}]},{"cell_type":"markdown","id":"0bc0424a","metadata":{"id":"0bc0424a"},"source":["#### 8. 
Data splitting into train and test"]},{"cell_type":"code","execution_count":12,"id":"5732a186","metadata":{"id":"5732a186","executionInfo":{"status":"ok","timestamp":1683400272809,"user_tz":-60,"elapsed":2671,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"outputs":[],"source":["#split data into training and test using 20 % as test data\n","y = df[\"label\"]\n","x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=random_state,stratify = y)"]},{"cell_type":"code","source":["print(f\"Training data size {len(x_train)}\")\n","print(f\"Test data size {len(x_test)}\")"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"1tm9pAZyvobo","executionInfo":{"status":"ok","timestamp":1683400272811,"user_tz":-60,"elapsed":18,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"943bd264-83a6-4789-f0ae-172789d7911f"},"id":"1tm9pAZyvobo","execution_count":13,"outputs":[{"output_type":"stream","name":"stdout","text":["Training data size 14628\n","Test data size 3657\n"]}]},{"cell_type":"markdown","source":["#### 9. 
Model Building and Evaluation"],"metadata":{"id":"7nDLkBPNv-hZ"},"id":"7nDLkBPNv-hZ"},{"cell_type":"markdown","source":["9.1 Cross validation"],"metadata":{"id":"c34p1KV0wGJt"},"id":"c34p1KV0wGJt"},{"cell_type":"code","source":["#Define a dictionary of models to be used for cross validation\n","models = {\"logistic_regression\":LogisticRegression(),\n","          \"Naive_Bayes\":GaussianNB(),\n","          \"catboost\":catboost.CatBoostClassifier(iterations = 100,random_state=random_state,silent=True)}"],"metadata":{"id":"7HZKADTKzM-5","executionInfo":{"status":"ok","timestamp":1683400460351,"user_tz":-60,"elapsed":501,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"7HZKADTKzM-5","execution_count":18,"outputs":[]},{"cell_type":"code","source":["\n","model_scores = [] # define list to store scores of models\n","model_names = models.keys() #define for names of models\n","\n","def cross_validate(model,x_train:np.ndarray,y_train:np.ndarray,scoring:str):\n","    #Run 5-fold cross validation and record the mean score of the model\n","    cv_results = cross_val_score(model,x_train, y_train, cv=5, scoring=scoring)\n","    mean_accuracy = cv_results.mean()\n","    model_scores.append(mean_accuracy) #list.append returns None, so return the score itself\n","    return mean_accuracy\n","#cross validate with logistic regression\n","cross_validate(model=models[\"logistic_regression\"],x_train=x_train,y_train= y_train,scoring=\"accuracy\")"],"metadata":{"id":"Aiv5b1-Bv9d4","executionInfo":{"status":"ok","timestamp":1683400361423,"user_tz":-60,"elapsed":330,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"Aiv5b1-Bv9d4","execution_count":15,"outputs":[]},{"cell_type":"code","source":["#cross validate with Naive Bayes\n","cross_validate(model=models[\"Naive_Bayes\"],x_train=x_train,y_train=y_train,scoring=\"accuracy\")"],"metadata":{"id":"O3Q6QmMK-iAl","executionInfo":{"status":"ok","timestamp":1683396840716,"user_tz":-60,"elapsed":29449,"user":{"displayName":"Owusu 
Samuel","userId":"05986356755319312051"}}},"id":"O3Q6QmMK-iAl","execution_count":17,"outputs":[]},{"cell_type":"code","source":["#create dataframe for results\n","result_df = pd.DataFrame()\n","result_df[\"model\"] = [\"logistic_regression\",\"Naive Bayes\"]\n","result_df[\"Accuracy\"] = model_scores\n","result_df_sorted = result_df.sort_values(\"Accuracy\",ascending=False)\n","\n","#save dataframe to result folder\n","path_to_result = \"/content/drive/MyDrive/Upwork/Fake_news/Results\"\n","os.chdir(path_to_result)\n","result_df_sorted.to_csv(\"crossval_scores.csv\",index=False)\n","result_df_sorted.head()"],"metadata":{"id":"ctAK2PAcJ-aR","colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"status":"ok","timestamp":1683396840720,"user_tz":-60,"elapsed":29,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"d1e7615c-e4d9-4c5d-e0bd-3856ecf58699"},"id":"ctAK2PAcJ-aR","execution_count":18,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" model Accuracy\n","0 logistic_regression 0.916735\n","1 Naive Bayes 0.646158"]},"metadata":{},"execution_count":18}]},{"cell_type":"code","source":["#cross validate with CatBoost\n","cb_score = cross_validate(model=models[\"catboost\"],x_train=x_train,y_train=y_train,scoring=\"accuracy\") "],"metadata":{"id":"EME2qywu86x8","executionInfo":{"status":"ok","timestamp":1683401395710,"user_tz":-60,"elapsed":841384,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"EME2qywu86x8","execution_count":19,"outputs":[]},{"cell_type":"code","source":["#the CatBoost score is the most recently appended entry in model_scores\n","cb_score = model_scores[-1]"],"metadata":{"id":"ag10B6Oci0au","executionInfo":{"status":"ok","timestamp":1683401528475,"user_tz":-60,"elapsed":751,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"ag10B6Oci0au","execution_count":24,"outputs":[]},{"cell_type":"code","source":["cb_score"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"sZc3LCnIjAo1","executionInfo":{"status":"ok","timestamp":1683401539685,"user_tz":-60,"elapsed":288,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"a3dd7658-711f-48a9-bd86-3cb5c560b090"},"id":"sZc3LCnIjAo1","execution_count":25,"outputs":[{"output_type":"execute_result","data":{"text/plain":["0.9191961722488038"]},"metadata":{},"execution_count":25}]},{"cell_type":"code","source":["#Read the saved dataframe with the scores of the other trained models\n","result_df_sorted = pd.read_csv(\"/content/drive/MyDrive/Upwork/Fake_news/Results/crossval_scores.csv\")\n","# create a new row\n","new_row = {'model': 'catboost', 'Accuracy':cb_score}\n","# append the new row to the DataFrame (DataFrame.append is deprecated, so use pd.concat)\n","result_df_sorted = pd.concat([result_df_sorted, pd.DataFrame([new_row])], ignore_index=True)"],"metadata":{"id":"-HOsecFRKE-g","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683401544878,"user_tz":-60,"elapsed":316,"user":{"displayName":"Owusu 
Samuel","userId":"05986356755319312051"}},"outputId":"8dcec9f3-008b-416d-d8f0-b1a2e50e0dde"},"id":"-HOsecFRKE-g","execution_count":26,"outputs":[{"output_type":"stream","name":"stderr","text":[":6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n"," result_df_sorted = result_df_sorted.append(new_row, ignore_index=True)\n"]}]},{"cell_type":"code","source":["#create dataframe for results\n","result_df_sorted.to_csv(\"cross_val.csv\",index=False)\n","result_df_sorted.head()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":144},"id":"KPA6TUf6-rCu","executionInfo":{"status":"ok","timestamp":1683401566942,"user_tz":-60,"elapsed":333,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"62459945-ee9a-410b-fe21-69f934ed49fa"},"id":"KPA6TUf6-rCu","execution_count":27,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" model Accuracy\n","0 logistic_regression 0.916735\n","1 Naive Bayes 0.646158\n","2 catboost 0.919196"],"text/html":["\n","
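\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>model</th>\n","      <th>Accuracy</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>logistic_regression</td>\n","      <td>0.916735</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>Naive Bayes</td>\n","      <td>0.646158</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>catboost</td>\n","      <td>0.919196</td>\n","    </tr>\n","  </tbody>\n","</table>\n","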
\n"," "]},"metadata":{},"execution_count":27}]},{"cell_type":"markdown","source":["Comment: CatBoost has the highest cross-validation accuracy so far (0.919, just ahead of logistic regression at 0.917), so we train the final model with it."],"metadata":{"id":"SjKA6hTPjZst"},"id":"SjKA6hTPjZst"},{"cell_type":"markdown","source":["#### 9.2 Training data with best model from cross validation"],"metadata":{"id":"8ddGBs2H3YGC"},"id":"8ddGBs2H3YGC"},{"cell_type":"code","source":["model = catboost.CatBoostClassifier(iterations = 100,random_state=random_state,silent=True)\n","#train model\n","model.fit(x_train,y_train)"],"metadata":{"id":"W_MHsQb03LFK","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683404450426,"user_tz":-60,"elapsed":277204,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"045c35f0-2f43-4c8e-c926-d367df9773f0"},"id":"W_MHsQb03LFK","execution_count":28,"outputs":[{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":28}]},{"cell_type":"markdown","source":["#### 9.3 Evaluating the performance of trained model on test data"],"metadata":{"id":"Vzd6WBKq3tW6"},"id":"Vzd6WBKq3tW6"},{"cell_type":"code","source":["#predict on test data\n","y_pred = model.predict(x_test)"],"metadata":{"id":"4ZTV-5kC4Xkp","executionInfo":{"status":"ok","timestamp":1683404523302,"user_tz":-60,"elapsed":447,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"4ZTV-5kC4Xkp","execution_count":29,"outputs":[]},{"cell_type":"code","source":["#get accuracy, f1_score and roc_auc_score (all computed from the hard class predictions, not probabilities)\n","y_true = y_test\n","print(f\"Accuracy:{accuracy_score(y_true, y_pred)}\")\n","print(f\"f1_score:{f1_score(y_true, y_pred)}\")\n","print(f\"roc_auc_score:{roc_auc_score(y_true, y_pred)}\")"],"metadata":{"id":"RyN2JYod58pj","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683404593800,"user_tz":-60,"elapsed":308,"user":{"displayName":"Owusu 
Samuel","userId":"05986356755319312051"}},"outputId":"0ec6d6c3-2ff5-4a05-f879-587d41c008a1"},"id":"RyN2JYod58pj","execution_count":31,"outputs":[{"output_type":"stream","name":"stdout","text":["Accuracy:0.9278096800656276\n","f1_score:0.9221238938053098\n","roc_auc_score:0.9346622535108339\n"]}]},{"cell_type":"code","source":["#print classification report\n","print(classification_report(y_test, y_pred))"],"metadata":{"id":"hDDvNUvg4vzl","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1683404620466,"user_tz":-60,"elapsed":5,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"64edcd36-dd2b-430b-cde4-9fac50ea3fd2"},"id":"hDDvNUvg4vzl","execution_count":33,"outputs":[{"output_type":"stream","name":"stdout","text":[" precision recall f1-score support\n","\n"," 0 0.99 0.88 0.93 2072\n"," 1 0.87 0.99 0.92 1585\n","\n"," accuracy 0.93 3657\n"," macro avg 0.93 0.93 0.93 3657\n","weighted avg 0.94 0.93 0.93 3657\n","\n"]}]},{"cell_type":"code","source":["#plot confusion matrix\n","cm = confusion_matrix(y_true, y_pred)\n","# Save the classification report to a file\n","report = classification_report(y_test, y_pred)\n","with open(\"classification_report.txt\", \"w\") as f:\n"," f.write(report)\n","os.chdir(\"/content/drive/MyDrive/Upwork/Fake_news/Results\")\n","# Plot the confusion matrix\n","plt.figure(figsize=(8, 6))\n","sns.heatmap(cm, annot=True, cmap=\"Blues\", fmt=\"d\",)\n","plt.title(\"Confusion Matrix\")\n","plt.xlabel(\"Predicted Class\")\n","plt.ylabel(\"True Class\")\n","plt.savefig(\"confusion_matrix.png\")"],"metadata":{"id":"-yjPvPhV5rSE","colab":{"base_uri":"https://localhost:8080/","height":564},"executionInfo":{"status":"ok","timestamp":1683404897346,"user_tz":-60,"elapsed":1036,"user":{"displayName":"Owusu 
Samuel","userId":"05986356755319312051"}},"outputId":"ae90d320-cee6-459d-debe-746c1c1ac39f"},"id":"-yjPvPhV5rSE","execution_count":36,"outputs":[{"output_type":"display_data","data":{"text/plain":[""],"image/png":"\n"},"metadata":{}}]},{"cell_type":"markdown","source":["#### 9.4 Saving and loading model\n"],"metadata":{"id":"SDS1enG7tXwe"},"id":"SDS1enG7tXwe"},{"cell_type":"code","source":["#save model\n","path_to_save = \"/content/drive/MyDrive/Upwork/Fake_news/Models\"\n","os.chdir(path_to_save)\n","model.save_model(\"cb_fakes_news_model.cbm\")\n","#load saved model\n","loaded_model = catboost.CatBoostClassifier()\n","loaded_model.load_model('cb_fakes_news_model.cbm')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"C7DRj8t7tiuw","executionInfo":{"status":"ok","timestamp":1683405004496,"user_tz":-60,"elapsed":314,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"216de1c0-90f8-48e5-ac81-7b7747708764"},"id":"C7DRj8t7tiuw","execution_count":38,"outputs":[{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":38}]},{"cell_type":"markdown","source":["#### 10. Prediction pipeline"],"metadata":{"id":"PBlzl7olYzAb"},"id":"PBlzl7olYzAb"},{"cell_type":"code","source":["class Preprocessing:\n","    def __init__(self,data):\n","        self.data = data\n","\n","    def preprocess_text(self):\n","        lm = WordNetLemmatizer()\n","        #wrap the single text in a list so it is processed like a corpus\n","        pred_data = [self.data]\n","        preprocessed_data = []\n","\n","        for data in pred_data:\n","            #replace non-alphanumeric characters with spaces (should mirror the cleaning used before fitting the TF-IDF vectorizer)\n","            review = re.sub(\"[^a-zA-Z0-9]\",\" \",data)\n","            review = review.lower() #convert to lower case\n","            review = review.split() #tokenize text\n","            review = [lm.lemmatize(x) for x in review if x not in stop_words] #remove stopwords and lemmatize\n","            review = \" \".join(review) #join as text\n","            preprocessed_data.append(review)\n","\n","        return preprocessed_data\n","\n","class Prediction:\n","    def __init__(self,pred_data,model):\n","        self.pred_data = pred_data\n","        self.model = model\n","\n","    def predict(self):\n","        preprocess_data = Preprocessing(self.pred_data).preprocess_text()\n","\n","        # Load the saved TF-IDF preprocessor using joblib\n","        path = \"/content/drive/MyDrive/Upwork/Fake_news/Artifacts/tfidf_preprocessor.pkl\"\n","        loaded_tfidf = joblib.load(path)\n","        data = loaded_tfidf.transform(preprocess_data)\n","        predicted = self.model.predict(data)\n","\n","        # Kaggle fake-news labels: 1 = unreliable (fake), 0 = reliable (real)\n","        if predicted[0] == 1:\n","            return \"The news is fake\"\n","        else:\n","            return \"The news is real\""],"metadata":{"id":"fbk591IfY4ba","executionInfo":{"status":"ok","timestamp":1683407566665,"user_tz":-60,"elapsed":306,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}}},"id":"fbk591IfY4ba","execution_count":80,"outputs":[]},{"cell_type":"code","source":["df[\"title\"][8]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":36},"id":"00t5rh503w3c","executionInfo":{"status":"ok","timestamp":1683406927006,"user_tz":-60,"elapsed":354,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"50f8445a-c79c-4c7c-a513-f357050b2434"},"id":"00t5rh503w3c","execution_count":69,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'Obama’s Organizing for Action Partners with Soros-Linked ‘Indivisible’ to Disrupt Trump’s Agenda'"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":69}]},{"cell_type":"code","source":["#make prediction on user data\n","single_data = df[\"title\"][10]\n","pred_ = Prediction(single_data,loaded_model).predict()\n","pred_"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"7NKPY7VkyaV6","executionInfo":{"status":"ok","timestamp":1683407698008,"user_tz":-60,"elapsed":332,"user":{"displayName":"Owusu Samuel","userId":"05986356755319312051"}},"outputId":"bf2583c9-0134-4d08-f413-b8d4986def04"},"id":"7NKPY7VkyaV6","execution_count":99,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['russian researcher discover secret nazi military base ‘treasure hunter’ arctic [photos]']"]},"metadata":{},"execution_count":99}]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.16"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":5}