# YouTube Hate Speech ML Project



In [None]:
# The first thing is to import libraries.
# I am familiar with pandas and numpy, but let's research the other libraries.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

### Feature Extraction
- `from sklearn.feature_extraction.text import CountVectorizer` converts text documents into a matrix
  of token counts.
- A count vectorizer assigns a column to each distinct word (a "feature") and counts how often that word occurs in each document.
- `model_selection` provides tools for model selection and evaluation.
- `train_test_split` will split our data into training and testing sets.
- We use `DecisionTreeClassifier` to train our model. The decision tree learns from the labeled examples and then classifies new text into those labels.
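
To make "matrix of token counts" concrete, here is a minimal sketch using only the standard library. It is a simplification and not how `CountVectorizer` is implemented (the real class has its own tokenizer and returns a sparse matrix); the two sample sentences are made up for illustration.

```python
from collections import Counter

# Two tiny made-up documents.
docs = ["you are great", "you are you"]

# Build a sorted vocabulary: one column per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count how often each vocabulary word occurs in each document.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # column names
print(matrix)  # one row of counts per document
```

Each row is one document and each column is one word, which is the shape of data the classifier is trained on.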

- Stopwords are words that occur very frequently in a language (such as "the" or "is") and are usually removed during text preprocessing.
- `import pr` prints objects in a human-readable format and is used for working with nltk.
- `from nltk.stem.snowball import SnowballStemmer` imports the Snowball stemming algorithm, which reduces words to their stem.
- An example of the stemming algorithm is reducing the word 'running' to 'run'.


In [None]:
import re
import string
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Imports

- `import re` imports the "regular expressions" module, which lets us search and replace patterns in strings more easily.
- `import string` imports the `string` module, which provides ready-made constants such as `string.punctuation` that are handy for text processing tasks.

### `nltk` is SUPER important. It does the following:

1. Tokenization: Breaking text into words or sentences.
2. Part-of-Speech Tagging: Assigning grammatical categories (e.g., noun, verb, adjective) to words in a sentence.
3. Named Entity Recognition (NER): Identifying and classifying named entities such as persons, organizations, and locations in text.
4. Parsing: Analyzing the syntactic structure of sentences.
5. WordNet Integration: Accessing WordNet, a lexical database for English, for synonyms, antonyms, and other lexical relationships.
6. Text Corpora: Access to various text corpora for training and testing NLP models.
7. Text Classification and more.

`nltk.corpus` is a module of the Natural Language Toolkit. It contains large sets of text for linguistic analysis and development.

The `.words` method returns the words of a corpus. In this instance, we call it on the stopwords corpus to get the English stopword list.
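
As a small sketch of what a stopword list is used for, the snippet below filters a sentence against a set of stopwords. The set `sample_stopwords` is a hypothetical, hand-picked stand-in so the example runs without downloading anything; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# in the notebook this role is played by set(stopwords.words("english")).
sample_stopwords = {"the", "is", "a", "to", "and"}

sentence = "the model is a tool to flag abuse"

# Keep only the words that are not in the stopword set.
kept = [word for word in sentence.split() if word not in sample_stopwords]
print(kept)
```

Storing the stopwords in a `set` keeps each `not in` check fast, which matters when the same filter runs over many tweets.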

In [None]:
stemmer = nltk.SnowballStemmer("english")
from nltk.corpus import stopwords
import string
stopword = set(stopwords.words("english"))

### The dataset we will use can be found here:
https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbmp2NDVmdmoyNGpSS2hrdGNwdndvRl9WZnpqQXxBQ3Jtc0traEYtV0J4Ym5iYlJJa05tOXJyc1RSOEVzcnhOTUVxbU9YQnV2TWNZWVZ4WWZwRThCOTRUR2hFam9mbDZ5cW1Pa0VfRXhTcmhVaTBvX3pWeUN0THhwYVQycWZVbE1vcmxnakphdFF3SGxCMXhDNF9FUQ&q=https%3A%2F%2Fdrive.google.com%2Fdrive%2Ffolders%2F1uQiyJ_mDlOCcecMw7C-JYUs9bGnVJ_j8%3Fusp%3Dsharing&v=jbexvUovHxw

- We import our dataset.
- We drop any rows with `NaN` values.
- We preview the first rows of our dataset with `.head()`.

In [None]:
df1 = pd.read_csv("twitter_data.csv")
df1 = df1.dropna()
df1.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


#### `.tolist()` converts an array-like object (here, the DataFrame's column index) into a Python list.

In [None]:
print(df1.columns.tolist())


['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither', 'class', 'tweet']


- The pandas `.map()` method, when given a dictionary, replaces each value in a column with the dictionary entry for that value.
- We used `.map()` to turn the class codes 0, 1, and 2 into "Hate Speech Detected", "Offensive language detected", and "No hate and offensive speech".
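
Here is a plain-Python sketch of the same idea, assuming a short made-up list of class codes. Looking each code up in a dictionary mirrors what `.map()` does with a dict, and the second step collapses the three labels into the two used for the rest of this project.

```python
# Made-up class codes, standing in for the values of the `class` column.
classes = [2, 1, 1, 0]

code_to_label = {
    0: "Hate Speech Detected",
    1: "Offensive language detected",
    2: "No hate and offensive speech",
}

# Step 1: look each code up in the dictionary.
labels = [code_to_label.get(code) for code in classes]

# Step 2: merge the two harmful categories into one label.
binary = [
    "Not Hate" if label == "No hate and offensive speech"
    else "Offensive or Hate Speech"
    for label in labels
]

print(labels)
print(binary)
```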

### Preprocess the Labels

In [None]:
df1['labels'] = df1['class'].map({0:"Hate Speech Detected", 1:"Offensive language detected", 2:"No hate and offensive speech"})

# Merging the labels
def unify_labels(row):
    if row['labels'] in ['Hate Speech Detected', 'Offensive language detected']:
        return 'Offensive or Hate Speech'
    else:
        return 'Not Hate'

# Apply this function to the dataset with three labels
df1['labels'] = df1.apply(unify_labels, axis=1)

In [None]:
df1['labels'].info
df1.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,labels
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,Not Hate
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,Offensive or Hate Speech
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,Offensive or Hate Speech
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,Offensive or Hate Speech
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,Offensive or Hate Speech


### Import the second dataset

In [None]:
!pip install datasets

from datasets import load_dataset

df2 = load_dataset("LennardZuendorf/Dynamically-Generated-Hate-Speech-Dataset")



In [None]:
# Convert it to a DataFrame
df2 = df2['train'].to_pandas() # or dataset['split'].to_pandas()

# Rename columns
df2.rename(columns={'text': 'tweet', 'label': 'labels'}, inplace=True)

# Replace label values
label_mapping = {'hate': 'Offensive or Hate Speech', 'nothate': 'Not Hate'}
df2['labels'] = df2['labels'].replace(label_mapping)

In [None]:
df2['labels'].info
df2.head()

Unnamed: 0.1,Unnamed: 0,acl.id,X1,tweet,labels,type,target,level,split,round.base,annotator,round,acl.id.matched
0,1,acl1,1,dalits are lowlives,Offensive or Hate Speech,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
1,2,acl10,2,gay people are a burden to our society,Offensive or Hate Speech,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
2,3,acl100,3,Arabs are not welcome,Offensive or Hate Speech,notgiven,notgiven,original,train,1,vDe7GN0NrL,1,
3,4,acl1000,4,I'm not saying we should actually eliminate he...,Offensive or Hate Speech,notgiven,notgiven,original,train,1,oemYWm1Tjg,1,
4,5,acl10000,5,bananas are for black people,Offensive or Hate Speech,notgiven,notgiven,original,test,1,QiOKkCi7F8,1,


### Import the third dataset

In [None]:
# https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech

df3 = load_dataset('ucberkeley-dlab/measuring-hate-speech', 'default')

# Convert it to a DataFrame
df3 = df3['train'].to_pandas()

def classify_or_mark_for_test(row):
    if row['hate_speech_score'] >= -0.25:
        return 'Offensive or Hate Speech'
    elif row['hate_speech_score'] < -0.25:
        return 'Not Hate'
    else:
        return None

# Apply the modified function
df3['labels'] = df3.apply(classify_or_mark_for_test, axis=1)

# Rename columns
df3.rename(columns={'text': 'tweet'}, inplace=True)

In [None]:
df3['labels'].info
df3.head()

Unnamed: 0,comment_id,annotator_id,platform,sentiment,respect,insult,humiliate,status,dehumanize,violence,...,annotator_religion_jewish,annotator_religion_mormon,annotator_religion_muslim,annotator_religion_nothing,annotator_religion_other,annotator_sexuality_bisexual,annotator_sexuality_gay,annotator_sexuality_straight,annotator_sexuality_other,labels
0,47777,10873,3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,False,False,False,False,False,False,False,True,False,Not Hate
1,39773,2790,2,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,False,False,False,False,False,False,False,True,False,Not Hate
2,47101,3379,3,4.0,4.0,4.0,4.0,4.0,4.0,0.0,...,False,False,False,True,False,False,False,True,False,Offensive or Hate Speech
3,43625,7365,3,2.0,3.0,2.0,1.0,2.0,0.0,0.0,...,False,False,False,False,False,False,False,True,False,Offensive or Hate Speech
4,12538,488,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,False,False,False,False,False,False,False,True,False,Offensive or Hate Speech


### Format each dataset to two columns: tweet and labels

In [None]:
df1 = df1[['tweet', 'labels']]
df1 = df1[['tweet', 'labels']].fillna(0)
df1.head()

Unnamed: 0,tweet,labels
0,!!! RT @mayasolovely: As a woman you shouldn't...,Not Hate
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,Offensive or Hate Speech
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,Offensive or Hate Speech
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,Offensive or Hate Speech
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,Offensive or Hate Speech


In [None]:
df2 = df2[['tweet', 'labels']]

# Manually set the label for rows with indexes 5 and 17 to "Offensive or Hate Speech"
df2.loc[[5, 17], 'labels'] = 'Offensive or Hate Speech'
df2.head()

Unnamed: 0,tweet,labels
0,dalits are lowlives,Offensive or Hate Speech
1,gay people are a burden to our society,Offensive or Hate Speech
2,Arabs are not welcome,Offensive or Hate Speech
3,I'm not saying we should actually eliminate he...,Offensive or Hate Speech
4,bananas are for black people,Offensive or Hate Speech


In [None]:
df3 = df3[['tweet', 'labels']]
df3.head()

Unnamed: 0,tweet,labels
0,Yes indeed. She sort of reminds me of the elde...,Not Hate
1,The trans women reading this tweet right now i...,Not Hate
2,Question: These 4 broads who criticize America...,Offensive or Hate Speech
3,It is about time for all illegals to go back t...,Offensive or Hate Speech
4,For starters bend over the one in pink and kic...,Offensive or Hate Speech


### Now begins the process of cleaning the text

- `clean(text)` is the function that performs the text cleaning.
- `text = re.sub('\[.*?\]', '', text)` uses the `re` module to remove any text enclosed in square brackets (the pattern `'\[.*?\]'`) by replacing it with an empty string.

#### `re.sub` works like this:

    import re

    text = "Hello, world! This is a test string."

    # Replace all occurrences of 'world' with 'planet'
    new_text = re.sub(r'world', 'planet', text)
    print(new_text)

- We use the same `re.sub` method to remove URLs (`'https?://\S+|www\.\S+'`) and HTML tags (`'<.*?>+'`) by replacing them with `''`, the empty string.
- Breaking down the `'[%s]' % re.escape(string.punctuation)` code: `string.punctuation` contains all of the ASCII punctuation characters. `re.escape` escapes any character that has a special meaning in a regular expression so that it is treated as a literal.

- We put the `%s` in brackets `[]` because brackets define a character class, so the pattern matches any single character contained in the given class.
- `%s` is a format placeholder: the `%` operator substitutes the escaped punctuation string into the pattern at that position.
- `text = re.sub('\n', '', text)` removes all of the newline characters from the text by replacing them with the empty string.
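
To see the punctuation pattern at work, here is a short sketch on a made-up sample string. It uses only `re` and `string` from the standard library.

```python
import re
import string

# string.punctuation is a plain string of ASCII punctuation characters.
print(string.punctuation)

# re.escape backslash-escapes the characters that are special in a regex,
# so they are matched literally inside the [...] character class.
pattern = '[%s]' % re.escape(string.punctuation)

sample = "Hello, world!!! (this is #1)"
cleaned = re.sub(pattern, '', sample)
print(cleaned)
```

Every punctuation character is deleted, while letters, digits, and spaces are left alone.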

### We will break down `text = re.sub('\w*\d\w*', "", text)`
- ` \w` matches any word character; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
- ` \d` matches decimal digits 0-9.
- ` *` matches zero or more occurrences of the preceding element, here `\w`.
- Taken together, the pattern matches any word that contains at least one digit, and `re.sub` removes it.
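
A quick sketch of the digit pattern on a made-up sentence. Note that removing a word leaves its surrounding spaces behind, which is why the later `split()` and `" ".join(...)` steps in the cleaning function are useful.

```python
import re

sample = "call me at 5pm room b12 floor 3 ok"

# Remove every word that contains at least one digit.
result = re.sub(r'\w*\d\w*', '', sample)
print(repr(result))

# Splitting on whitespace and re-joining collapses the leftover gaps.
tidy = " ".join(result.split())
print(tidy)
```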

### `text = [word for word in text.split() if word not in stopword]` operates as such:

1. `text.split()` splits the input text into a list of words.
2. The list comprehension `[word for word in text.split() if word not in stopword]` creates a new list containing only the words that are not stopwords.
3. We join the remaining words back into one string and return the cleaned text with `return text`.
4. We then apply our function to the "tweet" column.

This was not part of the tutorial, but I also dropped `NaN` values from the dataset (in this notebook, with `df1 = df1.dropna()` when loading the data).


In [None]:
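def clean(text):
    # Raw strings (r'...') keep regex backslashes such as \[ and \S intact.
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>+', '', text)                 # drop HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\n', '', text)                     # drop newlines
    text = re.sub(r'\w*\d\w*', '', text)               # drop words containing digits
    text = [word for word in text.split() if word not in stopword]  # drop stopwords
    text = " ".join(text)
    return text

# Apply the cleaning function to the 'tweet' column of df1.
# Note: df2 and df3 are not cleaned in this notebook.
df1['tweet'] = df1['tweet'].apply(clean)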

In [None]:
# Print the head of df1
print("df1 head:")
print(df1.head())

# Add a separator for better readability
print("\n" + "-"*78 + "\n")

# Print the head of df2
print("df2 head:")
print(df2.head())

# Add a separator for better readability
print("\n" + "-"*78 + "\n")

# Print the head of df3
print("df3 head:")
print(df3.head())

df1 head:
                                               tweet                    labels
0  rt mayasolovely woman shouldnt complain cleani...                  Not Hate
1  rt boy dats coldtyga dwn bad cuffin dat hoe place  Offensive or Hate Speech
2  rt urkindofbrand dawg rt ever fuck bitch start...  Offensive or Hate Speech
3           rt cganderson vivabased look like tranny  Offensive or Hate Speech
4  rt shenikaroberts shit hear might true might f...  Offensive or Hate Speech

------------------------------------------------------------------------------

df2 head:
                                               tweet                    labels
0                                dalits are lowlives  Offensive or Hate Speech
1             gay people are a burden to our society  Offensive or Hate Speech
2                              Arabs are not welcome  Offensive or Hate Speech
3  I'm not saying we should actually eliminate he...  Offensive or Hate Speech
4                       banana

### Our data is ready. Now to build the classification model.

- We start by creating NumPy arrays of our text and labels with `np.array(combined_df["tweet"])` and `np.array(combined_df["labels"])`.

- `CountVectorizer()` converts our text data into a matrix of token counts.

- `cv.fit_transform(x)` calls the `fit_transform` method of the `CountVectorizer` on x.

- `fit_transform` first learns the vocabulary from our data (fit) and then converts the data into counts using that vocabulary (transform).

- Next, we create the variables `X_train, X_test, y_train, y_test` and assign them the result of `train_test_split`.
- We pass our x and y values into `train_test_split` with a `test_size` of 0.25 and a `random_state` of 42.
- The `test_size` value 0.25 means that 25% of the data will be used for testing, while 75% will be used for training.
- `train_test_split` shuffles the data before splitting it, and the `random_state` parameter seeds that shuffle.
- Setting `random_state` to 42 produces the same split on every execution.
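
To illustrate why a fixed seed makes the split repeatable, here is a standard-library sketch. The helper `simple_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows the idea of "shuffle with a seed, then cut off the test share".

```python
import random

def simple_split(items, test_size, seed):
    """Shuffle a copy of items with a seeded RNG, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(8))

# Two runs with the same seed give exactly the same split.
train_a, test_a = simple_split(data, test_size=0.25, seed=42)
train_b, test_b = simple_split(data, test_size=0.25, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

With 8 items and a `test_size` of 0.25, 2 items go to the test set and 6 to the training set, and rerunning with the same seed reproduces the same assignment.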

### Decision Tree Classifier

- The decision tree chooses the best feature to split on first (the root).
- Next, it splits the data based on that feature.
- Then, the decision tree repeats the above steps on each branch until a leaf node is pure (all of its samples share one label) or there are no more decisions to make.
- We chose a decision tree because it suits our labeled data and what we want it to do (predict hate speech).

![image.png](attachment:image.png)
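
One way to make "best feature" and "pure" concrete is Gini impurity, which is the default split criterion of scikit-learn's `DecisionTreeClassifier`. The sketch below is a simplified illustration with made-up labels and two hypothetical candidate splits, not the library's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for a more mixed node."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(left, right):
    """Weighted average impurity of the two child nodes after a split."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

parent = ["hate", "hate", "not", "not"]

# Hypothetical split A separates the classes perfectly; split B does not.
split_a = (["hate", "hate"], ["not", "not"])
split_b = (["hate", "not"], ["hate", "not"])

print(gini(parent))
print(split_impurity(*split_a))
print(split_impurity(*split_b))
```

The tree would prefer split A, because it lowers the impurity the most; both of its children are pure, so no further decisions are needed there.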

#### Finally, we fit our model.

- The `.fit` method takes the input features (X) and corresponding target labels (y) as arguments and uses them to train the model. In other words, fitting the model == training the model.

- the line of code `clf.fit(X_train,y_train)` is equivalent to `DecisionTreeClassifier().fit(X_train,y_train)`.

- We now have a trained classification model!


In [None]:
combined_df = pd.concat([df1, df2, df3], ignore_index=True)

x = np.array(combined_df["tweet"])
y = np.array(combined_df["labels"])

cv = CountVectorizer()
x = cv.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.25, random_state = 42)
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)



In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Accuracy: 0.9212820301557222
                          precision    recall  f1-score   support

                Not Hate       0.91      0.92      0.91     27996
Offensive or Hate Speech       0.93      0.92      0.93     32689

                accuracy                           0.92     60685
               macro avg       0.92      0.92      0.92     60685
            weighted avg       0.92      0.92      0.92     60685



### Testing the model
- test_data will be the words that are input to test if the words are offensive or not.

#### Let's break down the code `df = cv.transform([test_data]).toarray()` :


- `cv.transform` converts text data into a numerical representation that machine learning algorithms can process.
- We transform `test_data` into a numerical vector using the `transform()` method, which reuses the vocabulary learned during `fit_transform`.
- Next, we use `.toarray()` to convert the transformed test data from a sparse matrix into a dense NumPy array. The tutorial includes this step, although scikit-learn's decision tree also accepts sparse input (the model here was trained on the sparse matrix).

#### Lastly, we will make predictions using the trained model.

- The `predict` method takes the data in `df` and returns an array containing the predicted label for each data point.
- In the case of our hate speech ML model, we use the `predict()` method to predict whether the entered text falls under "Offensive or Hate Speech" or "Not Hate", the two labels we merged our datasets into.
- `clf.predict(df)` means that our `DecisionTreeClassifier`, `clf`, takes the features from our `df` array and makes predictions based on the patterns and relationships it learned during training.

### Below is the code that the video wanted me to use to test this model. I will write a function later so we do not have to re-type the test data string every time.
- If you would like to test out the model, edit the words in the `test_data` string.
- This model is not perfect; test it out and document what works and what does not.
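
One possible shape for that helper is sketched here. The name `classify_text` is my own, and the two small classes are toy stand-ins so the sketch runs by itself; in the notebook you would pass the fitted `cv` and `clf` instead, for example `classify_text("some text", cv, clf)`.

```python
def classify_text(text, vectorizer, model):
    """Vectorize one string and return the model's predicted label."""
    features = vectorizer.transform([text])
    return model.predict(features)[0]


# Toy stand-ins (not real models) so this sketch is self-contained.
class KeywordVectorizer:
    def transform(self, texts):
        # One feature per text: 1 if the word "hate" appears, else 0.
        return [[int("hate" in t.lower())] for t in texts]


class ThresholdModel:
    def predict(self, rows):
        return ["Offensive or Hate Speech" if row[0] else "Not Hate" for row in rows]


first = classify_text("I hate mondays", KeywordVectorizer(), ThresholdModel())
second = classify_text("good morning", KeywordVectorizer(), ThresholdModel())
print(first)
print(second)
```

Passing the vectorizer and model in as arguments keeps the helper independent of global variable names, so it keeps working if the model is retrained under a different name.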

In [None]:
test_data = "Arabs welcome"
# Use a separate name so the combined_df DataFrame is not overwritten.
df = cv.transform([test_data]).toarray()
print(clf.predict(df))

['Offensive or Hate Speech']


# Conclusion of Hate Speech Detection project

- I learned what many tools in sklearn do, such as `train_test_split()`, `fit()`, and `CountVectorizer`.
- I learned about the nltk library, which is quite useful for NLP and text processing tasks.
- I learned about decision trees and stopwords.
- I learned that `.tolist()` converts array-like objects, such as a NumPy array or the DataFrame's column index, into Python lists.
- I learned how to use the `map()` function.
- I learned much about the `re.sub` function for text cleaning.
- I learned why we pass values such as 0.25 and 42 as `test_size` and `random_state` to `train_test_split`.
- I learned how to split data into training and test sets. For this lab, it was done by:
`X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.25, random_state = 42)`
- I learned that `.toarray()` converts the sparse matrix returned by `cv.transform` into a dense NumPy array before we make predictions.
- I learned that the `.predict` method predicts values based on the trained classifier. In this case, it was `clf`.



