---
title: Famous Landmarks Classifier Cnn
emoji: 🌉
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 3.45.2
app_file: app.py
pinned: false
license: mit
---
# Landmarks Classification and Tagging using CNN

In this project we solve a multi-label classification problem by classifying/tagging a given image of a famous landmark using a CNN (Convolutional Neural Network).
## Features

⚡ Multi-Label Image Classification
⚡ Custom CNN
⚡ Transfer Learning CNN
⚡ PyTorch
## Table of Contents
- Introduction
- Objective
- Dataset
- Evaluation Criteria
- Solution Approach
- How To Use
- License
- Get in touch
- Credits
## Introduction
Photo sharing and photo storage services like to have location data for each uploaded photo. In addition, these services can build advanced features with the location data, such as the automatic suggestion of relevant tags or automatic photo organization, which help provide a compelling user experience. However, although a photo's location can often be obtained by looking at the photo's metadata, many images uploaded to these services will not have location metadata available. This can happen when, for example, the camera capturing the picture does not have GPS or if a photo's metadata is scrubbed due to privacy concerns.
If no location metadata for an image is available, one way to infer the location is to detect and classify a discernible landmark in the picture. However, given the large number of landmarks worldwide and the immense volume of images uploaded to photo-sharing services, using human judgment to classify these landmarks would not be feasible. In this project, we'll try to address this problem by building Neural Network (NN) based models to automatically predict the location of the image based on any landmarks depicted in the picture.
## Objective

To build an NN-based model that accepts any user-supplied image as input and suggests the top `k` most relevant landmarks from the `50 possible` landmarks from across the world.
- Download the dataset
- Build a CNN based neural network from scratch to classify the landmark image
- Here, we aim to attain a test accuracy of at least 30%. At first glance, an accuracy of 30% may appear to be very low, but it's way better than random guessing, which would provide an accuracy of just 2% since we have 50 different landmark classes in the dataset.
- Build a CNN based neural network, using transfer-learning, to classify the landmark image
- Here, we aim to attain a test accuracy of at least 60%, which is pretty good given the complex nature of this task.
- Implement an inference function that accepts a file path to an image and an integer `k` and then predicts the top `k` most likely landmarks the image belongs to. A sample output from the predict function would indicate, for example, the top 3 (k = 3) possibilities for the image in question.
## Dataset

- Dataset to be downloaded from here. Note that this is a mini dataset containing around 6,000 images; it is a small subset of the Original Landmark Dataset, which has over 700,000 images.
- The unzipped dataset has the parent folder `landmark_images`, containing training data in the `train` sub-folder and testing data in the `test` sub-folder
- There are 1,250 images in the `test` sub-folder, to be kept hidden and used only for model evaluation
- There are 4,996 images in the `train` sub-folder, to be used for training and validation
- Images in the `test` and `train` sets are further categorized into one of the 50 sub-folders representing the 50 different landmark classes (numbered 0 to 49)
- Images in the dataset are of different sizes and resolutions
- Here are a few samples from the training dataset with their respective label descriptions...
## Evaluation Criteria

### Loss Function

We use `LogSoftmax` in the output layer of the network...

We need a suitable loss function that consumes these log-probability outputs and produces a total loss. The function we are looking for is `NLLLoss` (Negative Log-Likelihood Loss). In practice, `NLLLoss` is nothing but a generalization of `BCELoss` (Binary Cross-Entropy Loss, or Log Loss) extended from the binary-class to the multi-class problem.
Note the negative sign in front of the `NLLLoss` formula, hence the "negative" in the name. The negative sign is there to make the average loss positive: the log of a number less than 1 is negative, so without the sign flip the overall average loss would be negative, and to reduce it we would have to *maximize* the loss function rather than minimize it. Flipping the sign lets us minimize as usual, which is the standard and mathematically more convenient formulation.
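The `LogSoftmax` + `NLLLoss` pairing described above can be sketched in a few lines (the scores and class count here are purely illustrative):

```python
import torch
import torch.nn as nn

# LogSoftmax in the output layer produces log-probabilities;
# NLLLoss then takes -log p(correct class), averaged over the batch.
log_softmax = nn.LogSoftmax(dim=1)
criterion = nn.NLLLoss()

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 3.0, 0.3]])   # raw scores for 3 hypothetical classes
targets = torch.tensor([0, 1])             # ground-truth class indices

log_probs = log_softmax(logits)
loss = criterion(log_probs, targets)       # positive, thanks to the leading minus sign
```

Note that `LogSoftmax` followed by `NLLLoss` is numerically equivalent to applying `nn.CrossEntropyLoss` directly to the raw logits.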
### Performance Metric

`accuracy` is used as the model's performance metric on the test set.
## Solution Approach

- Once the dataset is downloaded and unzipped, we split the training set into training and validation sets in an 80%:20% (3,996:1,000) ratio and keep the images in respective `train` and `val` sub-folders.
- The `train` data is then used to build a PyTorch `Dataset` object; after applying data augmentations, images are resized to 128x128.
- The `mean` and `standard deviation` are computed for the train dataset, and then the dataset is `normalized` using the calculated statistics.
- The RGB channel histogram of the train set is shown below...
- The RGB channel histogram of the train set after normalization is shown below...
- Now, `test` and `val` `Dataset` objects are prepared in the same fashion, where images are resized to 128x128 and then normalized.
- The training, validation, and testing datasets are then wrapped in PyTorch `DataLoader` objects so that we can iterate through them with ease. A typical `batch_size` of 32 is used.
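The mean/standard-deviation computation mentioned above could be implemented with a helper like the one below (a sketch only; `channel_stats` is a name assumed here, not necessarily the notebook's):

```python
import torch

def channel_stats(loader):
    """Per-channel mean and std over all images in a DataLoader.

    Assumes batches of shape (B, 3, H, W) with equal-sized images.
    """
    n, mean, sq = 0, torch.zeros(3), torch.zeros(3)
    for images, _ in loader:
        n += images.size(0)
        mean += images.mean(dim=(2, 3)).sum(dim=0)      # sum of per-image channel means
        sq += images.pow(2).mean(dim=(2, 3)).sum(dim=0)
    mean /= n
    std = (sq / n - mean.pow(2)).sqrt()                 # population std via E[x^2] - E[x]^2
    return mean, std
```

The returned statistics would then feed `torchvision.transforms.Normalize(mean, std)` in the `Dataset` transform pipeline.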
### CNN from scratch

- The neural network is implemented as a subclass of the `nn.Module` PyTorch class. The final network presented here was built incrementally through many experiments:
  - Started with a very small CNN of just two convolutions and a linear layer with `LogSoftmax` output.
  - Tried to overfit the network on a single batch of 32 training images; the network found it hard to overfit, which means it was not powerful enough.
  - Gradually increased the Conv and Linear layers until the batch could be overfitted easily.
  - Then trained on the complete training data, adjusting layers and output sizes to ensure that the training loss goes down.
  - Then trained again with validation data to select the best network with the lowest validation loss.
- `ReLU` is used as the activation function, and `BatchNorm` is used after every layer except the last.
- The final model architecture (from scratch) is shown below...
- The network's initial weights are initialized with numbers drawn from a `normal distribution` in the range...
- The network is then trained and validated for 15 epochs using the `NLLLoss` function and the `Adam` optimizer with a learning rate of 0.001. We save the trained model here as `ignore.pt` (ignore, because we are not using it for evaluation).
- We keep track of training and validation losses. When plotted, we observe that the model starts to `overfit` very quickly.
- Now, we reset the network's initial weights to the PyTorch defaults to check if there is any improvement.
- The network is then trained and validated again for 15 epochs using the `NLLLoss` function and the `Adam` optimizer with a learning rate of 0.001. We save the trained model here as `model_scratch.pt` (we will use this saved model for evaluation).
- We keep track of training and validation losses. When plotted, we observe that the result is almost the same as that of the custom weight initialization.
- The trained network (`model_scratch.pt`) is then loaded and evaluated on the 1,250 unseen testing images. The network achieves around `38%` accuracy, which is more than we aimed for (i.e., 30%): it correctly classifies `475` images out of the total `1250` test images.
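A from-scratch network of the kind described above could look like the sketch below. The layer counts and sizes are illustrative assumptions, not the notebook's exact architecture; the fixed points from the text are `ReLU` activations, `BatchNorm` after every layer except the last, `LogSoftmax` output, and 128x128 inputs:

```python
import torch
import torch.nn as nn

class LandmarkCNN(nn.Module):
    """Illustrative from-scratch CNN for 50 landmark classes."""

    def __init__(self, num_classes=50):
        super().__init__()
        # Three conv blocks halve the 128x128 input to 16x16.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, num_classes), nn.LogSoftmax(dim=1),  # log-probs for NLLLoss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

A model like this would be trained with `nn.NLLLoss()` and `torch.optim.Adam(model.parameters(), lr=0.001)`, as described above.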
### CNN using transfer learning

- Here, we use transfer learning to implement the CNN network to classify images of landmarks.
- We have selected `VGG19`, pre-trained on `ImageNet`, as our base model. Models pre-trained and tested on ImageNet can extract general features even from datasets that may not be very similar to ImageNet. This is due to the sheer size of the ImageNet dataset (1.2 million images) and the number of classes (1,000). Instead of `VGG19`, we could have chosen `ResNet` or `DenseNet` as our base network; they would have worked just fine. `VGG19` was selected here because of the simplicity of its architecture while still producing impressive results.
- The VGG19 model's weights are frozen so that they do not change during training.
- A `custom-classifier` with `ReLU` activation, `Dropout` in the hidden layers, and `LogSoftmax` in the last layer is created. The original classifier layer in VGG19 is replaced by the `custom-classifier` with learnable weights.
- The final model architecture (transfer learning) is shown below...
- The network is then trained and validated for ten epochs using the `NLLLoss` function and the `Adam` optimizer with a learning rate of 0.001. Note that the optimizer has been supplied with the learnable parameters of the `custom-classifier` only, and not the whole model. This is because we want to optimize our custom-classifier weights only and use the ImageNet-learned weights for the rest of the layers.
- We keep track of training and validation losses and plot them.
- The trained network is saved as `model_transfer.pt`.
- The trained network `model_transfer.pt` is then loaded and evaluated on the 1,250 unseen testing images. This time the network achieves around `63%` accuracy, which is more than what we aimed for (i.e., 60%): it correctly classifies `788` images out of the total `1250` test images. As we can see, the model built using transfer learning has outperformed the model built from scratch; hence, the second model will be used to predict unseen images.
### Interface for inference

For our model to be used easily, we'll implement a function `predict_landmarks` which:

- Accepts a `file-path` to an image and an integer `k`. The function expects the trained model `model_transfer.pt` to be present in the same folder/directory from where the function is invoked. The trained model can be downloaded from here.
- Predicts and returns the top k most likely landmarks.

The `predict_landmarks` function can be invoked from a `python` script or shell; an example is shown below...

```python
>>> predicted_landmarks = predict_landmarks('images/test/09.Golden_Gate_Bridge/190f3bae17c32c37.jpg', 5)
>>> print(predicted_landmarks)
['Golden Gate Bridge', 'Forth Bridge', 'Sydney Harbour Bridge', 'Brooklyn Bridge', 'Niagara Falls']
```

We create another higher-level function, `suggest_locations`, that accepts the same parameters as `predict_landmarks` and internally uses the `predict_landmarks` function. A sample of the function's usage and its output is shown below...

```python
>>> suggest_locations('assets/Eiffel-tower_night.jpg')
```
## How To Use

Open the LIVE app. The app has been deployed on Hugging Face Spaces.

### Training and testing using the Jupyter notebook

- Ensure the below-listed packages are installed:
  - `NumPy`
  - `matplotlib`
  - `torch`
  - `torchvision`
  - `cv2`
  - `PIL`
- Download the `landmark-classification-cnn-pytorch.ipynb` Jupyter notebook from this repo.
- To train the models, it's recommended to execute the notebook one cell at a time. If a GPU is available (recommended), it'll be used automatically; otherwise, the notebook falls back to the CPU.
- On a machine with an `NVIDIA Quadro P5000` GPU with 16 GB memory, it takes approximately 15-18 minutes to train and validate the `from scratch` model for 15 epochs.
- On a machine with an `NVIDIA Quadro P5000` GPU with 16 GB memory, it takes approximately 15-18 minutes to train and validate the `transfer-learning` model for ten epochs.
- A fully trained model `model_transfer.pt` can be downloaded from here. This model can then be used directly for tagging new landmark images, as described in the Interface for inference section.
## License

## Get in touch

## Credits

- The dataset used in this project is provided by Udacity
- The above dataset is a subset of the original landmarks dataset by Google: Original Landmark Dataset
- Title photo by Patty Jansen on Pixabay