Image Classification and Segmentation of Pokemon images using Deep Neural Networks

May 8, 2022

Index

1. Objective
2. Implementation
   2.1 Multiclass Classification
   2.2 Multilabel Classification
   2.3 Semantic Segmentation
3. Results

1. Objective

The goal of this assignment was to create, evaluate and select deep neural networks for image classification and segmentation. The dataset consists of 4000 photos of Pokemon found near the FCT campus. Here is a composite of the first 100 images in the dataset:
Each image is a 64x64x3 array: 64x64 pixels with three color channels (RGB). We were also given segmentation masks indicating the region corresponding to each Pokemon, as illustrated here:
Finally, we also have a CSV file indicating, for each image, the name and the primary and secondary type of the Pokemon, in the same order as the images, and a .txt file listing the 10 different types. For convenience, the CSV table also includes the index of each type for every Pokemon.
.csv
Charizard,Fire,Flying,2,3
Vileplume,Grass,Poison,4,7
Venonat,Bug,Poison,0,7
...
.txt
Bug
Fighting
Fire
...
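For illustration, the label files could be loaded along these lines; the file names (pokemon.csv, types.txt) are my assumptions, and the CSV is assumed to have no header row:

import pandas as pd

# Each row: name, primary type, secondary type, primary index, secondary index.
labels = pd.read_csv('pokemon.csv', header=None,
                     names=['name', 'type1', 'type2', 'type1_idx', 'type2_idx'])

# One type name per line, in index order (0 = Bug, 1 = Fighting, 2 = Fire, ...).
with open('types.txt') as f:
    types = [line.strip() for line in f]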
In this assignment we completed these tasks:
Multiclass classification:
Design and train a neural network that can predict the primary type of a Pokemon from an image.
Multilabel classification:
Design and train a neural network that can predict the types (primary and secondary type) associated with a Pokemon from an image.
Semantic segmentation:
Design and train a neural network that can create a mask showing the pixels that correspond to the Pokemon in an image.

2. Implementation

The best way to solve these kinds of image classification and segmentation problems is with Convolutional Neural Networks. For each of the tasks I created a network of this type and customized it so that it could solve the problem efficiently and correctly. Here is a small guide I found that explains, in a simple way, these types of networks and their different components.
2.1 Multiclass classification:
For this problem I used the following network architecture:
""" Model Architecture """
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, Input,
                                     MaxPooling2D)

model = Sequential([
    Input(shape=(64, 64, 3), name='inputs'),

    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(64, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    MaxPooling2D(pool_size=(4, 4)),

    # Dense part
    Flatten(name='features'),
    Dropout(0.6),
    Dense(32),
    Activation('relu'),
    BatchNormalization(),
    Dense(10),
    Activation('softmax'),
])
For the first few convolutions I used convolutional layers with 32 filters and a (3,3) kernel with 'same' padding, so the first two dimensions (64x64) of the images are unchanged; 32 filters was enough to capture the first, most "general" patterns of the pictures.
They all use the ReLU activation because they are hidden layers: ReLU helps prevent the gradients from vanishing, is fast to compute, and applies a non-linearity to the output tensor.
Each convolution is followed by a BatchNormalization layer to standardize the output and facilitate the learning of the next layers.
MaxPooling2D layers with pool sizes of 2x2 and 4x4 lower the tensor's first two dimensions from 64x64 down to 4x4, as checked in the sketch below.
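A quick sanity check of the shape arithmetic (my own, not from the report): 'same' padding keeps the height and width through every convolution, so only the pooling layers shrink the feature maps.

size = 64
for pool in (2, 2, 4):  # the three MaxPooling2D stages
    size //= pool
print(size)  # 4: the final feature map is 4x4 with 64 filters,
             # so Flatten yields 4 * 4 * 64 = 1024 values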

I then use a Flatten layer to collapse the tensor into a single dimension of size 1024 so it can be the input to a dense layer of 32 neurons, preceded by a Dropout layer with a rate of 60% for a regularization effect that helps prevent overfitting. Overfitting is a particular concern here since we have a limited dataset of only 4000 samples. The dense layer has 32 neurons because the results showed this amount was enough for the network to interpret the last feature map produced by the convolutional layers.
Once again I used a BatchNormalization layer to standardize the output and facilitate the learning of the network's last layer, which has only 10 neurons because each one represents a different primary type of the Pokemon in the dataset.

The final layer uses a softmax activation because in this task we want to predict the primary type of a Pokemon given an image of it. Each Pokemon has exactly one primary type, so if the output of each of the 10 neurons from the previous layer is the predicted probability of that specific type being the primary type, then we want to maximize the probability of one type (the true primary type) over the other 9. This is the intended use of softmax: its activation values always sum to 1, and the class with the highest value is our prediction.
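As a quick illustration (my own, not from the original report), here is softmax turning 10 raw scores into a probability distribution:

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.2, 0.3, -0.5, 2.1, 0.0, -1.3, 0.4, 0.9, -0.2, 0.6])
probs = softmax(logits)
print(probs.sum())     # 1.0
print(probs.argmax())  # 3: the index of the predicted primary type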

The loss function used was categorical_crossentropy, since we intend to predict the single correct class out of 10 classes total, maximizing the predicted probability of the Pokemon's true primary type and minimizing all others. This is a very cut-and-dried single-label classification problem, so categorical_crossentropy is the obvious choice.
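A minimal sketch of how this model could be compiled and trained; the placeholder data, variable names, and hyperparameters below are illustrative assumptions, not from the report:

import numpy as np
from tensorflow.keras.utils import to_categorical

# Placeholder data just for the sketch: in the real assignment the images
# and primary-type indices come from the dataset and the CSV.
X_train = np.random.rand(100, 64, 64, 3).astype('float32')
primary_type_idx = np.random.randint(0, 10, size=100)

y_train = to_categorical(primary_type_idx, num_classes=10)  # one-hot, 10 classes

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, validation_split=0.2, batch_size=32, epochs=5)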

2.2 Multilabel classification:
The overall model architecture is exactly the same as the best-performing model from the multiclass problem used to predict the Pokemon's primary type, since the task here is essentially the same except for the output: we are no longer predicting only the Pokemon's primary type but also its secondary type. This changes the problem from a multiclass one, with 10 possible types but only one type (the primary) per output, to a multilabel one with the same 10 types, where we now also want to predict the secondary type, meaning each output contains 2 types.

In terms of architecture, this change requires the output neurons to be independent: we no longer want a single class picked over all the others, but rather each neuron to output the probability of the Pokemon having that type, with the network then picking the neurons whose output is higher than 0.5. For this reason I changed the output activation function from softmax to sigmoid.

This technically means the network could output 3 or more types instead of the desired 2, but this can be avoided by training the network successfully with the appropriate labels, which only ever have 2 types per sample.
For the network to learn correctly we need a loss function that judges each output neuron independently and suggests the changes that eventually mold the output into the desired 2 types per example.
That loss function is binary_crossentropy.
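A hedged sketch of the multilabel setup; the multi-hot label encoding and the top-2 selection at prediction time are my illustrative choices, not necessarily what the original code did, and type_pairs / X_test are placeholder names:

import numpy as np

# Multi-hot labels: a 10-dim vector with 1s at the primary and secondary
# type indices (e.g. Charizard -> indices 2 and 3 from the CSV).
# type_pairs: list of (primary_idx, secondary_idx) tuples (placeholder name).
y_train = np.zeros((len(type_pairs), 10), dtype='float32')
for i, (primary, secondary) in enumerate(type_pairs):
    y_train[i, [primary, secondary]] = 1.0

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])

# Each sigmoid output is independent; one way to force exactly two
# predicted types is to keep the two highest probabilities per image.
probs = model.predict(X_test)
top2 = np.argsort(probs, axis=1)[:, -2:]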
2.3 Semantic segmentation:
This problem requires a different type of network architecture, called an autoencoder.

This network consists of convolutional layers that capture patterns from the pictures, paired with ReLU activations that apply non-linear transformations to the convolutions, followed by batch normalizations that facilitate the learning of the next layers.
The convolutional kernels are a consistent (3,3), mostly applying the same logic used in the previous tasks: we are still using the same pictures as input, so it is logical to reuse the kernel size that was successful before.
Padding is always "same" to preserve the first two dimensions through the convolutions, which makes it easier to follow what happens to the input throughout the network.

The number of filters in the convolutional layers progressively decreases from 32 to 16 to 8 throughout the network until it starts to reverse, combined with MaxPooling2D and UpSampling2D layers. This is standard autoencoder architecture, since the objective is to learn a compressed representation of the input, in this case a 64x64x3 tensor. The encoder transforms this tensor from its original 64x64x3 dimensions into 16x16x8: the third dimension is changed by the convolutional layers, while the first two are progressively halved by MaxPooling2D with pool size (2,2), which keeps only the most prominent activation in each window; applied after the convolutions, it retains the most prominent patterns.

Then, in the decoder, we perform the same convolutions in reverse order; since no "reverse MaxPooling2D" exists, we use UpSampling2D to re-increase the first two dimensions of the tensor. With all operations performed in reverse order, the first two dimensions return to the input's 64x64. For semantic segmentation there is no need for a concept of color, so a final convolutional layer with a single filter combines all the remaining channels into 1. With this, the network has the desired output dimensions, which for semantic segmentation are simply the input dimensions with 1 channel.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Input, MaxPooling2D, UpSampling2D)

autoencoder = Sequential([
    Input(shape=(64, 64, 3), name='inputs'),

    # Encoder
    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(16, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(16, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    Conv2D(8, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    # Decoder
    Conv2D(8, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    Conv2D(16, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    UpSampling2D(size=(2, 2)),

    Conv2D(16, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    UpSampling2D(size=(2, 2)),

    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    BatchNormalization(axis=-1),

    Conv2D(1, (3, 3), padding='same'),
    Activation('sigmoid'),
])
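As an aside (my own check, not from the report), UpSampling2D has no learned parameters; it simply repeats rows and columns, which is why it can undo the size reduction of the pooling layers:

import numpy as np
from tensorflow.keras.layers import UpSampling2D

x = np.arange(4, dtype='float32').reshape(1, 2, 2, 1)  # a batch of one 2x2 map
y = UpSampling2D(size=(2, 2))(x)
print(y.shape)  # (1, 4, 4, 1): every pixel repeated into a 2x2 block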
The choice of output activation was sigmoid because in semantic segmentation we are only trying to select the pixels that belong to the Pokemon: we can assign a sigmoid to every pixel of the output tensor, which decides whether the pixel belongs to the Pokemon or to the background by outputting a value close to 1 for the Pokemon and close to 0 for everything else. The loss function used was binary_crossentropy: every pixel is technically independent of every other pixel, so it is logical to use a loss that measures and adjusts each pixel's value independently while still measuring the overall accuracy of the predictions against the given masks.
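A minimal training-and-thresholding sketch; the variable names (X_train, mask_train, X_test) and the 0.5 cut-off are illustrative assumptions, not from the report:

autoencoder.compile(optimizer='adam',
                    loss='binary_crossentropy',
                    metrics=['binary_accuracy'])
autoencoder.fit(X_train, mask_train, validation_split=0.2,
                batch_size=32, epochs=50)

# Threshold the per-pixel sigmoid outputs to get a binary mask.
pred = autoencoder.predict(X_test)           # (n, 64, 64, 1), values in [0, 1]
binary_masks = (pred > 0.5).astype('uint8')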

3. Results

I selected the best models based on their validation accuracy and, especially, their validation loss. Here are the training results for the best model created for each task; their code is also in the section above.
This is a comparison between the segmentation mask in the dataset and the segmentation mask predicted by the network for some example Pokemon. In black are the pixels that belong to neither mask; in white, the pixels common to both masks. Red pixels are present in the dataset mask but missing from the predicted mask, and green pixels are the opposite.
This visualization was especially useful for finding the best model for the segmentation problem, which was the one with the least amount of green and red pixels.