Every time I imagine CNN something spills out from my brain and forces me to restart my learning. I guess it was because I wasn't doing practicals on CNN. Many guys basically look on CNN as a theory and that is where even I lost my way of learning. However, Coursera, Edx, and Udacity helped me to amplify my knowledge about CNN and those big words like pooling, strides etc. I am not a genius in CNN but yes I know something about CNN.
So What is CNN?
Convolution Neural Network is a branch of AI where features from images are gathered up and compared with the input data. It is basically a voting system where every pixel votes for the outcome and as usual the one with maximum votes win in this game and we get a result like this
So how does this happen is what comes in my mind first.
A CNN takes an image as input and converts them into arrays. Yes those numpy arrays are for the same !! Do remember that a CNN never matches the whole image instead it matches small features of images with the input image.
So let us pass an image of a number '3' to our CNN. The CNN will look at the image like this
The human eye can clearly see the digits 3 being displayed. However, it's not so easy for the machine to see the image. Now to extract out the details this CNN will multiply a weight Matrix to the above mattress that represents the number 3. After multiplying the weight Matrix the result will be like this
Fig - Reference Image of 3 to classify
Comparing both the images side by side you will recognize that the following image represents more features and detail than the previous image
model.add(32, (3, 3), input_shape = (3, width, height))).
Here we have 32 filters of size 3 X 3.
*** To clear out the confusion I will say it again. The CNN has been trained on a variety of images of 3. It now knows the features so what CNN does is that it will grab a filter with "trained" details and will match "that with the validation data" i.e the above image of '3' is an image that we want to check. The network hasn't seen this image before. The filter contains the cell data from the trained image. ***
So here I have taken a 3 X 3 filter that knows the feature of '3' which will repartition the cell by a rule which state mark all cells 1 whose value is greater than 128 and 0 whose value is less than 129. Thus our filter belonging to the top left corner will represent like this
Now, this filter will move across every part of the reference image ('3') matrix to match and find the feature pixel by pixel.
So how it's done?
Each cell content from the filter will be multiplied with the similar cell in the image that we want to classify.
After multiplication and addition, we get something like this
Moving this filter at every step across the reference image we get the following information
Voila !. The filter matches itself with every part of the reference image and outputs the probability of value in each cell. Do notice that the fourth column of reference 3 image has been neglected by the filter in the fourth column of the filtered image. It is because our filter is 3 X 3 thus square in shape and every 3 X 3 matrix in the reference image has to match this filter whereas the fourth column doesn't. Similarly, we place 32 filters across the reference image and extract out the features. Every filter will be from a trained image and it will move across the reference image and compare each pixel. The pixel which passes through the filter gets a higher probability thus, it gets darker. On the contrary, the pixel which cannot pass the filter is lightened.
We can clearly see how the red circled pixels match the reference image and also the filter.
After applying 3 more filters you will get this as the output
This methodology of stacking various filters containing a bunch of features in them, over an image is called a Convolution Layer. Thus each image is a stack of various filtered images. Moving the filter across the whole image, we get the information about the location of the pixels.
Coming to the input shape of the image, the input shape is represented in Keras as (3, width, height) or by (width, height, 3). So to find the width just calculate the no. of pixels horizontally of the reference 3 image. You will get 4. Similarly calculate the no. of pixels vertically. You will get 24. Now each pixel is represented in form of RGB. Thus, it has 3 channels. So our input shape for the CNN will be (3, 4, 24) or (4, 24, 3).
I will deal with Max Pooling, Padding and Strides in the next post.