How to Identify Dog Breeds with Convolutional Neural Networks

kai cen
Oct 6, 2021

github repo: https://github.com/cenkai88/udacity-dog-breed-capstone

Problem Statement

There are around 500 breeds of Canis lupus familiaris, the domestic dog. It is extremely difficult for a person to recognize every dog's breed unless they are a zoologist, but with the power of convolutional neural networks we can build an algorithm that identifies a large number of dog breeds in a few seconds.

Project Overview

This project is part of the Data Science Nanodegree Capstone Program and works with user-supplied images of dogs and humans. I built a CNN (Convolutional Neural Network) pipeline to categorize images. If the image contains a dog, the algorithm estimates its breed; if supplied an image of a human, the code identifies the most resembling dog breed. If there is neither a dog nor a human, the code tells the user so. The model provides a simple example of image content recognition and identification; given different training data, it could be used to identify other items, such as other animals or plants.

Besides the CNN models for classification, I also built a Flask backend that serves the saved CNN model, as well as a front-end web page that lets users upload images.

This project consists of three sections:

1. Build a ML pipeline to train the CNN model to classify dog breeds and export the saved data.

2. Build backend APIs with flask to support the web app.

3. Run a web app where users can upload images and see the results.

Metrics

The metric used to measure the performance of the models is classification accuracy, reported for all three models (human detection, dog detection and dog breed classification).

Accuracy is the proportion of correct predictions among the total number of cases examined. It is a simple and reasonable metric here: with 133 breeds to classify, a single "false negative"/"false positive" framing does not apply cleanly, and the breeds are roughly balanced, so overall accuracy captures performance well.
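For reference, a minimal sketch of how this accuracy is computed over a set of predictions, assuming one-hot encoded targets and a model that outputs per-breed probabilities:

```python
# Minimal sketch: accuracy as the fraction of images whose predicted
# breed index matches the true breed index (one-hot targets assumed).
import numpy as np

def accuracy(predictions, targets):
    predicted_breeds = np.argmax(predictions, axis=1)  # most likely breed per image
    true_breeds = np.argmax(targets, axis=1)           # ground-truth breed per image
    return np.mean(predicted_breeds == true_breeds)
```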

Data Exploration & Data Visualization

The datasets were provided by Udacity.

We have 13,233 human images in total and 8,351 dog images across 133 breed categories. The dog dataset is split into train, validation and test subsets.

As shown in the bar chart, the dog breeds in the training set are generally balanced, but some breeds have fewer than 30 images while others have more than 70. If this slight imbalance turns out to be a problem for our model, we can over-sample the minority classes to address it.

Data Preprocessing & Implementation

  • Detect Humans

Before detecting humans, we need to load the XML file with the pre-trained Haar feature-based cascade face detector. After that, it is standard procedure to convert the image to grayscale. The detectMultiScale function of the loaded detector then returns the bounding boxes of any detected faces; if that list is non-empty, a human face is present.

We can use this procedure to write a function that returns True if a human face is detected in an image and False otherwise. This function, aptly named face_detector, takes a string-valued file path to an image as input and appears in the code block below.
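A minimal sketch of that function, assuming the cascade XML sits in a haarcascades/ folder as in the Udacity starter project:

```python
# Sketch of the Haar-cascade face detector; the XML path is an assumption
# based on the layout of the Udacity starter project.
import cv2

face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

def face_detector(img_path):
    """Return True if at least one human face is detected in the image."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns the bounding boxes of all detected faces
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0
```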

As the result shows, 100/100 human faces were correctly detected, while 11/100 dog images were also mistakenly detected as human.

To decrease the error rate, I built another face detection algorithm, a CNN trained on the LFW dataset, and got 91.00% accuracy. You can check the IPython notebook in the GitHub repo for more details.
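The exact architecture lives in that notebook; purely as an illustration of the idea, a small binary human/non-human CNN could look like the sketch below (the layer sizes are assumptions, not the repo's actual configuration):

```python
# Illustrative sketch only: a small binary classifier (human face vs. not)
# that could be trained on LFW-style crops; not the repo's exact model.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

face_cnn = Sequential([
    Conv2D(16, 3, activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(2),
    Conv2D(32, 3, activation='relu'),
    MaxPooling2D(2),
    GlobalAveragePooling2D(),
    Dense(1, activation='sigmoid'),  # probability that the image contains a human face
])
face_cnn.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```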

  • Detect Dogs

When using TensorFlow as the backend, Keras CNNs require a 4D array (which we'll also refer to as a 4D tensor) as input, with shape (nb_samples, rows, columns, channels), where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels of each image, respectively.

The path_to_tensor function takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image of 224×224 pixels. Next, the image is converted to an array, which is then reshaped into a 4D tensor. Since we are working with color images, each image has three channels, and since we are processing a single image (or sample) at a time, the returned tensor always has shape (1, 224, 224, 3).

The paths_to_tensor function takes a NumPy array of string-valued image paths as input and returns a 4D tensor with shape (nb_samples, 224, 224, 3). Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image).
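A sketch of the two helpers, following the standard Keras image utilities (tqdm is used only for a progress bar):

```python
# Sketch of the preprocessing helpers described above.
import numpy as np
from keras.preprocessing import image
from tqdm import tqdm

def path_to_tensor(img_path):
    # load the image and resize it to 224x224 pixels
    img = image.load_img(img_path, target_size=(224, 224))
    # convert to a 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # add a leading sample dimension -> (1, 224, 224, 3)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    # stack the single-image tensors into one (nb_samples, 224, 224, 3) tensor
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
```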

We use the predict method to extract the predictions: it returns an array whose i-th entry is the model's predicted probability that the image belongs to the i-th ImageNet category. This is implemented in the ResNet50_predict_labels function below.

The categories corresponding to dogs appear in an uninterrupted sequence, keys 151–268 inclusive, covering everything from 'Chihuahua' to 'Mexican hairless'. So, if the function returns any number between 151 and 268, the supplied image contains a dog. The dog_detector function below works similarly to the face_detector above.
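A sketch of both functions, assuming the ImageNet weights bundled with Keras and the path_to_tensor helper defined above:

```python
# Sketch of the ImageNet-based dog detector.
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input

ResNet50_model = ResNet50(weights='imagenet')

def ResNet50_predict_labels(img_path):
    # preprocess the image and return the index of the most likely ImageNet category
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))

def dog_detector(img_path):
    # ImageNet categories 151-268 (inclusive) are all dog breeds
    prediction = ResNet50_predict_labels(img_path)
    return 151 <= prediction <= 268
```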

The result shows that 0% of human images were detected as dogs, while 100% of dog images were detected as dogs.

  • Create a CNN to Classify Dog Breeds (from Scratch)

Before modeling, we rescaled the images by dividing every pixel in every image by 255.

I chose the suggested architecture with three convolutional layers to extract different features from the images. Dropout provides redundancy in the network to avoid overfitting, average pooling decreases the spatial dimensions, and a dense layer performs the multi-class classification.

This model takes 4D tensors of shape (nb_samples, 224, 224, 3) and outputs an array of 133 probabilities (one per breed). The optimizer used was RMSProp and the metric used was accuracy. The model was trained for 20 epochs and reached an accuracy of 5.50%.
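A sketch of such a from-scratch model in Keras; the filter counts, pooling choices and dropout rate here are illustrative assumptions rather than the exact values from the notebook:

```python
# Sketch of the from-scratch CNN; layer sizes are assumptions.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout, Dense

model = Sequential([
    Conv2D(16, 2, activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(2),
    Conv2D(32, 2, activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, 2, activation='relu'),
    MaxPooling2D(2),
    Dropout(0.3),                      # redundancy against overfitting
    GlobalAveragePooling2D(),          # collapse spatial dimensions
    Dense(133, activation='softmax'),  # one probability per breed
])

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# train_files / train_targets are assumed names for the training image paths and labels;
# pixel values are rescaled to [0, 1] as described above.
# train_tensors = paths_to_tensor(train_files).astype('float32') / 255
# model.fit(train_tensors, train_targets, epochs=20, batch_size=20)
```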

Refinement

  • Use a CNN to Classify Dog Breeds

To reduce training time and improve classification accuracy, I trained a CNN using transfer learning, with the pre-trained VGG-16 model as a fixed feature extractor: the last convolutional output of VGG-16 is fed as input to the model. Only a global average pooling layer and a fully connected layer are added, where the latter contains one node for each dog category and is equipped with a softmax activation.
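A sketch of this VGG-16 transfer model, assuming the pre-computed bottleneck features are stored in an .npz file named as in the Udacity starter project:

```python
# Sketch of the VGG-16 transfer model; the bottleneck file name is an assumption.
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense

bottleneck_features = np.load('bottleneck_features/DogVGG16Data.npz')
train_VGG16 = bottleneck_features['train']

VGG16_model = Sequential([
    # pool the (7, 7, 512) VGG-16 feature maps into a 512-dimensional vector
    GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]),
    Dense(133, activation='softmax'),  # one node per dog category
])

VGG16_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```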

It turned out the accuracy is 44.7368%.

  • Create a CNN to Classify Dog Breeds (using Transfer Learning)

Based on the result of the previous step, I found that transfer learning is a very effective way to improve accuracy. In this section, I leveraged the bottleneck features from the ResNet-50 model.

The output of the last convolutional layer of the ResNet-50 model is fed as input to my model. A Flatten layer transforms the tensor into a one-dimensional vector, a dense layer further refines the extracted features, and a final layer with one node for each dog breed applies a softmax activation function.
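A sketch of this ResNet-50 transfer model; the bottleneck file name and the size of the intermediate dense layer are assumptions:

```python
# Sketch of the ResNet-50 transfer model; file name and 256-unit layer are assumptions.
import numpy as np
from keras.models import Sequential
from keras.layers import Flatten, Dense

bottleneck_features = np.load('bottleneck_features/DogResnet50Data.npz')
train_Resnet50 = bottleneck_features['train']

Resnet50_model = Sequential([
    Flatten(input_shape=train_Resnet50.shape[1:]),  # flatten the bottleneck features
    Dense(256, activation='relu'),                  # refine the extracted features
    Dense(133, activation='softmax'),               # one node per dog breed
])

Resnet50_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```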

Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second, related task. ResNet-50 is a go-to network for basic image classification tasks and is trained on ImageNet data, which includes 118 different dog breeds. We chose ResNet-50 because it is faster to train, with far fewer trainable parameters than VGG-16.

The final accuracy of the transfer-learning model based on ResNet-50 is 77.39%.

  • Wrap it up

To wrap up the whole pipeline and prepare for the app development, I wrote a function that accepts a file path to an image and first determines whether the image contains a human, a dog, or neither.
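A sketch of that wrapper; classify_image and Resnet50_predict_breed are hypothetical names standing in for the actual functions in the notebook:

```python
# Sketch of the final pipeline; function names are illustrative assumptions.
def classify_image(img_path):
    if dog_detector(img_path):
        # a dog was found: report the predicted breed
        return 'This looks like a dog of breed: {}'.format(Resnet50_predict_breed(img_path))
    if face_detector(img_path):
        # a human was found: report the most resembling breed
        return 'This human resembles the breed: {}'.format(Resnet50_predict_breed(img_path))
    return 'Neither a dog nor a human was detected in the image.'
```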

Model Evaluation and Validation

Compared with the VGG model, ResNet-50 is the final choice since it reaches much higher accuracy (77%) than VGG (44%). The reason is likely that ResNet-50 is much deeper than VGG-16, as the name refers to a 50-layer ResNet.

Also, thanks to transfer learning, we needed just 20 iterations (epochs) for the ResNet-50 model's loss to converge, and we got a quite satisfying accuracy on the test dataset, showing its ability to generalize.

I tested the ResNet-50 model in several different scenarios, and it correctly dealt with the dog, human and bird images.

Justification

For human face detection, we used a traditional machine learning algorithm from OpenCV (Haar cascades), which misrecognized 11/100 dogs as humans. I then built a CNN binary classification model on the LFW dataset and got 91.00% accuracy, showing the CNN's advantage over the traditional Haar cascade algorithm when dealing with images that have cluttered backgrounds or non-standardized views.

For dog breed classification, we first tried to train a CNN model from scratch, but training was slow and accuracy was only around 5%. Transfer learning solved both the long training time and the low accuracy, since it reuses pre-trained features (stored in an .npz file) learned from a much larger dataset.

Reflection

For the entire project, we built or used three models:

  • one CNN model for human face detection, trained on the LFW dataset (91.00% accuracy).
  • one ResNet-50 model for dog detection, pre-trained on the very large ImageNet dataset.
  • one CNN model for dog breed classification, trained with transfer learning using ResNet-50 bottleneck features (77.39% accuracy).

In the end, we have a pipeline that works well in recognizing dogs and humans. It is clear that we can obtain decent results with the adopted techniques in an acceptable training time.

A simple Flask web application was built to leverage the model and predict breeds from user-uploaded images: https://github.com/cenkai88/udacity-dog-breed-capstone.

Improvement

The model works as expected in most cases with good accuracy, but sometimes Huskies were detected as Alaskan Malamutes.

Possible improvement points:

1. Use data augmentation techniques, including horizontal reflections and mean subtraction, to help the model generalize (see the sketch after this list).

2. Apply only the detected dog region of the image to the algorithm, since user-uploaded images often include background that might affect the recognition.

3. Try and evaluate more pre-trained models like Xception.
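For point 1, a sketch of how such augmentation could be wired up with Keras' ImageDataGenerator; the specific parameter values are illustrative assumptions:

```python
# Sketch of data augmentation with horizontal reflections and mean subtraction;
# parameter values are illustrative assumptions.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=True,   # mean subtraction (requires datagen.fit on the training tensors)
    horizontal_flip=True,      # horizontal reflections
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
)

# datagen.fit(train_tensors)  # computes the dataset mean used by featurewise_center
# model.fit_generator(datagen.flow(train_tensors, train_targets, batch_size=32), epochs=20)
```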
