-
Image recognition and classification have been successfully applied in various domains, such as face recognition[1, 2] and scene understanding for autonomous driving[3]. At present, human face identification is successfully used for authentication and security purposes in many applications. There have therefore been attempts to extend these studies from human to animal recognition. Dogs in particular are among the most common animals. Since there are more than 180 dog breeds, dog breed recognition can be an essential task for providing proper training and health treatment. Previously, dog breed recognition was done by human experts. However, some dog breeds can be challenging to evaluate because experts are scarce and the breeds' patterns are themselves difficult to tell apart. Each evaluation also takes time.
Several studies have used dog images to identify breeds. Chanvichitkul et al.[4] proposed a coarse-to-fine classification that groups similar face contours as the coarse step and then applies a principal component analysis (PCA) classifier within the output group as the fine step. Prasong et al.[5] extended the coarse-to-fine classification by adding local parts to reduce misclassification within the same group. This method used normalized cross correlation (NCC) to find each local part, such as ears and face. The dog breeds were then classified in the PCA subspaces. It improved the runtime by a factor of four and yielded an accuracy of 88% for 35 dog breeds.
Furthermore, a combination of shape and appearance features such as the histogram of oriented gradients (HOG)[6] and the scale-invariant feature transform (SIFT)[7] was used to classify breeds of cats and dogs[8]. The model achieved 69% accuracy in identifying 37 breeds of cats and dogs. Similarly, Liu et al.[9] reported an accuracy of 67% for 133 dog breeds from the Columbia Dogs Dataset by combining SIFT descriptors and color histograms in an SVM classifier with landmark data. Lai et al.[10, 11] introduced a deep learning method based on transfer learning with convolutional neural networks (CNNs) and achieved 86.63% accuracy on the same dataset as in [9].
Most of the previous works used hand-crafted features, which struggle to discriminate between a large number of breeds. These selected features are limited to certain types and may not contain sufficient information to separate the breeds. Unlike conventional techniques, deep learning can learn different features from the original images during training and has achieved significant results, which are explored in this work. In summary, on the public Columbia Dogs Dataset[9], our proposed method achieves the highest accuracy, 89.92%, compared with the existing methods in [9-11].
Our main contributions are twofold. First, we propose a convolutional neural network (CNN) based model for dog breed identification. To mitigate overfitting and class imbalance, we also apply transfer learning and data augmentation techniques such as cropping, translating, and rotating the original image. Three CNN architectures, namely MobileNetV2, InceptionV3, and NASNet, are trained on dog face images and evaluated. In our view, these results represent excellent baselines for further studies in dog identification using pre-trained models with some fine-tuning. Second, the results from our model provide preliminary findings on which areas of dog face images are used for classification. The key features that discriminate dog breeds are located around the eyes and nose.
The rest of this paper is organized as follows. Section 2 describes the applied technologies and approaches. Section 3 describes the proposed model. The experiments are described and discussed in Section 4. Finally, the conclusion is drawn in Section 5.
-
Over the decades, computer vision techniques have been developed and have achieved significant performance thanks to deep learning approaches[12], e.g., convolutional neural networks (CNNs). According to the ImageNet Challenge[13], InceptionV3[14] was the first runner-up for image classification in ImageNet 2015 and was published at CVPR 2016. It is an upgraded version of GoogLeNet[15] that reduces computational complexity. The main concept is to factorize convolutions into smaller convolutions. Fig. 1 shows an overview of the InceptionV3 architecture. MobileNetV2[16] aims to be a lightweight model. It is based on an inverted residual structure and depthwise separable convolutions to reduce the complexity and model size. NASNet[17] is a current state-of-the-art model for ImageNet classification, with 82.70% top-1 accuracy. The architecture is based on the neural architecture search (NAS) framework. The main concept is to search for the best convolutional layer (cell) on a smaller dataset (i.e., CIFAR-10) and to apply this layer to larger data such as ImageNet by stacking copies of it.
Figure 1. Inception V3 architecture. Conv represents a convolutional layer and FC represents a fully connected layer. Colored figures are available in the online version.
However, training an entire CNN from scratch is known to have a high computational cost because the network has several convolutional layers connected to fully connected layers. It also requires a lot of data to achieve good accuracy and reduce overfitting. Transfer learning is a method for reducing these limitations. Transfer learning[17-19] refers to the transfer of existing weights from networks pre-trained on a large dataset such as ImageNet[20] or COCO[21]. The main purpose is to reuse the parameters in the feature extraction layers to produce feature vectors instead of training them. The fully connected layers (classification layers) of the model are then replaced with new fully connected layers for the new dataset. Therefore, we can reduce computation costs by training only the new classification layers. In this project, dog face images are the input and dog breeds are the classification output. We compare three pre-trained networks, InceptionV3, MobileNetV2, and NASNet, to find the most suitable network for this task. This could be a good guideline for other similar tasks.
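The idea of training only the new classification layers can be written compactly. The notation below is an assumption introduced for illustration and is not part of the original description: $f_{\phi^{*}}$ denotes the pre-trained feature extraction layers with frozen weights $\phi^{*}$, $g_{\theta}$ the new fully connected layers, and $(x_i, y_i)$ a dog face image with its breed label among $K = 133$ classes. The cross-entropy loss is also assumed, since the text does not name the loss function.

```latex
\hat{\theta} = \arg\min_{\theta} \; -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(g_{\theta}(f_{\phi^{*}}(x_i))_{y_i}\big)}
            {\sum_{k=1}^{K} \exp\!\big(g_{\theta}(f_{\phi^{*}}(x_i))_{k}\big)},
\qquad \phi^{*} \text{ fixed}, \quad K = 133 .
```

Only $\theta$ is optimized, which is why the cost is much lower than training the whole network.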
-
Although deep learning has outperformed other approaches in computer vision tasks, it requires a lot of training data to avoid overfitting. In the real world, data is limited for various reasons and may be imbalanced between classes. For instance, some breeds have fewer images than others because of their conservation status. Therefore, several techniques have been used to overcome this limitation, including dropout[22], transfer learning[18, 19, 23, 24], batch normalization[25], and data augmentation[26, 27]. In this paper, we apply transfer learning and data augmentation to reduce the overfitting problem.
Data augmentation is an approach to artificially increase the amount of training data by data warping or oversampling. Data warping is a technique that directly augments the existing images by performing geometric and color transformations such as cropping, translating, and rotating the image. The augmented image therefore preserves the same label as the input image, as shown in Fig. 2. Oversampling augmentation is another approach, in which a new image is created by mixing images or by using generative adversarial networks (GANs)[28]. In this work, we use data warping augmentation to increase the number of training images. Details of the settings are explained in Section 3.
-
The proposed framework for dog breed classification is shown in Fig. 3. It consists of three main phases: data preparation, training, and testing. Since we focus on dog face images, a data preparation step is required. The prepared data is then split for the training process and the testing process. The output of the training phase is a dog breed model, which is used for breed classification and model evaluation. Details are explained in the following subsections.
-
In this study, we use a public dataset to evaluate our method. The Stanford Dogs Dataset[29] and the Columbia Dogs Dataset[9] are public datasets for dog breed classification. We employ the Columbia Dogs Dataset in this study. It contains 8351 dog images of 133 breeds recognized by the American Kennel Club, with eight part locations annotated for each image. Sample images are shown in Fig. 4(a). The original images require some pre-processing, such as cropping and rescaling, to extract dog faces, as shown in Fig. 5. The pre-processed data is then split into a training set and a testing set. The training set is augmented using data warping techniques such as rotation, flipping, and adding noise.
-
In this paper, the dog breed classification model is constructed using transfer learning. With transfer learning, we can train the model on a small dataset by reusing existing CNNs pre-trained on a large dataset such as ImageNet. Fig. 6 shows an overview of the dog breed classification model with InceptionV3 as the pre-trained model. The model takes dog face images as the input and creates CNN features using the ImageNet pre-trained weights. It then retrains the last fully connected layers on our dog breed data to build a new classifier.
Figure 6. An overview of transfer learning using the Inception V3 model. Transfer learning uses the feature extraction part of a trained model and retrains a new classifier on the top layers.
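The structure in Fig. 6 can be sketched with the Keras API that ships with TensorFlow. This is a minimal sketch under stated assumptions, not the authors' training script: the function name `build_breed_classifier`, the constants `INPUT_SHAPE` and `NUM_BREEDS`, the single dense layer, the Adam optimizer, and the loss function are all illustrative choices that the text does not specify.

```python
INPUT_SHAPE = (299, 299, 3)  # default InceptionV3 input size in Keras
NUM_BREEDS = 133             # breeds in the Columbia Dogs Dataset


def build_breed_classifier(num_breeds=NUM_BREEDS, weights="imagenet"):
    """Frozen InceptionV3 feature extractor with a new softmax classifier."""
    # Imported here so that defining the function does not require TensorFlow.
    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(
        include_top=False,        # drop the original 1000-class ImageNet head
        weights=weights,          # "imagenet" downloads pre-trained weights
        input_shape=INPUT_SHAPE,
        pooling="avg",            # global average pooling -> one feature vector
    )
    base.trainable = False        # reuse the feature extraction layers as-is

    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    # InceptionV3 expects pixels scaled from [0, 255] to [-1, 1].
    x = tf.keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)   # keep batch-norm statistics frozen
    outputs = tf.keras.layers.Dense(num_breeds, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",                        # assumed; not given in the text
        loss="sparse_categorical_crossentropy",  # assumed; integer breed labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_breed_classifier()` and then `model.fit` on batches of face images with integer breed labels would update only the final dense layer, because the base network is frozen.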
To test the dog breed classification model, we use the testing set created in the data preparation phase. Dog face images in the testing set are fed into the dog breed model produced by the training phase. The model output is a predicted dog breed. All experiment settings and results are explained in Section 4.
-
The proposed method is evaluated using two main scenarios for creating the training set: 1) without augmentation, and 2) with various augmentation settings. In our experiments, we use the Columbia Dogs Dataset[9], whose images are pre-processed and cropped to faces. After pre-processing, 8111 images are selected and split into training and testing sets. The training set contains 6781 images, and the testing set consists of 10 images per breed, for a total of 1330 images. Each setting is evaluated using three models pre-trained on the ImageNet dataset: MobileNetV2, InceptionV3, and NASNet. We retrain the networks using the TensorFlow library.
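The split described above (10 test images per breed, the rest for training) can be sketched as follows. The function name `split_per_breed`, the `(path, breed)` sample format, and the use of a seeded random shuffle are assumptions for illustration; the text does not say how the 10 test images per breed were chosen.

```python
import random
from collections import defaultdict


def split_per_breed(samples, test_per_breed=10, seed=0):
    """Hold out a fixed number of images per breed for testing.

    `samples` is a list of (image_path, breed) pairs.
    """
    rng = random.Random(seed)
    by_breed = defaultdict(list)
    for path, breed in samples:
        by_breed[breed].append(path)

    train, test = [], []
    for breed in sorted(by_breed):
        paths = by_breed[breed][:]
        rng.shuffle(paths)
        test += [(p, breed) for p in paths[:test_per_breed]]
        train += [(p, breed) for p in paths[test_per_breed:]]
    return train, test


# Toy data: 3 hypothetical breeds with 25 images each.
toy = [(f"breed{b}/img{i}.jpg", f"breed{b}") for b in range(3) for i in range(25)]
train_set, test_set = split_per_breed(toy)
print(len(train_set), len(test_set))  # 45 30
```

With 133 breeds the same rule gives 133 × 10 = 1330 test images, which leaves 8111 − 1330 = 6781 training images, matching the counts reported above.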
Since our training set is small relative to the number of classes, we augment it to increase the number of images and improve the performance. We apply data warping augmentation to the training set and compare the performance of several settings, including rotation, translation, and adding noise. The magnitude of each transformation is chosen based on the transformations that could plausibly occur in real images. For example, the rotation does not exceed 45 degrees, and the translation does not exceed half of the image, as shown in Fig. 5. Then we randomly select 200 images per breed as our training set.
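The three augmentation settings and their bounds can be illustrated with a small dependency-free sketch. It is not the authors' pipeline: images are represented as 2-D lists of grayscale values in [0, 1], rotation uses nearest-neighbour sampling, and the function names and the noise level `sigma` are hypothetical choices. Only the 45-degree and half-image limits come from the text.

```python
import math
import random


def rotate(img, degrees, fill=0.0):
    """Rotate a 2-D image about its centre using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sx = round(cos_t * (x - cx) + sin_t * (y - cy) + cx)
            sy = round(-sin_t * (x - cx) + cos_t * (y - cy) + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def translate(img, dx, dy, fill=0.0):
    """Shift an image by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def add_noise(img, sigma, rng):
    """Add Gaussian noise and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def augment(img, kind, rng, max_angle=45.0, max_shift_frac=0.5, sigma=0.05):
    """Apply one randomly parameterised transformation of the given kind."""
    h, w = len(img), len(img[0])
    if kind == "rotation":
        return rotate(img, rng.uniform(-max_angle, max_angle))
    if kind == "translation":
        max_dx, max_dy = int(w * max_shift_frac), int(h * max_shift_frac)
        return translate(img, rng.randint(-max_dx, max_dx),
                         rng.randint(-max_dy, max_dy))
    if kind == "noise":
        return add_noise(img, sigma, rng)
    raise ValueError(f"unknown augmentation: {kind}")
```

Each transformation returns an image of the same size, so the augmented image can keep the breed label of its source image.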
As shown in Table 1, training sets containing rotated and translated images achieve higher accuracy for dog breed classification than the baseline without augmentation, except for NASNet with translation (88.87% versus 89.10%). The NASNet model achieves the highest accuracy in every setting except translation, where InceptionV3 is slightly higher (89.02% versus 88.87%). We achieve the best accuracy of 89.92% using NASNet with the training set containing rotated images. Fig. 7 shows the confusion matrix of this model.
| Augmentation technique | MobileNetV2 accuracy (%) | InceptionV3 accuracy (%) | NASNet accuracy (%) |
|---|---|---|---|
| Without | 80.82 | 87.50 | 89.10 |
| Rotation | 81.65 | 88.42 | 89.92 |
| Translate | 81.65 | 89.02 | 88.87 |
| Noise | 80.30 | 85.94 | 88.80 |

Table 1. Accuracy of dog breed classification from different CNN models
Figure 7. Confusion matrix for the highest accuracy (89.92%), obtained using the NASNet model and the training set containing rotated images. The breed names, listed from bottom to top on the y-axis and from left to right on the x-axis, follow the alphabetical order of the breeds' names in the Columbia Dogs Dataset.
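A confusion matrix like the one in Fig. 7 can be built from the true and predicted labels of the testing set. The sketch below is illustrative only: the toy predictions are invented, and placing true breeds on rows and predicted breeds on columns is an assumed convention that may differ from the axes of Fig. 7.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are true breeds, columns are predicted breeds."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true breed that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Toy example with three breeds in alphabetical order (made-up predictions).
labels = ["Havanese", "Lowchen", "Pug"]
y_true = ["Havanese", "Havanese", "Lowchen", "Lowchen", "Pug", "Pug"]
y_pred = ["Havanese", "Lowchen", "Lowchen", "Havanese", "Pug", "Pug"]

cm = confusion_matrix(y_true, y_pred, labels)
overall = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
print(cm)                      # [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
print(per_class_accuracy(cm))  # [0.5, 0.5, 1.0]
```

Off-diagonal counts show which pairs of breeds are confused with each other, while the diagonal divided by the row sum gives the per-breed accuracy.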
In addition, we evaluate our best setting using 10-fold cross-validation, as reported in Table 2. We achieve an average classification rate of 89.74% with a standard deviation of 1.07. Figs. 8 and 9 show the average classification accuracy and standard deviation for each breed. Our results show that the models can recognize most breeds with overall accuracy above 80%.
| K fold | Accuracy (%) | SD |
|---|---|---|
| 1 | 90.52 | 18.68 |
| 2 | 88.66 | 21.85 |
| 3 | 91.25 | 17.93 |
| 4 | 87.67 | 20.57 |
| 5 | 90.01 | 14.46 |
| 6 | 88.90 | 18.35 |
| 7 | 90.14 | 18.66 |
| 8 | 89.52 | 17.94 |
| 9 | 90.63 | 19.51 |
| 10 | 90.14 | 17.05 |
| Avg±SD | 89.74±1.07 | |

Table 2. Accuracy of 10-fold cross-validation using the NASNet model on the training set with rotation images
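The summary row of Table 2 can be reproduced from the ten fold accuracies. The short check below uses only the values listed in the table; the variable names are illustrative.

```python
import statistics

# Fold accuracies (%) copied from Table 2.
fold_accuracy = [90.52, 88.66, 91.25, 87.67, 90.01,
                 88.90, 90.14, 89.52, 90.63, 90.14]

mean = statistics.mean(fold_accuracy)
sd = statistics.stdev(fold_accuracy)  # sample SD (n - 1 in the denominator)
print(f"{mean:.2f} ± {sd:.2f}")  # 89.74 ± 1.07
```

The reported 1.07 corresponds to the sample standard deviation across folds; the population formula (`statistics.pstdev`) would give about 1.01 instead.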
-
In a previous study on image classification, NASNet achieved the highest accuracy on the ImageNet dataset[17]. In our results in Table 1, the models rank in a similar order, NASNet, InceptionV3, and MobileNetV2, for every training set except translation, where InceptionV3 slightly outperforms NASNet. This suggests that the NASNet architecture is currently a good fit for our task. Although augmentation brings some improvements, we observe that adding noise reduces the performance. Noise can decrease the quality of images and could confuse the model. On the other hand, we found that rotating images can improve classification performance because testing images could be taken from various angles.
Table 3 compares our proposed method with previous studies on the same dataset. Our method achieves the highest accuracy using NASNet with the augmented training set, whereas the earliest method[9] relied on traditional hand-crafted feature techniques, which struggle to discriminate between a large number of breeds.
| Model | Accuracy |
|---|---|
| Liu et al.[9] | 67.00 |
| BreedNet | 86.63 |
| NASNet with augmented data | 89.92 |

Table 3. Accuracy of dog breed classification from different CNN models
During training, the CNN layers in the network are updated by backpropagation, driven by the optimizer and its loss function. Several features are generated over these iterations. To understand how the model distinguishes between dog breeds, we visualize a heatmap from the last feature extraction layer that shows which parts of an image are used for classification based on higher weights, as shown in Fig. 10. The heatmap illustrates that the discriminative areas used in the classification are located in the center of an image, which contains the alignment between eyes and nose, their patterns, and their textures. The rest of the image is disregarded. As a result, there is some confusion between breeds that have similar appearances and alignments of these facial components, e.g., Lowchen/Havanese and Cardigan Welsh Corgi/Pembroke Welsh Corgi (Fig. 11). Since these breeds are similar or come from the same origin, their faces alone might not be sufficient to distinguish them.
Figure 10. Original image (left) and its heatmap (right) generated from the final feature layer of the model. The yellow spots indicate higher weights.
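The text does not say how the heatmap is computed. One common way to obtain such a map, assumed here purely for illustration, is a class-activation-style weighted sum of the final feature maps, keeping only positive evidence and scaling to [0, 1]. The function name, the toy feature maps, and the weights below are all hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W), scaled to [0, 1]."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(h):
            for x in range(w):
                heat[y][x] += weight * fmap[y][x]
    # Keep positive evidence only, then normalise by the peak value.
    heat = [[max(0.0, v) for v in row] for row in heat]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Two toy 3 x 3 feature maps: one fires at the centre, one at a corner.
maps = [
    [[0, 0, 0], [0, 2, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
]
heat = class_activation_map(maps, class_weights=[1.0, 0.5])
print(heat[1][1], heat[0][0])  # 1.0 0.25
```

In this toy case the centre pixel receives the largest value, mirroring the observation that the model concentrates on the central region around the eyes and nose.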
This study has demonstrated that deep learning can identify dog breeds from a dog face image. These results represent an initial step toward animal identification. However, improving the classification remains a challenge. In future work, we will focus on cross-breed dogs by combining other cues, such as the body's shape, color, and texture, when training the breed classification model.
-
This paper proposes a method for identifying dog breeds from their face images with a deep learning-based approach. The proposed method applies transfer learning with pre-trained CNNs and image augmentation to improve accuracy. The experiments examine three CNN models: MobileNetV2, InceptionV3, and NASNet. Each model is trained using training data containing image augmentation, including rotation, translation, and random noise. The NASNet model with a training set containing rotated images achieves the highest accuracy of 89.92%. Rotation can help with the alignment of images because the model mainly focuses on the center part of images. Overall, the proposed method achieves promising performance, with classification accuracy above 80% in all settings. Accuracy can be further improved with augmented datasets, such as those using rotation and translation.
-
This research was supported by the Royal Golden Jubilee (RGJ) Ph.D. Programme under the Thailand Research Fund (No. PHD/0053/2561).
Knowing Your Dog Breed: Identifying a Dog Breed with Deep Learning
- Received: 2020-06-04
- Accepted: 2020-09-30
- Published Online: 2020-11-13
-
Key words: Computer vision, deep learning, dog breed classification, transfer learning, image augmentation
Abstract: Dog breed identification is essential for many reasons, particularly for understanding individual breeds' conditions, health concerns, interaction behavior, and natural instinct. This paper presents a solution for identifying dog breeds using images of their faces. The proposed method applies a deep learning based approach to recognize the breeds. The method begins with transfer learning, retraining existing pre-trained convolutional neural networks (CNNs) on a public dog breed dataset. Image augmentation with various settings is then applied to the training dataset to improve the classification performance. The proposed method is evaluated using three different CNNs with various augmentation settings and comprehensive experimental comparisons. The proposed model achieves a promising accuracy of 89.92% on the published dataset with 133 dog breeds.
Citation: Punyanuch Borwarnginn, Worapan Kusakunniran, Sarattha Karnjanapreechakorn, Kittikhun Thongkanchorn. Knowing Your Dog Breed: Identifying a Dog Breed with Deep Learning. International Journal of Automation and Computing, 2021, 18(1): 45-54. doi: 10.1007/s11633-020-1261-0