Volume 17 Number 2
March 2020
Kittinun Aukkapinyo, Suchakree Sawangwong, Parintorn Pooyoi and Worapan Kusakunniran. Localization and Classification of Rice-grain Images Using Region Proposals-based Convolutional Neural Network. International Journal of Automation and Computing, vol. 17, no. 2, pp. 233-246, 2020. doi: 10.1007/s11633-019-1207-6

Localization and Classification of Rice-grain Images Using Region Proposals-based Convolutional Neural Network

Author Biography:
  • Kittinun Aukkapinyo received the B. Sc. degree in information and communication technology from Faculty of Information and Communication Technology, Mahidol University, Thailand in 2019. He is currently a Data Scientist with Wongnai Media Co., Ltd, Bangkok, Thailand. His research interests include pattern recognition, computer vision, multimedia information retrieval, and machine learning. E-mail: kittinun.auk@gmail.com ORCID iD: 0000-0002-1095-0320

    Suchakree Sawangwong received the B. Sc. degree in information and communication technology from Mahidol University, Thailand in 2019. He is currently a Unity Developer with Proudia, Bangkok, Thailand. His research interests include image processing, computer vision, multimedia, and machine learning. E-mail: suchakree.sri@gmail.com

    Parintorn Pooyoi received the B. Sc. degree in information and communication technology from Mahidol University, Thailand in 2019. He is currently a Java Developer with Siam Commercial Bank, Bangkok, Thailand. His research interests include image processing, computer vision, multi-thread programming, machine learning, and deep learning. E-mail: parintorn.poo@gmail.com

    Worapan Kusakunniran received the B. Eng. degree in computer engineering from the University of New South Wales (UNSW), Australia in 2008, and the Ph.D. degree in computer science and engineering from UNSW, in cooperation with the Neville Roach Laboratory, National ICT Australia, Australia in 2013. He is currently a lecturer with the Faculty of Information and Communication Technology, Mahidol University, Thailand. He is the author of several papers in top international conferences and journals. He served as a program committee member for many international conferences and workshops. Also, he has served as a reviewer for several international conferences and journals, such as International Conference on Pattern Recognition, IEEE International Conference on Image Processing, IEEE International Conference on Advanced Video and Signal based Surveillance, Pattern Recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on Image Processing, IEEE Transactions on Information Forensics and Security, and IEEE Signal Processing Letters. He was a recipient of the ICPR Best Biometric Student Paper Award in 2010, and also a winner of several national and international innovation contests. His research interests include biometrics, pattern recognition, medical image processing, computer vision, multimedia, and machine learning. E-mail: worapan.kun@mahidol.edu (Corresponding author) ORCID iD: 0000-0002-2896-611X

  • Received: 2019-06-12
  • Accepted: 2019-10-29
  • Published Online: 2019-12-17



Abstract: This paper proposes a solution to the localization and classification of rice grains in an image. All existing related works rely on conventional machine learning approaches. However, those techniques perform poorly on the problem addressed in this paper, due to the high similarity between different types of rice grains. A deep learning based solution is therefore developed. It contains pre-processing steps of data annotation using the watershed algorithm, auto-alignment using the major axis orientation, and image enhancement using the contrast-limited adaptive histogram equalization (CLAHE) technique. Then, the mask region-based convolutional neural network (R-CNN) is trained to localize and classify rice grains in an input image. The performance is enhanced by using transfer learning and dropout regularization for overfitting prevention. The proposed method is validated in many experimental scenarios, reported in the form of mean average precision (mAP) and confusion matrices. It achieves above 80% mAP in the main scenarios. It is also shown to outperform human experts.

    • In many countries, rice is one of the main food supplies for daily life. When rice grains are traded between farmers and companies, one factor determining the price is the purity of each particular type of rice grain. The main type often comes partially mixed with other types. Verifying purity is a very important process, since each type of rice grain has a different selling price. Currently, most rice mills, for example in Thailand, classify rice grains manually by random sampling from a big chunk. This process relies on the expertise of each individual human operator. As time passes, the knowledge of rice grain classification can be lost and is unsustainable. Also, since this process is done by a human expert/staff member at the rice mill, it can lead to many problems.

      For example, the existing process of classifying types of rice grains in the trading process is quite critical. Human experts in each rice mill have unequal skills and experience. In other words, the classification performance of human experts can vary in terms of correctness and operation time. In addition, there is an issue of the standard of rice grain classification, since the staff of the mill cannot control such standards in the trading process. It is troublesome to investigate and justify the trading process in each mill. Therefore, it would be better if technology from the computer vision field were applied to address this issue.

      Because of the afore-mentioned problems, this paper develops a new solution to automatically classify types of rice grains in digital images. This could help to standardize the rice trading process using the proposed machine vision-based solution. An input image used in the proposed solution may contain any number of rice grains in any rotations and any alignments. However, the grains must not overlap each other. The proposed method automatically localizes and classifies each rice grain using a mask region-based convolutional neural network (R-CNN) based approach[1,2].

      There are some existing research works related to this paper[3-8]. Most of them aim to classify a grain into its corresponding class using traditional machine learning algorithms. They had objectives similar to this paper. However, our proposed method differs significantly from their solutions. Our work uses a deep learning framework that is capable of both object detection/localization and classification. It can learn low-level to high-level features of rice grains in an image and can achieve high accuracy on a set of grains that share similar physical appearances. Moreover, it can localize and classify each grain in an input image. In contrast, in the related works, an input image must contain only one grain. Therefore, the proposed solution in this paper is more practical for real-world usage.

      For example, Liu et al.[3] applied a neural network (NN) based on 14 morphological features and 7 color features to classify 6 types of rice seeds. The seed images were captured using a charge-coupled device (CCD) color camera under controlled illumination. Principal component analysis (PCA) was also applied to reduce the feature dimension to four principal components before passing them to the NN. They reported accuracies in a range of 74% to 95%. Bhensjaliya and Vasava[4] reviewed research work on the classification of 4 types of rice seeds. The input images were captured using an 8-megapixel digital camera with 1.5 $ \mu $m pixels. They used shape-based and color-based information for seed segmentation and classification. They could achieve 100% accuracy because the 4 types of rice seeds had absolutely different shapes and colors. Kuo et al.[5] proposed a method to classify 30 types of rice seeds. The input images were captured using several sources including a digital camera (EOS 450D, Canon), a microscope (BXFM, Olympus), a 2X objective lens (PLN UIS2, Olympus), and a ring-shaped LED illuminator. They relied on sparse-representation-based classification. The morphological, color, and textural traits of the grain body, sterile lemmas, and brush were quantified. They reported an average accuracy of 89.1%.

      In addition, Yi et al.[6] proposed a method to identify seeds that share some common morphological traits, using a multi-kernel SVM. The input images were captured using a microscope, with seeds placed on a glass slide. Image thresholding of the saturation channel in the hue-saturation-value (HSV) color model was used for seed segmentation. Colors, shapes, and textures were extracted as the main features. Color histograms were constructed for the a and b channels of the Lab color space. The histogram of curvature (HoC) was used to describe the shapes of the seeds since it can describe shape information at various scales. Finally, the scale-invariant feature transform (SIFT), speeded up robust features (SURF) and root SIFT were used to describe the texture information of seeds. They reported an average accuracy of 97%.

      Wang and Cai[7] introduced a method to classify 91 types of weed seeds. They used low-level image feature extraction for the seed classification. The low-level features included size, shape, color, and texture. The high-level features were extracted using the principal component analysis network (PCANet). Then, the local mean-based nonparametric classifier (LMC) method was employed as the main classifier. They reported an average accuracy of 64.8%. Rexce and Usha[8] proposed a method to identify 13 types of rice seeds. They extracted 57 features including 5 shape and size features, 48 color features, and 4 texture features. Four different classification techniques, including the artificial neural network (ANN), support vector machine (SVM), decision tree (DT), and Bayesian network (BN), were attempted for the classification. They reported that the ANN achieved the highest classification accuracy of 92.3%.

      However, the types of seeds tested in all existing works were significantly different from each other. In contrast, the types of rice grains used in our work are very similar to each other, as can be seen in the sample image in the experimental section. High-level features and conventional classification techniques are not sufficient to solve our research problem. Thus, the proposed method is developed based on a deep learning approach. In this paper, 5 main types of Thai rice grains are used in the experiments, including 1 type of sticky rice and 4 types of paddy rice. These 5 types look nearly the same in terms of shape, color, and texture.

      The proposed method begins with the auto-alignment process to adjust the orientation of all rice grains in an input image. Then, the mask R-CNN[1] is trained and validated for rice grain classification. The proposed method contains many detailed steps, as described in the sections below. It is evaluated in many scenarios and also compared with the performance of human experts in the field.

      The rest of this paper is organized as follows. The method is proposed in Section 2. The experiment and discussion are described in Sections 3 and 4 respectively. Then, the conclusions are drawn in Section 5.

    • The proposed framework for developing the rice grain classifier consists of 4 processes: rice-grain image capturing, data preparation, data modeling, and model evaluation, as shown in Fig. 1. Each process has several sub-processes involving different techniques. The rice-grain images were captured in a laboratory setting. Then, each captured image is pre-processed with the automatic steps of cropping, scaling, and auto-alignment. Finally, the data is split into a training set, a validation set, and a testing set.

      Figure 1.  Overview framework of the proposed solution

    • Rice-grain image capturing is the process of capturing rice grains in different settings using a full-frame digital camera to obtain high-resolution images. The capturing settings were controlled in the laboratory in terms of camera distance, luminance, and white balance. A digital camera mounted on a tripod is used to capture images at a close distance of approximately 30 cm. Then, captured images are cropped to a square aspect ratio. Moreover, they are scaled in order to have a similar size of grains at the center of an image.

    • The data preparation here concerns the image annotation. It is the process of creating an annotation file for each rice grain image by extracting the area, or contour, of each rice grain from the training images using the marker-based watershed algorithm[9,10]. A detailed explanation is given in the sub-section below.

      The mask R-CNN[1] and transfer learning[11,12] were used in data modeling as the main methods. Then, dropout regularization was applied in the network to improve the model. The process of data modeling is shown in Fig. 2. The mask R-CNN is created in its training mode. Our mask R-CNN architecture is depicted in Fig. 3. All images in the training and validation datasets are passed into the network with their ground truths from the annotation files. In each training epoch, the mask R-CNN extracts annotated objects from each image and passes them into the residual network and feature pyramid network. This serves as the feature extractor that can learn low-level to high-level features. Then, a training loss is calculated after it learns features from each image. At the end of each epoch, model weights are generated. They are used for calculating the validation loss, where the mask R-CNN is applied to detect and classify grains in the validation dataset, which is unseen during training. Data modeling is stopped when the validation loss starts to increase or remains essentially unchanged for several consecutive epochs.

      Figure 2.  Data modeling process

      Figure 3.  Architecture of mask R-CNN used in the proposed solution

      Although deep learning can produce a high-performance object detection model, it requires an enormous amount of training data to be fed into the network during the learning process. Therefore, transfer learning is used to initialize the weights of the deep learning network, helping with convergence and reducing the consumption of training resources. Pre-trained weights of the same CNN architecture from one task can be used as initial weights for a new task. Therefore, pre-trained weights are loaded into the CNN architecture of the mask R-CNN before it starts the training process with the parsed training dataset. This process can be repeated by adjusting the hyper-parameters and modifying some parts of the neural network architecture in case the performance of the model is not satisfactory.

    • Marker-based watershed is an algorithm that can be used to perform image segmentation. It treats a grayscale image as a topographic surface, with each pixel's intensity as a height[9]. For example, dark areas can be considered low and luminous areas high. If the image is imagined to be flooded with water, the lines where water from different basins merges can be considered the boundaries of the foreground objects. The watershed algorithm is used in the annotation process to segment rice grains from the black background in the captured image and keep them as contours. In this paper, contours are extracted using the topological analysis of digitized binary images algorithm.
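      As an illustration, the marker-based watershed segmentation could be implemented with OpenCV roughly as follows. This is a minimal sketch under stated assumptions, not the authors' exact code; the file name, kernel size, and distance-transform threshold are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("rice_grains.jpg")  # hypothetical file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Grains are bright on the black background, so Otsu thresholding gives
# a rough foreground mask.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Sure background (dilation) and sure foreground (distance-transform peaks).
kernel = np.ones((3, 3), np.uint8)
sure_bg = cv2.dilate(binary, kernel, iterations=3)
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(sure_bg, sure_fg)

# Markers: each sure-foreground component gets its own label;
# 0 marks the unknown region to be resolved by the watershed flooding.
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0
markers = cv2.watershed(img, markers)  # boundary pixels become -1

# Recover one contour per grain from the per-label masks.
contours = []
for label in range(2, markers.max() + 1):
    mask = np.uint8(markers == label) * 255
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours.extend(cnts)
```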

    • The pre-processing in this paper consists of 3 sub-processes: cropping, scaling, and auto-alignment. The captured image must be cropped to a square image before entering the mask R-CNN. The cropped image is set to 1 024 × 1 024 pixels because this is the default setting of the mask R-CNN framework. To control the environment, the rice grain classifier has to keep the length of rice grains in a predicting image at the same scale as in the training images. Contours and major axes are extracted from the predicting image for computing a scaling value, as shown in the second row of Fig. 4. In (1), $S$ is the scaling value, $ALP$ is the average rice grain length in the predicting image, and $ALT$ is the average rice grain length in the training images. The original image size is then multiplied by the scaling value.

      Figure 4.  Sample input and step-by-step outputs of the proposed method. The first row shows an input image. The second row shows segmented contours of detected rice grains. The third row shows the resulting image of auto-aligned rice grains. The fourth row shows the classification outputs of detected rice grains with confidence scores.

      $ S = \dfrac{ALP}{ALT}. \qquad (1)$
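      The computation of (1) could be sketched as follows, reusing the contours and image from the watershed sketch above. Approximating the major-axis length by the longer side of the minimum-area bounding rectangle, and the training-set average of 120 px, are assumptions made for illustration, as is the direction of the resize.

```python
import cv2
import numpy as np

def average_grain_length(contours):
    # Longer side of each grain's rotated bounding rectangle, averaged.
    return float(np.mean([max(cv2.minAreaRect(c)[1]) for c in contours]))

ALT = 120.0                           # assumed average grain length in training images (px)
ALP = average_grain_length(contours)  # average grain length in the predicting image
S = ALP / ALT                         # equation (1)

# Rescale the predicting image so its grains match the training-set scale.
h, w = img.shape[:2]
img_scaled = cv2.resize(img, (int(w / S), int(h / S)))
```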

      In addition, auto-alignment is a new technique developed for automatically adjusting the alignment and position of each grain captured in a random-alignment setting. The first step is to extract information from an input image that may have randomly aligned grains. The extracted information contains the number of grains and, for each grain, its major axis, center, rotation, and bounding box. A contour is the set of positions around each grain, found using the watershed algorithm. After contours are extracted from an input image, the maximum length of the major axis of each grain is found by searching for the maximum distance between two points on its contour. Next, the rotation degree is calculated from the slope of the major axis. Then, the program computes the transformation matrix M to apply to the image. In (2), M is a 3 × 3 transformation matrix, dst is the output image, and src is the input image. After that, the program rotates and crops the bounding box of each grain in the original image. The next step is to create an output image. First, the program creates a new image with a size of 1 024 × 1 024 pixels. The background of the output image is a repetition of background areas from the input image. Next, the program places the rotated grains at the locations of their corresponding bounding boxes from the previous step.

      $ dst(x,y) = src\left( \dfrac{M_{11}x+M_{12}y+M_{13}}{M_{31}x+M_{32}y+M_{33}},\ \dfrac{M_{21}x+M_{22}y+M_{23}}{M_{31}x+M_{32}y+M_{33}} \right). \qquad (2)$
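      A condensed sketch of the per-grain rotation step is given below: the farthest pair of contour points defines the major axis, its slope gives the rotation angle, and cv2.warpAffine applies the mapping of (2) with an affine matrix (last row fixed to [0, 0, 1]). The function name and the crop size are illustrative assumptions.

```python
import cv2
import numpy as np

def align_grain(img, contour, out_size=128):
    # Major axis: the farthest pair of points on the grain's contour.
    pts = contour.reshape(-1, 2).astype(np.float32)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    (x1, y1), (x2, y2) = pts[i], pts[j]

    # Rotation angle from the slope of the major axis.
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
    cx, cy = pts.mean(axis=0)

    # Rotate about the grain centre so the major axis becomes horizontal.
    M = cv2.getRotationMatrix2D((float(cx), float(cy)), angle, 1.0)
    rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

    # Crop a fixed-size box around the rotated grain.
    half = out_size // 2
    return rotated[int(cy) - half:int(cy) + half,
                   int(cx) - half:int(cx) + half]
```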
    • Mask R-CNN[1] is an extension of the faster R-CNN framework[13] that can perform pixel-level segmentation. In other words, it can extract the area of a detected object from the background. It can be run in two modes of operation: training and inference. In training mode, model weights are learned from the training dataset with its annotation files; the network can be trained from scratch or from pre-trained weights. The other mode is inference mode, which is used for detecting and identifying objects using the trained model weights. The model weights are loaded into the network, and the mask R-CNN can then detect and classify the objects found in input images. The architecture of the mask R-CNN is shown in Fig. 3.

      In this research work, ResNet50 is used as the main backbone architecture in our proposed framework. It is 50 layers deep, as can be seen in [14]. The input to our mask R-CNN framework is an image of size 1 024 × 1 024 pixels containing 10 rice grains, as shown in the first row of Fig. 4. In the output layer, the number of nodes varies depending on the number of classes (i.e., types of rice grains) in each scenario.
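      For concreteness, the network could be instantiated in training mode with the matterport library[16] along the following lines; the values mirror the settings described in this paper, but the exact configuration class is an assumption.

```python
from mrcnn.config import Config
from mrcnn import model as modellib

class RiceConfig(Config):
    NAME = "rice"
    BACKBONE = "resnet50"            # ResNet50 + FPN feature extractor
    NUM_CLASSES = 1 + 5              # background + 5 rice-grain types
    IMAGE_MIN_DIM = 1024             # the framework's default input size
    IMAGE_MAX_DIM = 1024
    IMAGES_PER_GPU = 1               # fits a single GTX 1080's memory
    DETECTION_MIN_CONFIDENCE = 0.7   # discard boxes below 70% confidence

model = modellib.MaskRCNN(mode="training", config=RiceConfig(),
                          model_dir="./logs")
```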

    • Transfer learning allows us to transfer knowledge from one task to a similar task[11]. It is the practice of transferring neural network weights between similar tasks. The pre-trained network's weights are used as initial weights in the retraining process for a new task, which helps to reduce the time and resources used in data modeling with a deep neural network. Apart from reduced training time and resources, transfer learning requires much less training data compared to training a model from scratch. One of the drawbacks of the deep learning approach is that an insufficient amount of training data can lead to overfitting of the model.

      Also, a model might not be able to converge with an insufficient amount of data. However, transfer learning allows us to retrain the network from existing pre-trained weights. Therefore, it allows us to develop a deep learning model with less training data and time, and it can help a model to converge even if only a little training data is available. The fully connected layers are replaced with a head specifically designed for our problem of 5 classes of rice grains.
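      With the matterport library, this weight transfer could look as follows: the COCO pre-trained weights are loaded by layer name, while the class-specific head layers are excluded so that they are re-initialized for the rice-grain classes. This is a sketch assuming the model object from the previous snippet.

```python
# Load COCO pre-trained weights, skipping the heads whose shapes depend
# on the number of classes (those are trained from scratch).
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
```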

    • Dropout is a simple approach to overfitting prevention in deep learning. It can be used to prevent or reduce overfitting in neural networks with many hidden layers[15]. Its concept is to randomly ignore some neurons with probability 1 − p during training by inserting dropout layers into the network. This temporarily drops some neurons out of the network, reducing its complexity. The technique can therefore help to reduce co-dependency among neurons during the training process, which otherwise can lead to overfitting. Normally, it should be applied right before the fully connected layers, i.e., before the classifier layer, with a dropout rate of 0.5. As a result, better generalization of the model can be achieved, since the network tends to learn or extract fewer features than the normal network architecture.
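      The placement can be illustrated with a generic classifier head in Keras; this is not the authors' exact architecture, only a sketch of where a dropout layer with rate 0.5 would sit relative to the classifier layer, with an assumed feature-map shape.

```python
from tensorflow.keras import layers, models

head = models.Sequential([
    layers.Input(shape=(7, 7, 256)),         # assumed feature-map shape
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                     # each unit kept with p = 0.5
    layers.Dense(5, activation="softmax"),   # 5 rice-grain classes
])
```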

    • There are 4 experimental scenarios in this research. In each scenario, a model is trained with a different setting to find the best training setting. There are 5 types of Thai rice grains in our experiments, as shown in Fig. 5. They are RD6, RD15, RD23, RD75, and ML105. RD6 is a type of Thai sticky rice, while the rest are subtypes of Thai paddy rice. The paddy rice grains in this research project have very similar physical appearances, so it is difficult to distinguish them, even for human experts.

      Figure 5.  Sample images of rice grains targeted in this research work

      The first step is to digitize the obtained physical rice grains into digital images. A Sony Alpha a5100, a commercial digital camera, is used as the capturing tool. The capturing process is done in a laboratory-controlled environment with the same camera settings and background scene. Black paper is used as the background since it makes it convenient to annotate rice grains in an image. There are 10 rice grains in each capturing scene. All captured images were cropped to a square aspect ratio and re-sized down to 1 024 × 1 024 pixels.

      Several datasets were created with different capturing settings and target classes to find the best model in each scenario from the captured images. The ratio of the number of images in the training dataset to the validation dataset is approximately 80 : 20 for scenarios 1, 2, and 3. The training dataset is used for modeling the data from rice grain images. The validation dataset is then used for fine-tuning the data modeling in order to find the best model during the training process.

      However, MIMR8, the dataset for the fourth scenario, is split into three sub-datasets: training, validation, and testing. It was prepared for the performance comparison between our trained classifier and real human experts in the real-world scenario. A testing dataset is needed for the comparison with real human experts since it is unseen by the model during the training and fine-tuning phases. This dataset has 3 target classes: RD6, good quality paddy rice (including RD15 and ML105), and bad quality paddy rice (including RD23 and RD75). The details of each dataset are shown in Table 1.

| Dataset | Capturing setting | Target classes | # Training images / # Training grains | # Validation images / # Validation grains |
|---|---|---|---|---|
| MIMR1 | Manual alignment; same orientation; same flip; only 1 side captured; some grains reused | RD15, RD23, RD75, ML105 | 104 / 1 040 | 21 / 210 |
| MIMR2 | Same as MIMR1 but with closer camera distance | RD15, RD23, RD75, ML105 | 104 / 1 040 | 22 / 220 |
| MIMR3 | Same as MIMR2 | Sticky rice, paddy rice | 162 / 1 620 | 42 / 420 |
| MIMR4 | Random alignment; 2 sides of grains captured | Sticky rice, paddy rice | 320 / 1 600 | 80 / 400 |
| MIMR5 | Same as MIMR2 | RD6, RD15, RD23, RD75, ML105 | 131 / 1 310 | 27 / 270 |
| MIMR6 | Random alignment | RD15, RD23, RD75, ML105 | 104 / 1 040 | 22 / 220 |
| MIMR7 | Random alignment | RD6, RD15, RD23, RD75, ML105 | 131 / 1 310 | 27 / 270 |
| MIMR8 | Random alignment | RD6, good quality paddy, bad quality paddy | 72 / 720 | 18 / 180 |

Note: There are 10 rice grains in one image.

      Table 1.  Details of 8 datasets used in the experiments

      All images in each dataset were annotated, starting with segmenting grains using the watershed algorithm. After that, we annotated each segmented grain with its corresponding class and saved its location in the image into a JSON file. The annotation files of our project are saved in JSON format for ease of use with the mask R-CNN library from matterport[16].
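      The paper does not give the exact annotation schema. A plausible sketch, following the VIA-style polygon format used by the matterport samples, with made-up coordinates and keys, is shown below.

```python
import json

# Hypothetical annotation entry: one polygon region per grain, labelled
# with its rice type.
annotation = {
    "rice_001.jpg": {
        "filename": "rice_001.jpg",
        "regions": [{
            "shape_attributes": {
                "name": "polygon",
                "all_points_x": [412, 436, 449, 431, 410],
                "all_points_y": [305, 301, 334, 352, 330],
            },
            "region_attributes": {"class": "RD15"},
        }],
    }
}
with open("via_rice_annotations.json", "w") as f:
    json.dump(annotation, f, indent=2)
```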

      After the datasets are prepared, they are parsed into the mask R-CNN library to create a data model. This library uses Python3 and the TensorFlow framework to create a deep learning model and run inference on images. The mask R-CNN network is created in training mode, and pre-trained weights from the Microsoft COCO dataset[17] are loaded into the network. The pre-trained weights are provided with the library from matterport. The computational resource is a single NVIDIA GeForce GTX 1080 graphics card on a standalone desktop computer; it is a mid-tier graphics processing unit (GPU) at an affordable price. Due to its memory limitation, training took around 2 s per image.

      Experiments are conducted by creating a model that classifies the target classes in each scenario. Some hyperparameters such as the learning rate and weight decay rate are fixed for all scenarios. The learning rate is set to 0.001 to prevent the weights in the network from exploding, and ResNet50[14] is used as the CNN architecture for all models. The number of steps per epoch is set to the number of images in the training dataset. After the hyperparameters are defined, we train epoch by epoch and visualize the training results using the loss at the end of each epoch. In general, it takes 5-6 hours per experimental setting in each scenario.
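      Under the assumptions of the earlier snippets, the per-scenario training call could be sketched as follows; the learning rate matches the fixed value above, while the epoch count and trained-layer selection are illustrative.

```python
# train_dataset / val_dataset are mrcnn Dataset objects built from the
# annotated images of the scenario at hand.
model.train(train_dataset, val_dataset,
            learning_rate=0.001,
            epochs=100,          # stopped early once validation loss stalls
            layers="all")        # fine-tune the whole network, not only heads
```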

      Two evaluation metrics are used to validate the trained model: the mean average precision (mAP) and the confusion matrix. The mean average precision is used to evaluate the overall performance of a model, and the confusion matrix describes its performance on the classification task. The minimum detection confidence is 0.7, so any positive bounding boxes with confidence less than 70% are discarded. In addition, step-by-step sample outputs of the proposed method are illustrated in Fig. 4.
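      Evaluation could be sketched with the matterport utilities by computing the average precision at IoU 0.5 per validation image and taking the mean; the inference-mode model and its inference_config are assumed to be set up separately.

```python
import numpy as np
from mrcnn import utils
import mrcnn.model as modellib

aps = []
for image_id in val_dataset.image_ids:
    image, _, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
        val_dataset, inference_config, image_id)
    r = model.detect([image], verbose=0)[0]   # model in inference mode
    ap, _, _, _ = utils.compute_ap(gt_bbox, gt_class_id, gt_mask,
                                   r["rois"], r["class_ids"],
                                   r["scores"], r["masks"],
                                   iou_threshold=0.5)
    aps.append(ap)
print("mAP@0.5:", np.mean(aps))
```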

    • This scenario is to create a model for classifying a rice grain into sticky and paddy rice using the MIMR3 and MIMR4 datasets. The target classes of this scenario are sticky and paddy rice. Rice grains were manually aligned with the same orientation and flip in MIMR3, while they are randomly aligned in MIMR4. Both datasets were parsed into our mask R-CNN, and the learning in each epoch is shown in Fig. 6. From the training and validation loss graphs, the training losses of both datasets look identical, while the validation loss of MIMR3 tends to be better than that of MIMR4. From the evaluation process, MIMR3 achieved the highest mAP, and its classification performance is described using a confusion matrix, as shown in Table 2. There are no loss predictions and no false positives at all.

MIMR3 (mAP: 1.0)

| Actual \ Predicted | Sticky rice | Paddy rice | Total |
|---|---|---|---|
| Sticky rice | 200 | 0 | 200 |
| Paddy rice | 0 | 220 | 220 |
| Total | 200 | 220 | 420 |

      Table 2.  A confusion matrix of MIMR3 on its validation dataset

      Figure 6.  A learning graph visualization for each setting in scenario 1

    • This scenario is to create a model for classifying rice grains into 4 subtypes of paddy rice: RD15, RD23, RD75, and ML105. There are 3 datasets used in this scenario: MIMR1, MIMR2 and MIMR6. All rice grains in MIMR1 and MIMR2 were manually aligned during the capturing process, while they are randomly aligned in MIMR6. The difference is that the camera distance for MIMR2 is closer to the rice grains than for MIMR1. The learning of each dataset in each training epoch is shown in Fig. 7. From the training graph, all datasets had fluctuating training losses, but all of them achieved lower training loss after several epochs. MIMR1 has better loss values in both the training and validation phases than MIMR2. The validation loss of MIMR6 fluctuated and is roughly on par with MIMR2.

      Figure 7.  A learning graph visualization for each setting in scenario 2

      There are three attempts to improve the performance of the model. The first is to apply the auto-alignment function, which automatically aligns rice grains to have the same size, direction, and alignment as those in MIMR1 and MIMR2. The next two attempts concern the network's architecture and the images; MIMR2 is used for both, to compare learning results with the normal training setting. The second attempt is to reduce the complexity of the convolutional neural network architecture by using dropout regularization[15]. Dropout layers with a small dropout rate were inserted at random points in the convolutional neural network, which is the ResNet50. The third attempt is to pre-process images before parsing them into the network by using the contrast-limited adaptive histogram equalization (CLAHE) technique[18].
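      A minimal sketch of the CLAHE pre-processing[18] is shown below, applied to the lightness channel so that grain colors are preserved; the clip limit and tile size are illustrative assumptions.

```python
import cv2

img = cv2.imread("rice_grains.jpg")  # hypothetical file name
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)

# Equalize contrast locally on the L (lightness) channel only.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
img_clahe = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)),
                         cv2.COLOR_LAB2BGR)
```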

      The learning of the model with the three attempts is shown in Fig. 7. From the training results, the training loss of MIMR6 with auto-alignment is high for the first 65 epochs; it then becomes on par with MIMR1. The trends of the training and validation losses of MIMR2 improved after the two techniques were applied. However, there are no significant differences in training and validation loss between MIMR1 and MIMR2 after we applied dropout regularization and contrast-enhancing pre-processing.

      From the evaluation process, MIMR2 with increased contrast achieved the highest mAP, and its classification performance is described using a confusion matrix, as shown in Table 3. Overall, the confusion matrix looks promising, but there is a significant confusion between the RD75 and ML105 classes.

MIMR2-IncCon (mAP: 0.817)

| Actual \ Predicted | RD15 | RD23 | RD75 | ML105 | Total |
|---|---|---|---|---|---|
| RD15 | 57 | 1 | 1 | 0 | 59 |
| RD23 | 0 | 50 | 4 | 3 | 57 |
| RD75 | 1 | 4 | 34 | 2 | 41 |
| ML105 | 0 | 2 | 12 | 45 | 59 |
| Total | 58 | 57 | 51 | 50 | 216 |

      Table 3.  A confusion matrix of MIMR2 with increased contrast on its validation dataset

    • This scenario is to create a model for classifying rice grains into 5 subtypes: RD6, RD15, RD23, RD75, and ML105. MIMR5 and MIMR7 were used in this scenario since they contain these target classes. Rice grains in MIMR5 are aligned manually, while those in MIMR7 are arranged randomly. In Fig. 8, the training loss graphs of all datasets are mostly identical to each other. However, MIMR5 had a better validation loss than MIMR7. So, there was an attempt to improve the performance on MIMR7 by using the auto-alignment function; it achieves a better validation loss but is still not as good as MIMR5.

      Figure 8.  A learning graph visualization for each setting in scenario 3

      From the evaluation process, MIMR5 with manual alignment achieved the highest mAP, and its classification performance is described using a confusion matrix, as shown in Table 4. Overall, the confusion matrix seems promising, but there is still a notable confusion between RD75 and ML105, which are paddy rice grains. In other words, our classifier confused subtypes of paddy rice.

MIMR5 (mAP: 0.798)

| Actual \ Predicted | RD6 | RD15 | RD23 | RD75 | ML105 | Total |
|---|---|---|---|---|---|---|
| RD6 | 50 | 0 | 1 | 0 | 0 | 50 |
| RD15 | 0 | 51 | 7 | 0 | 0 | 58 |
| RD23 | 0 | 0 | 47 | 4 | 5 | 56 |
| RD75 | 0 | 0 | 2 | 30 | 9 | 41 |
| ML105 | 0 | 0 | 0 | 9 | 48 | 57 |
| Total | 50 | 51 | 56 | 43 | 62 | 262 |

      Table 4.  A confusion matrix of MIMR5 on its validation dataset

    • This scenario is to create a model for classifying rice grains in the real-world scenario, in which grains are classified by their market price into 3 classes: 1) sticky rice, 2) good quality paddy rice, and 3) bad quality paddy rice. The good and bad qualities of paddy rice are specified by the type of rice grain: RD15 and ML105 are good quality paddy rice, while RD23 and RD75 are bad quality paddy rice. In addition, a sticky rice grain also has a different market value compared to paddy rice. So, the target classes in this scenario are sticky rice, good quality paddy rice, and bad quality paddy rice. The only dataset used in this scenario is MIMR8, which contains these target classes in the random grain alignment setting. An experiment is conducted by comparing the performance of a model before and after applying the auto-alignment technique.

      The learning results are visualized in Fig. 9. The training loss graph of MIMR8 is almost identical after applying the auto-alignment technique. The validation loss of MIMR8 with random alignment is highest at epoch 10 and then improves as epochs pass. The validation losses in the random-alignment and auto-alignment settings keep decreasing; however, they are not significantly different in the end.

      Figure 9.  A learning graph visualization for each setting in scenario 4

      Since this is a scenario for real-world usage, MIMR8 has a testing dataset for comparing the performance of our classifier with real human experts. The model trained on MIMR8 with random alignment is used as our classifier since it has the highest mAP on the testing dataset in this scenario. There are 4 human experts who participated in our experiment. They were asked to classify each grain in the images of the MIMR8 testing dataset. In practice, human experts can use more information than 2D images to classify grains.

      For example, they can use sensory information such as touch and smell, which cannot be obtained from a 2D image. However, our proposed framework uses only 2D visual information to classify grains. So, the comparison against real human experts is done by allowing them to use only visual information in 2D. Hence, classification results from human experts are based on their vision in 2D. The confusion matrices and accuracies of our classifier and the real human experts are shown in Table 5. As a result, our classifier has a higher accuracy than the real human experts on average when grains are classified based on 2D images.

MIMR8-Random (accuracy: 0.81)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 82 | 0 | 16 | 98 |
| Good quality | 0 | 64 | 32 | 96 |
| Bad quality | 2 | 7 | 91 | 100 |
| Total | 84 | 71 | 139 | 294 |

Expert # 1 (accuracy: 0.58)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 79 | 13 | 8 | 100 |
| Good quality | 2 | 66 | 32 | 100 |
| Bad quality | 0 | 71 | 29 | 100 |
| Total | 81 | 150 | 69 | 300 |

Expert # 2 (accuracy: 0.61)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 71 | 27 | 2 | 100 |
| Good quality | 0 | 99 | 1 | 100 |
| Bad quality | 0 | 87 | 13 | 100 |
| Total | 71 | 213 | 16 | 300 |

Expert # 3 (accuracy: 0.69)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 90 | 7 | 3 | 100 |
| Good quality | 3 | 93 | 4 | 100 |
| Bad quality | 3 | 73 | 24 | 100 |
| Total | 96 | 173 | 31 | 300 |

Expert # 4 (accuracy: 0.76)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 85 | 4 | 11 | 100 |
| Good quality | 1 | 83 | 16 | 100 |
| Bad quality | 0 | 40 | 60 | 100 |
| Total | 86 | 127 | 87 | 300 |

      Table 5.  Comparison between the proposed classifier and real human experts

      From the evaluation process, MIMR8 with random alignment achieved the highest mAP, and its classification performance is described using a confusion matrix, as shown in Table 6. The model cannot clearly distinguish RD75 and ML105, which are subtypes of paddy rice grains. It is selected as the representative model to compare the performance of our proposed classifier with real human experts. The results of the comparison are shown in Table 5. MIMR8 with random alignment has the highest accuracy compared to the real human experts. It is noted that accuracy is calculated based on the confusion matrix; the real human experts have no loss predictions, and the loss predictions of our classifier are not significant and can be ignored.

MIMR8-Random (mAP: 0.825)

| Actual \ Predicted | Sticky | Good quality | Bad quality | Total |
|---|---|---|---|---|
| Sticky | 47 | 0 | 12 | 59 |
| Good quality | 0 | 51 | 10 | 61 |
| Bad quality | 2 | 1 | 57 | 60 |
| Total | 49 | 52 | 79 | 180 |

      Table 6.  A confusion matrix of MIMR8 on its testing dataset

      The mAP for all datasets and settings in all scenarios is shown in Table 7. It is noted that the evaluation in scenario 4 is done on its testing dataset, while scenarios 1, 2, and 3 are evaluated on their validation datasets. Hence, k-fold cross validation is applied to check whether the model in each scenario is overfitting. The value of k in this experiment is 10. Since there is a limitation on computational resources, k-fold cross validation is applied only to the dataset with the best setting in each scenario. The results of the k-fold cross validation are shown in Table 8. The average mAP over all folds is higher than the mAP of our trained model in each scenario.

| Scenario | # Training grains | mAP @ IoU = 0.5 | # Loss predictions |
|---|---|---|---|
| 1. Sticky and paddy rice | | | |
| 1.1 Manual alignment | 1 600 | 1.000 | 0 |
| 1.2 Random alignment | 1 600 | 0.955 | 9 |
| 1.3 Random alignment with auto-alignment | 1 600 | 0.979 | 0 |
| 2. 4 subtypes of paddy rice (RD15, RD23, RD75, ML105) | | | |
| 2.1 Manual alignment | 1 040 | 0.726 | 10 |
| 2.1.1 Closer camera distance | 1 040 | 0.768 | 4 |
| 2.1.2 Applying dropout regularization | 1 040 | 0.777 | 7 |
| 2.1.3 Preprocessing using CLAHE | 1 040 | 0.817 | 4 |
| 2.2 Random alignment | 1 040 | 0.641 | 7 |
| 2.3 Random alignment with auto-alignment | 1 040 | 0.701 | 5 |
| 3. 5 types of rice grains (RD6, RD15, RD23, RD75, ML105) | | | |
| 3.1 Manual alignment | 1 300 | 0.798 | 10 |
| 3.2 Random alignment | 1 300 | 0.576 | 15 |
| 3.3 Random alignment with auto-alignment | 1 300 | 0.705 | 23 |
| 4. Real-world scenario (sticky, good quality paddy, bad quality paddy) | | | |
| 4.1 Random alignment | 720 | 0.825 | 1 |
| 4.2 Random alignment with auto-alignment | 720 | 0.758 | 4 |

      Table 7.  Summarization table of evaluation in all scenarios

Dataset: MIMR2-IncCon

| Fold # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mAP (IoU = 0.5) | 0.834 | 0.915 | 0.938 | 0.853 | 0.827 | 0.938 | 0.844 | 0.860 | 0.741 | 0.883 | 0.8625 |
| Loss predictions | 5 | 3 | 1 | 1 | 3 | 3 | 4 | 6 | 4 | 4 | 3.4 |

Dataset: MIMR3

| Fold # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mAP (IoU = 0.5) | 1.000 | 1.000 | 1.000 | 1.000 | 0.985 | 0.995 | 1.000 | 1.000 | 1.000 | 0.989 | 0.9969 |
| Loss predictions | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 3 | 0.6 |

Dataset: MIMR5

| Fold # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mAP (IoU = 0.5) | 0.887 | 0.948 | 0.877 | 0.921 | 0.945 | 0.957 | 0.893 | 0.866 | 0.936 | 0.902 | 0.9032 |
| Loss predictions | 6 | 1 | 8 | 1 | 5 | 0 | 2 | 1 | 0 | 5 | 2.9 |

      Table 8.  Summarization table of k-fold validation
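      The fold generation can be sketched with scikit-learn; the image count shown corresponds to the MIMR5 training/validation pool, and each fold would retrain the network and evaluate mAP@0.5 as in the earlier snippets.

```python
import numpy as np
from sklearn.model_selection import KFold

num_images = 158  # assumed pool size (131 training + 27 validation images)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(num_images))):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation images")
```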

    • In this section, two methods are re-implemented and tested on the same datasets (i.e., manual alignment) of scenarios 1, 2 and 3 described in Table 7. The two methods are 1) a fusion of Gabor features and a local binary pattern (LBP) histogram[19], and 2) SIFT[6]. The multi-kernel SVM is used as the main classification tool for both methods[6,20].

      In Table 9, it can be seen that the proposed method outperforms the other two existing methods in all three scenarios. This is because rice grains of different types are very similar (low inter-class variation), while rice grains within each type are also varied (high intra-class variation). The high-level features used in the two existing methods could not deal sufficiently with this low inter-class variation and high intra-class variation. In contrast, the proposed method, relying on convolutional networks, could extract sufficiently varied features from the samples in the learning process, so that both the inter-class and intra-class variations are learned.

| Scenario | # Training grains | Method | Precision |
|---|---|---|---|
| Sticky and paddy rice | 1 600 | Gabor + LBP + SVM | 0.691 |
| | | SIFT + SVM | 0.722 |
| | | Proposed method | 1.000 |
| 4 subtypes of paddy rice (RD15, RD23, RD75, ML105) | 1 040 | Gabor + LBP + SVM | 0.582 |
| | | SIFT + SVM | 0.630 |
| | | Proposed method | 0.817 |
| 5 types of rice grains (RD6, RD15, RD23, RD75, ML105) | 1 300 | Gabor + LBP + SVM | 0.550 |
| | | SIFT + SVM | 0.596 |
| | | Proposed method | 0.798 |

      Table 9.  Experimental comparisons

    • In order to find the best setting in each scenario, the validation datasets were used in scenarios 1, 2 and 3 to identify the factors that affect the performance of the model. From this investigation, the manual alignment setting tends to be the best setting for classifying Thai rice grains. The experimental results and comparisons are shown in Table 9.

      In addition, scenario 4 has a separate testing dataset for testing the generalization of the model and for comparing its performance with the 2D visual ability of human experts in the rice mill.

      In the first scenario, the model for classifying sticky and paddy rice grains was created, and it achieves the highest accuracy regardless of the grain alignment. This might be because sticky and paddy rice grains have clear differences in their physical appearance. Moreover, the training and validation losses decreased quickly since the difference between these two classes is significantly clear. Therefore, the alignment of the grains did not matter much when there was a clear difference in physical appearance.

      In the second scenario, the task is to classify paddy rice grains into 4 subtypes: RD15, RD23, RD75 and ML105. The challenge is that they share very similar physical appearances. The training and validation loss graphs fluctuate and take a longer time to improve in all settings compared to the first scenario. This happens because it is hard even for normal human vision to classify them precisely. So, the accuracy of the best model in this scenario is acceptable even though it is not as high as in the first scenario. The manual alignment setting achieved the best accuracy among all settings in this scenario, but it would not be practical in a real-world situation. The auto-alignment technique was therefore applied to automatically bring rice grains in the random alignment setting into the same common alignment as in the manual alignment scenario.

      It resulted in increased accuracy, so the rotation of rice grains affects the performance of our classifier. Since the subtypes of paddy rice grains have similar physical appearances, rotation can confuse the classifier more easily than in the sticky versus paddy case. Apart from rotation, we found that the camera distance also contributes to the performance of the model, since the mAP slightly increased with a closer camera distance.

      In the third scenario, the model for classifying rice grains into 5 subtypes was created. It obtained the lowest accuracy on the validation dataset compared with all other scenarios. The training and validation loss graphs fluctuated at first, but kept improving after several epochs. This might be because of the increased number of target classes and the high similarity in physical appearance of paddy rice grains. As a result, the model might not be able to extract features efficiently, and it takes a longer training time to improve than in the first scenario.

      Since the evaluations of scenarios 1, 2, and 3 were done on validation datasets, k-fold cross validation was selected as an approach to check the generalization of the model. The selected k was 10, and it was applied to the best setting in each scenario. The average mAP over the 10 folds in each scenario was better than our mAP on the validation datasets. Therefore, there should be no overfitting issue.

      In this research work, 5 different types of rice grains are used, with a few hundred grains provided for each type. These therefore include both intra-class variations among different grains of each type and inter-class variations among grains of different types. The proposed method is shown to deal promisingly with both variations, as can be seen from the experimental results. For example, in the case of classifying sticky rice grains versus paddy rice grains (i.e., in Table 7), the proposed method is shown to address the challenges of both variations well by achieving mAPs of 100%, 95.5% and 97.9% in the experimental scenarios of manual alignment, random alignment, and random alignment with the auto-alignment step, respectively.

      In the last scenario, the model for the real-world situation was created. It was to classify rice grains into sticky, good quality paddy, and bad quality paddy rice grains. The testing dataset was used to compare the performance of the model with the best setting in this scenario against the real human experts. The training graphs from both the random and auto-alignment settings were almost identical. The random alignment setting had fluctuations in the validation loss graph at first, but the validation losses of both alignments were on par at the end of training. Our model performed better than the real human experts on average using 2D vision. However, human experts can investigate rice grains in more dimensions to gain more useful information for classification, such as smell and touch. Since our classifier could learn and perceive grains only in 2D, its performance was acceptable when compared to real human experts with the ability to extract features such as depth and weight of grains.

      In all scenarios, some rice grains in an input image could not be detected, as shown in Fig. 10, where there are loss detections in the result image and some rice grains are not recognized. The reason is that some proposals containing rice grains were discarded because they had a lower confidence score than the predefined threshold. This might happen because of the quality of the image in that area and flaws of the region proposal network. Most loss detections occurred in the third scenario.

      Figure 10.  A sample result containing a loss detection

      In future work, when applied to a real-world scenario, the proposed models for classifying types of rice grains should be used with images/videos of rice grains captured in a controlled boxset. The boxset can be built as a closed environment with controlled lighting sources and a camera capturing rice grains on a plate at the bottom. To handle large volumes, the plate can be replaced by an automatic belt transporting the sample rice grains to be examined.

    • This paper proposes a framework to develop data models that can classify and localize each rice grain in an input image. The data model is trained using the mask R-CNN with pre-trained weights from the COCO dataset. The proposed framework is mostly an iterative process consisting of data acquisition, data preparation, data modeling, and model evaluation. There are 5 types of Thai rice grains used in this research: RD6, RD15, RD23, RD75 and ML105, which have very similar physical characteristics. Several models are constructed to localize and classify each grain in an image in various scenarios. The best-performing model is the one classifying sticky and paddy rice grains, which achieves an mAP of 1.0 when the rice grains in an image are manually aligned. The worst scenario is classifying rice grains into 5 subtypes. The average mAP over all scenarios is approximately 0.75. In the end, we compared the developed classifier with human experts in the field. The trained classifier achieves an mAP of 0.8, which is better than the human experts on average.
