Yue-Yan Qin, Jiang-Tao Cao, Xiao-Fei Ji. Fire Detection Method Based on Depthwise Separable Convolution and YOLOv3. International Journal of Automation and Computing. doi: 10.1007/s11633-020-1269-5

Fire Detection Method Based on Depthwise Separable Convolution and YOLOv3

Author Biography:
  • Yue-Yan Qin is a master's student in control theory and control engineering at Liaoning Shihua University, China. Her research interests include image processing and intelligent video analysis. E-mail: 18341318515@163.com ORCID iD: 0000-0002-6225-3519

    Jiang-Tao Cao received the Ph. D. degree in intelligent control from University of Portsmouth, UK in 2009. Now, he is a professor and M. Sc. supervisor at Liaoning Shihua University, China. His research interests include intelligent methods and their applications, and video analysis. E-mail:

    Xiao-Fei Ji received the M. Sc. in control theory and control engineering from Liaoning Shihua University, China in 2003, and the Ph. D. degree in pattern recognition and intelligent systems from University of Portsmouth, UK in 2010. From 2003 to 2012, she was a lecturer with School of Automation, Shenyang Aerospace University, China. Since 2013, she has been an associate professor with Shenyang Aerospace University, China. She has published over 40 technical research papers and 3 books. She is the leader of National Natural Science Foundation Project (61103123) and six national and local government projects. Her research interests include vision analysis and pattern recognition, information processing and fusion. E-mail: (Corresponding author) ORCID iD: 0000-0001-8279-7727

  • Received: 2020-06-10
  • Accepted: 2020-11-16
  • Published Online: 2021-02-02


Abstract: Recently, video-based fire detection technology has become an important research topic in the field of machine vision. This paper proposes a fire detection method that combines a classification model and a target detection model from deep learning. Firstly, depthwise separable convolution is used to classify fire images, which saves a lot of detection time while maintaining detection accuracy. Secondly, the target regression function of You Only Look Once version 3 (YOLOv3) is used to output the fire position information for the images whose classification result is fire, which avoids the problem that detection accuracy cannot be guaranteed when YOLOv3 is used for both target classification and position regression. At the same time, the time spent on target regression for images without fire is saved. The experiments were conducted on a public online database. The detection accuracy reached 98% and the detection rate reached 38 fps. This method not only removes the workload of manually extracting flame characteristics and reduces the calculation cost and the number of parameters, but also improves the detection accuracy and detection rate.

    • Fire is a serious natural and social disaster. On the one hand, the occurrence of fire poses a great threat to people's lives and property[1]; on the other hand, it causes huge losses to the natural and socioeconomic environment. According to the statistics of the World Fire Statistics Center, the number of fires worldwide each year is strikingly high. In recent years, the incidence of fires has generally been on the rise, and the situation is very serious. Therefore, preventing and reducing fires as much as possible has always been a topic of active exploration.

      For many years, researchers have continued to research and experiment on fire detection methods. The earliest fire detection methods are based on sensors and, according to the sensor types and applications[2], can generally be divided into five categories: light-sensitive, temperature-sensitive, smoke-sensitive, gas-sensitive and composite. Because a fire releases heat and dense smoke, temperature and smoke sensors are commonly used. However, sensor-based detection has significant drawbacks in terms of detection range and detection speed[3]. Later, with the application and popularization of video surveillance technology, researchers obtained fire images through video surveillance and used their color characteristics to detect fires. However, fire detection using only color features has a high false detection rate[4].

      In recent years, as video surveillance technology has spread through many public and private fields, image processing technology in the field of machine vision has also made significant research progress. Through the video monitoring system, the color, shape change, texture structure, flicker and other related scene information of the fire image can be obtained intuitively[5], and the transmission and sensing speed have been improved. Therefore, fire detection technology based on computer vision came into being and promoted the diversity of fire detection methods. The fire detection method based on computer vision obtains fire images through video surveillance, manually extracts their features, and builds a detection model based on these features. Specific modeling methods can be divided into feature-level and decision-level model construction. The feature-level fusion fire detection method makes good use of the complementarity between different flame features, but it is not easy to fuse heterogeneous features. Consequently, researchers have further studied the decision-level fusion of multiple flame features. The decision-level fusion fire detection method has a certain fault tolerance, but its preprocessing cost is relatively high.

      At present, fire detection methods based on feature-level and decision-level modeling have made certain research progress. However, these methods rely on manually extracting visible features of the flame. These features only reflect the shallow characteristics of the flame, and information may be lost in the process of manual extraction. As research has continued, fire detection methods based on manually extracted traditional features have reached a bottleneck in terms of application scenarios, detection accuracy and detection speed. In recent years, with the success of convolutional neural networks (CNN) in static image classification and the breakthrough progress of deep learning theory in the field of machine vision, fire detection that uses the powerful feature representation and modeling ability of deep learning has important research value and application prospects[6]. Building on image classification with convolutional neural networks, researchers have introduced new target detection algorithms, such as two-step target detection based on region-based convolutional neural networks (R-CNN) and end-to-end target detection algorithms based on You Only Look Once (YOLO) and single shot multibox detector (SSD) networks. These algorithms and models make up for the problem that traditional convolutional neural networks can only classify but cannot locate fire targets.

    • The video and image of the flame have rich visible features, such as color features, texture features, flicker features, flame sharp angles, and shape changes. Among these features, the color feature is the earliest and most widely used. Shidik et al.[7] proposed the use of multiple color spaces as criteria for fire detection based on the uniqueness of the flame color. Han et al.[8] combined a variety of color feature rules to model fire detection methods after preprocessing the fire video. However, fire scenes are usually diverse, so using a single feature for fire detection will cause a high false detection rate.

      So, researchers prefer to use multi-feature fusion for fire detection. The fusion methods can be divided into two types: feature-level fusion and decision-level fusion. The feature-level fire detection method is to comprehensively analyze and process multiple features of the flame. Zeng et al.[9] used the weighting method to fuse multiple features of flames, and their weighting coefficients were obtained by an analytic hierarchy process. Prema et al.[10] used support vector machines to identify the flicker feature and texture feature of the flame for fire recognition. Prema et al.[11] used extreme learning machines for fire detection. The decision-level fusion fire detection method is to classify or identify each flame feature to form corresponding results, and then fuse the results to give the final decision. Shi et al.[12] proposed to use two color space discrimination rules and flame motion characteristics to perform fire recognition, and each recognition result was processed in parallel to obtain the final detection result. Foggia et al.[13] proposed to use a multi-expert system for fire identification and detection, and then fuse the recognition results to obtain the final classification result. Li et al.[14] used color attributes, geometric attributes and motion attributes to perform fire detection respectively, and the detection results obtained were fused again to obtain the final decision.

      Beyond manually extracting features and modeling for fire detection, many scholars have also begun to use deep learning models for fire detection. Frizzi et al.[15] proposed a fire detection method that works directly on feature maps using the classic convolutional neural network AlexNet model. Muhammad et al.[16] used transfer learning to fine-tune the GoogleNet convolutional neural network for fire detection. Muhammad et al.[17] proposed using the SqueezeNet model, which has smaller convolutional kernels than a traditional convolutional neural network, to identify fire. Saeed et al.[18] proposed a fire detection method based on machine learning and deep learning algorithms. Their proposed model has three main components: a hybrid model consisting of Adaboost and many multilayer perceptron (MLP) neural networks, the Adaboost local binary patterns (LBP) model, and a convolutional neural network. Compared with traditional computer vision based fire detection methods, fire detection based on convolutional neural network models has made certain progress and has better stability. However, traditional convolutional neural network models can only classify fire images and cannot accurately locate where the fire occurs. Therefore, building on the stable classification achieved by convolutional neural networks, deep learning based object detection algorithms have received more attention and have been extended to fire recognition and detection. Kim and Lee[19] used the two-step Faster R-CNN target detection algorithm for fire detection. Liau et al.[20] proposed replacing the back-end network of the SSD network with an efficient SqueezeNet network, and used residual connections and group convolution to extend the SqueezeNet-based SSD framework for fire target detection. Shen et al.[21] used an optimized YOLO model for fire detection. Du et al.[22] improved the candidate frame extraction and feature-level fusion algorithm of the end-to-end YOLOv2 model to detect fire targets. Ren et al.[23] used an improved YOLOv3 network model for fire classification and location regression. Although target detection algorithms based on deep learning make up for the problem that typical convolutional neural networks can only classify but cannot locate the fire position, the detection time may increase because the target detection model is complex and must handle both classification and position regression.

      Due to the special nature of fire, its detection needs to balance detection time and accuracy. Therefore, a fire detection algorithm based on the combination of depthwise separable convolution and YOLOv3 is proposed, taking both detection speed and detection accuracy into account. Firstly, depthwise separable convolution is used to classify the fire image, which can greatly reduce the detection time without losing detection accuracy. Then, the target regression function of YOLOv3 is used to output the fire position for the images whose classification result is fire. In real scenarios, the probability of fire is far lower than the probability of no fire, so running the regression function only on fire images saves a lot of time on images without fire. Compared with traditional video-based fire detection methods, the workload of manual feature extraction is reduced, and the detection accuracy is improved. Compared with the classic convolutional neural network, the method proposed in this paper can locate the fire target. In addition, the requirements on hardware are reduced, and the amount of calculation and the number of parameters are greatly reduced. Compared with the typical deep learning target detection algorithm, the detection accuracy and detection rate of this algorithm can meet the requirements of fire detection.

      The rest of the paper is organized as follows. Section 2 introduces the fire detection model. Section 3 gives the network training method for the fire detection model. Section 4 discusses the testing results. Section 5 concludes the paper with some suggestions for future work.

    • The proposed method detects fire from the input video in two stages. The first stage uses a classification model, a depthwise separable convolutional neural network (DS-CNN), to classify each input image as fire or non-fire. The second stage uses the target regression function of YOLOv3 to locate the fire in images classified as fire and outputs the position information; images classified as non-fire are output directly. The stages of the algorithm are shown in Fig. 1.

      Figure 1.  Fire detection algorithm
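
      The two-stage flow can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: `detect_fire`, the `classify` and `locate` callables, the box format and the 0.5 threshold are all hypothetical stand-ins for the DS-CNN classifier and the YOLOv3 regression stage.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def detect_fire(
    frame: object,
    classify: Callable[[object], float],
    locate: Callable[[object], List[Box]],
    threshold: float = 0.5,  # assumed decision threshold
) -> Optional[List[Box]]:
    """Classify first; run localization only on frames classified as fire.

    Returns None for a non-fire frame, otherwise the boxes from stage 2.
    """
    fire_probability = classify(frame)  # stage 1: stand-in for the DS-CNN
    if fire_probability < threshold:
        return None                     # non-fire: skip the costly stage 2
    return locate(frame)                # stage 2: stand-in for YOLOv3 regression


# Stand-in models, for illustration only.
calls = {"locate": 0}


def fake_classifier(frame):
    return 0.9 if frame == "fire" else 0.1


def fake_locator(frame):
    calls["locate"] += 1
    return [(10.0, 20.0, 50.0, 80.0)]


results = [detect_fire(f, fake_classifier, fake_locator)
           for f in ["fire", "sky", "sky"]]
```

      In this toy run the locator is called once for three frames, which mirrors the argument that most frames in a real scene contain no fire and never reach the second stage.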

    • In recent years, convolutional neural networks have made breakthrough progress in the field of image classification. Classic convolutional neural network models include the early LeNet model for digit recognition and models that won the ImageNet competition in 2012 and after, such as AlexNet, VGGNet, GoogleNet, and ResNet[24]. However, traditional convolutional neural networks usually use large convolution kernels, such as the 11×11 convolution kernels in AlexNet. The larger the convolution kernel is, the larger the receptive field will be, but the number of parameters of the model will also increase. Models after AlexNet have improved on this. For example, GoogleNet cascades multiple small 3×3 convolution kernels while keeping the original receptive field unchanged[25], which greatly reduces the number of parameters. However, as the depth of the network increases and the convolution kernel needs to act on each channel of the input image, the amount of calculation is still large. To address the large amount of calculation and the many parameters of the traditional convolutional neural network, the depthwise separable convolution was proposed in 2013[26]. Depthwise separable convolution is an improvement and innovation based on standard convolution. The core idea is to decompose the standard three-dimensional convolution into a two-dimensional convolution and a one-dimensional convolution.

      1) Depthwise separable convolution

      The basic idea of depthwise separable convolution is to decompose the standard convolution into a depthwise convolution and a pointwise convolution.

      Step 1. Depthwise convolution carries out a 2D convolution on each channel of the input image separately, which reduces the number of parameters.

      Step 2. Pointwise convolution is applied to the output of the depthwise convolution, using a 1×1 convolution kernel to combine all channels, which greatly reduces the amount of calculation. The difference between standard convolution and depthwise separable convolution in the convolution process is shown in Fig. 2. Fig. 2(a) is the standard convolution process. Figs. 2(b) and 2(c) correspond to the depthwise convolution and pointwise convolution of depthwise separable convolution.

      Figure 2.  Standard and depthwise separable convolution process
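
      The two steps can be illustrated with a small pure-Python sketch on nested lists. It is a didactic toy under stated assumptions, not the network code: the function names are hypothetical, there is no padding, the stride is 1, and, as is usual in deep learning libraries, the kernel is applied without flipping.

```python
def depthwise_conv(image, kernels):
    """Step 1: filter every channel with its own K x K kernel.

    image:   M channels, each an H x W list of lists
    kernels: M kernels, each a K x K list of lists
    No padding, stride 1, so each output channel is (H-K+1) x (W-K+1).
    """
    out = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [
                sum(channel[r + i][c + j] * kernel[i][j]
                    for i in range(k) for j in range(k))
                for c in range(w - k + 1)
            ]
            for r in range(h - k + 1)
        ])
    return out


def pointwise_conv(features, weights):
    """Step 2: mix the M channels at every pixel with N 1x1 kernels.

    features: M channels, each H x W
    weights:  N rows of M coefficients (one 1x1 kernel per output channel)
    """
    h, w = len(features[0]), len(features[0][0])
    return [
        [
            [sum(wm * features[m][r][c] for m, wm in enumerate(row))
             for c in range(w)]
            for r in range(h)
        ]
        for row in weights
    ]


# Toy input: M = 2 channels of 3x3, 2x2 kernels, N = 3 output channels.
image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each 2x2 window
    [[1, 1], [1, 1]],  # sums the whole 2x2 window
]
weights = [[1, 0], [0, 1], [1, 1]]

depthwise_out = depthwise_conv(image, kernels)
output = pointwise_conv(depthwise_out, weights)
```

      The depthwise step never mixes channels, and the pointwise step never looks at neighbouring pixels; together they cover what a standard convolution does in a single, more expensive operation.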

      Assume that the input feature map size is $D_f \times D_f \times M$, the output feature map size is $D_f \times D_f \times N$, and the convolution kernel size is $D_k \times D_k$. The following compares the number of parameters and the amount of calculation involved in the standard convolution and the depthwise separable convolution.

      2) Parameter amount

      The standard convolution parameter amount is

      $Pa{r_S} = {D_k} \times {D_k} \times M \times N.$


      The parameter amount of the depthwise separable convolution is

      $Pa{r_{D - P}} = {D_k} \times {D_k} \times M + M \times N.$
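
      As a quick numerical check of the two parameter formulas, the sketch below evaluates them for one layer. The layer sizes (3×3 kernel, 32 input channels, 64 output channels) are hypothetical values chosen for illustration and are not taken from the paper; bias terms are ignored, as in the formulas.

```python
def standard_conv_params(dk: int, m: int, n: int) -> int:
    """Par_S = Dk * Dk * M * N."""
    return dk * dk * m * n


def separable_conv_params(dk: int, m: int, n: int) -> int:
    """Par_{D-P} = Dk * Dk * M (depthwise) + M * N (pointwise)."""
    return dk * dk * m + m * n


dk, m, n = 3, 32, 64                      # hypothetical layer sizes
par_s = standard_conv_params(dk, m, n)    # 3*3*32*64 = 18432
par_dp = separable_conv_params(dk, m, n)  # 288 + 2048 = 2336
```

      For this example layer the separable form needs about 13% of the parameters of the standard form.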


      3) Calculation amount

      The calculation amount of the standard convolution $Ca{l_S}$ is as follows:

      $Ca{l_S} = {D_f} \times {D_f} \times M \times N \times {D_k} \times {D_k}.$


      The calculation amount of depthwise convolution in depthwise separable convolution is shown in (4), and the calculation amount of point-wise convolution is shown in (5). The total calculation amount is shown in (6).

      $Ca{l_D} = {D_f} \times {D_f} \times M \times {D_k} \times {D_k}$


      $Ca{l_P} = {D_f} \times {D_f} \times M \times N$


      $\begin{split}\quad\quad Ca{l_T} =\;& {D_f} \times {D_f} \times M \times {D_k} \times {D_k} + \\ &{D_f} \times {D_f} \times M \times N.\end{split}$


      The ratio of the calculation amount of the depthwise separable convolution to the standard convolution is

      $\begin{split}\frac{{Ca{l_T}}}{{Ca{l_S}}} =\;& \frac{{{D_f} \times {D_f} \times M \times {D_k} \times {D_k} + {D_f} \times {D_f} \times M \times N}}{{{D_f} \times {D_f} \times M \times N \times {D_k} \times {D_k}}} =\\ &\frac{1}{N} + \frac{1}{{{D_k} \times {D_k}}}.\end{split}$


      According to the calculation, the reduction in the calculation amount of the depthwise separable convolution is related to the size of the convolution kernel $D_k \times D_k$ and the number of output channels $N$. The neural network models used in practical applications usually have multiple convolutional layers, and the convolution kernels are usually 3×3 or larger. Thus, depthwise separable convolution can greatly reduce the number of parameters and calculations without losing accuracy, which also makes it possible to effectively apply deeper and wider neural network architectures. It can run normally even on resource-constrained microcontrollers. Consequently, in this paper, the depthwise separable convolutional neural network is chosen as the fire classification model.
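
      The dependence on kernel size and output channels can be checked numerically. The sketch below computes the cost of a separable convolution relative to a standard one and compares it with $1/N + 1/(D_k \times D_k)$. The feature map size and channel counts are hypothetical example values, not the paper's configuration.

```python
def standard_conv_cost(df: int, dk: int, m: int, n: int) -> int:
    """Cal_S = Df * Df * M * N * Dk * Dk multiplications."""
    return df * df * m * n * dk * dk


def separable_conv_cost(df: int, dk: int, m: int, n: int) -> int:
    """Cal_T = depthwise cost + pointwise cost."""
    return df * df * m * dk * dk + df * df * m * n


df, m, n = 64, 32, 64  # hypothetical feature map size and channel counts
ratios = {}
for dk in (3, 5, 7):
    ratio = separable_conv_cost(df, dk, m, n) / standard_conv_cost(df, dk, m, n)
    closed_form = 1 / n + 1 / (dk * dk)
    # The measured ratio matches the closed form up to rounding.
    assert abs(ratio - closed_form) < 1e-12
    ratios[dk] = ratio
```

      With these values the relative cost falls from about 0.13 for a 3×3 kernel to about 0.04 for a 7×7 kernel, so larger kernels benefit more from the decomposition.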

      4) Depthwise separable convolutional neural networks

      Based on depthwise separable convolutions, this paper proposes to use an end-to-end depthwise separable convolutional neural network to classify images with and without fire. The specific network structure is shown in Fig. 3. It mainly consists of 4 convolutional layers, 3 max pooling layers, 2 fully connected layers, and 1 softmax regression layer. The first 9 layers of the network are used for feature extraction, and the last layer is used for classification. In addition to the above main structure, a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function are placed between each convolutional layer and the pooling layer. A dropout layer with a probability of 0.5 is added between the fully connected layers.

      Figure 3.  DS-CNN fire classification structure

      The structure of the fire classification network model:

      Input: The network takes red-green-blue (RGB) image data as input, resizes the original image to 128×128×3, and standardizes each channel. The result is used as the input of the first convolutional layer.

      Standard convolution module: A 7×7 conventional convolution is used between the input layer and the output layer of the network, followed by the BN layer and the ReLU layer (Fig. 4(a)).

      Figure 4.  Convolution layer structure change

      Depthwise separable convolution module: It consists of a 7×7 depthwise convolution followed by a BN layer and a ReLU layer, and a 1×1 pointwise convolution followed by a BN layer and a ReLU layer. The stride of the pointwise convolution is 1 (Fig. 4(b)).

      Pooling module: With pooling layer parameters (2, 2), the feature map output by the depthwise separable convolution module is downsampled by half, so as to reduce the dimension of the fire features.

      Output: After the first depthwise separable convolution module and pooling module, a block consisting of a 3×3 depthwise convolution followed by a BN layer and a ReLU layer and a 1×1 pointwise convolution followed by a BN layer and a ReLU layer is repeated three times, with a pointwise convolution stride of 1, together with three downsampling operations with pooling parameters (2, 2). After the depthwise separable convolution and pooling operations, the two-dimensional fire feature maps are transformed into a one-dimensional vector by two fully connected layers. The number of nodes in the fully connected layer is 128. Finally, a softmax activation function is connected to output the classification of fire and non-fire.

    • Based on the classification results of fire data, the target regression of YOLOv3 is used to further locate the fire in the images that are classified as fire. YOLOv3 uses the Darknet-53 network structure. Darknet-53 introduces residual blocks into the network, which alleviates the gradient problem of deep networks and reduces the training difficulty. There is no pooling layer or fully connected layer in the entire network. The downsampling of the network is achieved by setting the convolution stride to 2[27]. In addition, YOLOv3 can realize multi-scale detection, which takes the form of upsampling and concatenation in the last several layers of network prediction. The small-scale feature maps provide richer and deeper semantic information, and the large-scale feature maps provide target location information more accurately. Combining small-scale feature maps with medium-scale and large-scale feature maps makes it possible to detect both large targets and small targets effectively[28]. YOLOv3 uses three feature maps of different scales to detect objects, which allows it to detect more fine-grained features. The final output of the network has three scales: 1/32, 1/16 and 1/8. The 1/32 prediction results have a high downsampling ratio and a large receptive field, so they are suitable for detecting large objects in the image. The 1/16 prediction results have a medium receptive field, which is suitable for detecting medium-scale objects. The 1/8 prediction results have the smallest receptive field, which is suitable for detecting small objects. The specific network structure is shown in Fig. 5. During a fire, the flame changes continuously: sometimes it is small and sometimes it is large. Therefore, YOLOv3 is chosen as the fire location model in this paper.

      Figure 5.  YOLOv3 network structure
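
      To make the three output scales concrete, the sketch below computes the prediction grid size at each downsampling ratio. The 416×416 input size is an assumption (a commonly used YOLOv3 input size), not a value stated in this paper, and `yolo_grid_sizes` is a hypothetical helper.

```python
def yolo_grid_sizes(input_size: int, strides=(32, 16, 8)) -> dict:
    """Return the side length of the prediction grid for each stride."""
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"input size {input_size} is not divisible by {stride}")
    return {stride: input_size // stride for stride in strides}


grids = yolo_grid_sizes(416)  # assumed input size
# Stride 32 gives the coarsest grid (large objects), stride 8 the finest (small objects).
```

      Under this assumption the grids are 13×13, 26×26 and 52×52 cells, so the 1/8 scale has sixteen times as many cells as the 1/32 scale and can resolve small flames.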

    • During the network training process, some parameters and algorithms need to be set and selected in advance. These include parameters such as the learning rate, the number of iterations, and the batch size, as well as the choice of data augmentation, loss function, and optimizer.

      1) Initialization parameters

      The initialization parameters mainly include the learning rate, the number of iterations, and the batch size. For the learning rate, the learning rate finder method is run for a certain number of iterations before formal training, and the result is drawn as a learning rate finder plot. The learning rate finder plot in this paper is shown in Fig. 6.

      Figure 6.  Learning rate finder plot

      It can be seen from Fig. 6 that the network starts to gain traction and learn between $10^{-5}$ and $10^{-4}$. The lowest loss can be found between $10^{-2}$ and $10^{-1}$. However, at $10^{-1}$, the loss starts to increase sharply, which indicates that the learning rate is too large for stable training. So, to keep training stable and guarantee the generalization ability, this paper uses an initial learning rate of $10^{-2}$. For the batch size, if it is too small, training will be difficult to converge. If it is too large, the relative processing speed will increase, but the required memory capacity will also increase. Therefore, this paper chooses a batch size of 64 and trains for 50 epochs over the entire training set.
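
      The idea behind a learning rate finder can be sketched as follows: sweep the learning rate on a logarithmic scale, record the loss at each value, and pick a rate before the loss blows up. The helper names, the sweep range and the loss values below are all made up for illustration; they are not the measurements behind Fig. 6, and choosing the rate at the minimum loss is only one simple heuristic.

```python
import math


def lr_sweep(start_lr: float, end_lr: float, num_steps: int) -> list:
    """Learning rates spaced evenly on a log scale, one per trial step."""
    log_start, log_end = math.log10(start_lr), math.log10(end_lr)
    step = (log_end - log_start) / (num_steps - 1)
    return [10 ** (log_start + i * step) for i in range(num_steps)]


def suggest_lr(lrs: list, losses: list) -> float:
    """Hypothetical heuristic: the rate at which the recorded loss is lowest."""
    best = min(range(len(losses)), key=lambda i: losses[i])
    return lrs[best]


lrs = lr_sweep(1e-6, 1.0, 7)  # 1e-6, 1e-5, ..., 1
# Synthetic losses: flat at first, falling, then diverging at large rates.
losses = [0.70, 0.69, 0.60, 0.40, 0.25, 0.90, 3.00]
suggested = suggest_lr(lrs, losses)
```

      In practice the loss is usually smoothed before the minimum is read off, and a value slightly below the minimum is often preferred to leave a safety margin.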

      2) Data augmentation

      Deep learning generally requires a sufficiently large number of samples. When the number of samples is sufficient, the trained model performs better and generalizes better. However, in practical applications, the number of samples is often insufficient due to various factors, which requires data augmentation for existing samples to increase the number of samples. Common methods for data augmentation include data flipping, rotation, image scaling, cropping, translation, adding noise, etc.[29] Data augmentation expands the amount of data. However, if the test samples do not exhibit the same kind of variation, it will not help and will only increase the training time. Therefore, according to how flames change during a fire, we use rotation, scaling, horizontal translation, vertical translation, cropping and horizontal flipping to augment the data. The rotation angle is controlled within 30°. The scaling is within 0.15. The horizontal and vertical translations are controlled within 0.2. And the cropping transformation is controlled within 0.15. During the data augmentation process, the original data are not modified. Instead, more similar and diverse data are obtained through image processing and other methods without taking up more storage space, because all processing is performed on-the-fly in memory.
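
      A minimal sketch of how such on-the-fly augmentation parameters might be drawn for each training image is given below. The ranges are the ones stated in the text, but several details are assumptions: the dictionary keys and function name are hypothetical, rotation and translation are taken to be symmetric around zero, scaling is taken as a factor around 1, and the last operation in the list is treated as a random horizontal flip.

```python
import random

# Ranges from the text; the key names are hypothetical.
AUGMENTATION_RANGES = {
    "rotation_deg": 30.0,  # rotate within +/- 30 degrees (assumed symmetric)
    "zoom": 0.15,          # scale factor within 1 +/- 0.15 (assumed)
    "shift_x": 0.2,        # horizontal shift as a fraction of the width
    "shift_y": 0.2,        # vertical shift as a fraction of the height
    "crop": 0.15,          # strength of the cropping transformation
}


def sample_augmentation(rng: random.Random) -> dict:
    """Draw one random set of transformation parameters for one image."""
    r = AUGMENTATION_RANGES
    return {
        "rotation_deg": rng.uniform(-r["rotation_deg"], r["rotation_deg"]),
        "zoom": rng.uniform(1 - r["zoom"], 1 + r["zoom"]),
        "shift_x": rng.uniform(-r["shift_x"], r["shift_x"]),
        "shift_y": rng.uniform(-r["shift_y"], r["shift_y"]),
        "crop": rng.uniform(0.0, r["crop"]),
        "flip_horizontal": rng.random() < 0.5,  # assumed 50% flip chance
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_augmentation(rng) for _ in range(1000)]
```

      Because new parameters are drawn every time an image is loaded, the network sees a slightly different version of each sample in every epoch while the files on disk stay unchanged.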

      3) Loss function

      The loss function is also called the cost function. It is a function of measuring the difference between the predicted value and the actual value of the output of the neural network. The loss function is often associated with optimization problems as a learning criterion. The commonly used loss functions are mean square error (MSE) loss function, binary cross entropy loss function, categorical cross entropy function. The MSE loss function is the most classic and simplest, but the accuracy is relatively poor. The binary cross entropy loss function is generally used for binary classification problems. The categorical cross entropy loss function is usually used in multiple classification cases. Since the fire classification is a binary classification problem, a binary cross entropy loss function is selected in this paper. The loss function expression is as follows:

      $Loss = - \sum\limits_{i = 1}^n {\left[ {{{\hat y}_i}\log {y_i} + \left( {1 - {{\hat y}_i}} \right)\log \left( {1 - {y_i}} \right)} \right]}$


      where n is the number of samples, ${\hat y_i}$ is the actual value (label) of sample $i$, and ${y_i}$ is the value predicted by the network. Differentiating the function with respect to ${y_i}$ gives the result shown in the following formula (9):

      $\frac{{\partial Loss}}{{\partial {y_i}}} = - \left( {\frac{{{{\hat y}_i}}}{{{y_i}}} - \frac{{1 - {{\hat y}_i}}}{{1 - {y_i}}}} \right).$


      When ${y_i} = {\hat y_i}$, the derivative is equal to 0 and the Loss reaches its minimum, which is 0 for labels of 0 or 1. In addition, Loss is non-negative, and the greater the difference between the predicted value and the actual value, the greater the value of Loss.
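
      The behaviour of the binary cross entropy loss can be checked with a short, self-contained sketch. The function names and the sample values are illustrative only. Labels are 0 or 1, predictions are probabilities, and the small `eps` clipping is an added safeguard against taking the logarithm of zero.

```python
import math


def binary_cross_entropy(labels, preds, eps=1e-12):
    """Sum over samples of -[t*log(p) + (1 - t)*log(1 - p)]."""
    total = 0.0
    for t, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total


def bce_gradient(label, pred):
    """Derivative of one sample's loss with respect to its prediction."""
    return -(label / pred - (1 - label) / (1 - pred))


# Predictions close to the labels give a small loss, poor ones a large loss.
good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.3])

# Finite-difference check of the analytic derivative at label 1, prediction 0.7.
h = 1e-6
numeric = (binary_cross_entropy([1], [0.7 + h])
           - binary_cross_entropy([1], [0.7 - h])) / (2 * h)
analytic = bce_gradient(1, 0.7)
```

      The derivative is negative when the prediction is below a label of 1, so a gradient descent step pushes the prediction upward, toward the label.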

      4) Optimizer

      The role of the optimizer is to compute and update the network parameters that affect model training and model output, such as the weights, so that they approach or reach the optimal values and thereby minimize the loss function. The most basic optimization algorithm is the gradient descent method. At present, the three main types of gradient descent methods are batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD). BGD calculates the gradient over the entire data set for each update, which requires a large amount of calculation and is very slow. For similar samples, BGD performs redundant gradient calculations. When the amount of data is large, the calculation amount becomes prohibitive, and new data cannot be added to update the model in real time. MBGD calculates the gradient on a small batch of samples at a time, and its convergence is stable. It can make full use of the highly optimized matrix operations in deep learning libraries to perform more efficient gradient calculations. But it also has shortcomings. On the one hand, convergence is very slow when the learning rate is too small. On the other hand, the loss function will keep oscillating around the minimum when the learning rate is too large. SGD selects only one sample for each update, which has no redundancy and is relatively fast. It can also incorporate new samples. So, the SGD optimizer is often applied at present. However, because SGD updates so frequently, the loss function oscillates severely. Therefore, SGD with momentum is used in this paper, where the momentum parameter is set to 0.9. Adding momentum to SGD mainly accelerates convergence, improves accuracy, and reduces oscillations during convergence. The parameter update expression is as follows:

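      ${\theta _i} = {\theta _i} - \eta \left( {{h_\theta }(x_0^{\left( j \right)},x_1^{\left( j \right)}, \cdots ,x_n^{\left( j \right)}) - {y_j}} \right)x_i^{\left( j \right)}$


      where $\theta $ is the model parameter, $\eta $ is the learning rate, $j$ is the index of the randomly selected sample, ${h_\theta }( \cdot )$ is the model output for that sample, and ${y_j}$ is its actual value, so that the bracketed error multiplied by $x_i^{\left( j \right)}$ gives the gradient direction for ${\theta _i}$.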
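
      The update above is the plain SGD step. One common way to add the momentum term is to keep a running velocity for each parameter, as in the sketch below. This is a generic formulation shown for illustration, not necessarily the exact variant used by the authors' framework; `momentum_sgd_step` and the toy objective are hypothetical, while the momentum of 0.9 and learning rate of 0.01 follow the values given in the text.

```python
def momentum_sgd_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then theta <- theta + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * theta[0]]
    theta, velocity = momentum_sgd_step(theta, grads, velocity)
```

      The velocity accumulates gradients that point in a consistent direction and averages out those that flip sign from step to step, which is how momentum damps the oscillations of plain SGD.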

    • In this paper, a depthwise separable convolutional neural network has been used to classify the input fire data. Next, only the data with fire information need to be located. In other words, it is not necessary to use YOLOv3 to perform classification prediction, but to use its positioning function to output the fire location when the classification is known. Therefore, we use the YOLOv3 model to locate the fire through transfer learning. The specific training steps are as follows:

      Step 1. Use labeling software to frame the fire sample data and process it into the data format required by the YOLOv3 model to generate a training set of fire images.

      Step 2. Modify and adjust the classification prediction function and configuration file parameters in the YOLOv3 model accordingly.

      Step 3. Use transfer learning to retrain the YOLOv3 model using our own labeled database.
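
      As an illustration of the format conversion in Step 1, the sketch below turns one pixel-coordinate bounding box into a label line in the Darknet text convention (class index followed by the box center, width, and height, all normalized by the image size). The paper does not state which label format or labeling tool it used, so the function name, the single class index 0 for fire, and the corner-based input box are assumptions.

```python
def to_yolo_line(box, image_width, image_height, class_id=0):
    """Convert (x_min, y_min, x_max, y_max) in pixels to a Darknet-style label.

    The output is "class x_center y_center width height", with all four
    box values normalized to the range [0, 1].
    """
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0 / image_width
    y_center = (y_min + y_max) / 2.0 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```

      For example, `to_yolo_line((100, 50, 300, 250), 400, 500)` describes a box centered at half the image width and three tenths of the image height. Some TensorFlow re-implementations of YOLOv3 expect a different annotation layout, in which case only the final formatting line would change.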

    • The experiments in this paper run on the Ubuntu 16.04 LTS operating system. The program is implemented in Python 3.6.6 with the TensorFlow 2.0 framework. The hardware platform uses an Intel(R) Core(TM) i7-4790 processor with a 3.6 GHz clock frequency and 15.6 GiB of memory.

    • The experiment uses two types of fire sample data for training and testing. The first data set consists of fire images collected from Google and Baidu. Each image captures a single moment of a fire, so the images have no time series relationship and show no gradual process, as shown in Fig. 7(a). The second data set is built from a public fire video set. The videos are divided into frames, and frames are selected at equal intervals, in order to establish a potential time series relationship within the data set that covers the development of a fire from small to large, as shown in Fig. 7(b). Both data sets include different scenarios such as indoor, outdoor, forest, road, and day and night. The negative sample set complements the fire scenes; it consists of scenes whose characteristics resemble those of fire and of fire-like disturbances. So that the effect of the data set on recognition accuracy can be compared, the two fire data sets each contain 1719 frames, and the negative sample set contains 2689 frames; 75% of the data are used for training and 25% for testing. During training, the model also applies data augmentation such as rotation, translation, and scaling to enrich the training samples. The loss and accuracy of training and testing are shown in Fig. 8.

      Figure 7.  Part of the fire data list

      Figure 8.  Training loss and accuracy
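
      The equal-interval frame selection and the 75%/25% split described above can be sketched as follows. The sampling step, the random shuffle before splitting, and the seed are assumptions made for illustration; the paper does not report the interval it used or how the split was drawn.

```python
import random


def sample_frame_indices(n_frames, step):
    """Pick every `step`-th frame index so the kept frames stay in time order."""
    return list(range(0, n_frames, step))


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of `items` and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

      With 1719 frames, this split gives 1289 training frames and 430 test frames. For video-derived frames, splitting by whole video rather than by frame would avoid near-identical neighbors landing on both sides of the split; whether the paper did this is not stated.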

    • The detection results obtained by using the classification model and the location model based on the classification model are shown in Figs. 9(a) and 9(b), respectively.

      Figure 9.  Test result
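
      The way the two models are combined at inference time can be sketched as a simple cascade: the classifier runs on every frame, and the locator runs only on frames flagged as fire. Here `classify` and `locate` are hypothetical callables standing in for the depthwise separable classifier and the YOLOv3 locator, and the 0.5 decision threshold is an assumption, not a value from the paper.

```python
def detect_fire(frame, classify, locate, threshold=0.5):
    """Run the locator only on frames that the classifier flags as fire.

    classify(frame) -> fire probability in [0, 1]   (hypothetical interface)
    locate(frame)   -> list of bounding boxes       (hypothetical interface)
    """
    fire_probability = classify(frame)
    if fire_probability < threshold:
        # Non-fire frames skip the more expensive localization step.
        return {"fire": False, "boxes": []}
    return {"fire": True, "boxes": locate(frame)}
```

      Because the locator is skipped for non-fire frames, the average cost per frame depends on how often fire is actually present in the stream.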

      In order to prove the effectiveness of the algorithm, the experiment makes a comparative analysis from the following aspects:

      1) In the same experimental environment, the test results of two different fire data sets using depthwise separable convolutional neural networks are compared and analyzed, as shown in Table 1.

      Data sets                      Accuracy    False detection rate
      No time series relationship    94.0%       6%
      Time series relationship       98.0%       2%

      Table 1.  Comparison of the detection results of different fire data sets

      Table 1 shows that the fire data set with a time series relationship yields higher detection accuracy and a lower false detection rate than the fire data set without one.

      2) Under the same experimental environment and network structure, the fire data set with a time series relationship is used to compare the detection accuracy and detection rate of the standard convolutional neural network (CNN) and the depthwise separable convolutional neural network (DS-CNN), as shown in Table 2.

      Network structure    Accuracy    Detection rate
      CNN                  97.8%       20 fps
      DS-CNN               98.0%       50 fps

      Table 2.  Comparison of the detection results of CNN and DS-CNN

      In Table 2, the detection accuracy of the two network structures does not differ significantly, but the detection rate rises from 20 fps to 50 fps. This shows that the depthwise separable convolutional neural network can greatly improve the detection rate while maintaining the detection accuracy.
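
      The source of this speed-up can be illustrated by counting weights and multiplications for a single layer. A standard convolution with a $k \times k$ kernel, $M$ input channels, and $N$ output channels has $k^2MN$ weights, whereas a depthwise separable convolution has $k^2M + MN$, a ratio of $1/N + 1/k^2$. The sketch below evaluates both; the layer sizes in the usage note are illustrative and are not taken from the network in this paper, and biases are ignored.

```python
def standard_conv_cost(kernel, in_channels, out_channels, out_height, out_width):
    """Weights and multiplications of a standard convolution layer."""
    params = kernel * kernel * in_channels * out_channels
    mults = params * out_height * out_width
    return params, mults


def separable_conv_cost(kernel, in_channels, out_channels, out_height, out_width):
    """Weights and multiplications of a depthwise plus 1x1 pointwise layer."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per channel
    pointwise = in_channels * out_channels     # 1 x 1 convolution mixes channels
    params = depthwise + pointwise
    mults = params * out_height * out_width
    return params, mults
```

      For a 3 × 3 kernel with 64 input and 128 output channels, the standard layer has 73728 weights and the separable layer has 8768, about 12% as many. The measured frame rate depends on more than this count, so the ratio should be read as an indication rather than a prediction of the fps figures in Table 2.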

      3) The fire detection algorithm of this paper is compared with those of the related literature, as shown in Table 3.

      Related literature    Accuracy    Detection rate
      This paper            98%         38 fps
      Literature [14]       90%         20 fps
      Literature [18]       96.93%      -
      Literature [21]       98.8%       40 fps

      Table 3.  Comparison of the detection algorithms of related literature and our algorithm

      Table 3 shows that the algorithm proposed in this paper achieves good results in both accuracy and detection rate. Both are higher than those of [14], the accuracy is higher than that of [18] (for which Table 3 lists no detection rate), and both are slightly lower than those of [21]. Li et al.[14] classified fire images with AlexNet, a classic convolutional neural network structure, but did not locate the fire position. Because the AlexNet model has larger convolution kernels and more layers than the algorithm in this paper, its detection accuracy and detection rate are significantly lower. Saeed et al.[18] used the two-stage Faster R-CNN object detection structure to classify and locate fire images. Faster R-CNN uses a region proposal network (RPN) instead of selective search to generate candidate boxes, which improves detection accuracy and detection rate, but it still struggles to meet the requirements of real-time detection. Shen et al.[21] improved the detection accuracy and detection rate by improving the clustering method of the YOLOv2 network structure and by fusing shallow and deep features. In contrast, the algorithm in this paper does not change the network structure. By combining the classification model with the location model to classify and locate the fire, it not only improves the detection accuracy but also increases the detection rate.

    • Fire prevention and real-time detection are of great significance for protecting people's property, forest vegetation, chemical equipment, etc. Many experts and scholars continue to improve fire detection algorithms to meet the detection requirements of real environments. In recent years, research on fire detection has gradually expanded from traditional feature extraction algorithms to deep learning and has achieved promising results. Following this trend, this paper uses a 10-layer depthwise separable convolutional neural network as the classification model for fire images, which greatly reduces the amount of calculation and the number of parameters. Then, on the basis of the classification result, it uses only the target regression function of YOLOv3 to locate the fire. The proposed method both maintains the accuracy of the algorithm and meets the real-time requirements of fire detection: the detection accuracy is 98% and the detection rate is 38 fps. It also applies well to the detection of different scenes. Future work will focus on further simplifying the model structure and on incorporating the temporal information of the video sequence into the network structure.

    • This work was supported by Liaoning Provincial Science Public Welfare Research Fund Project (No. 2016002006), and Liaoning Provincial Department of Education Scientific Research Service Local Project (No. L201708).



