-
Single image super-resolution (SISR) aims to obtain high-resolution (HR) images from a low-resolution (LR) image. It has practical applications in many real-world problems, where certain restrictions present in image or video such as bandwidth, pixel size, scene details, and other factors. Since multiple solutions exist for a given input LR image, SISR is to solve an ill-posed inverse problem. There are various techniques to solve an SISR problem, which can be classified into three categories, i.e., interpolation-based, reconstruction-based, and example-based methods. The interpolation-based methods are quite straightforward, but they can not provide any additional information for reconstruction and therefore the lost frequency cannot be restored. Reconstruction-based methods usually introduce certain knowledge priors or constraints in an inverse reconstruction problem. The representative priors can be local structure similarity, non-local means, or edge priors. Example-based methods attempt to reconstruct the prior knowledge from a massive amount of internal or external LR-HR patch pairs, in which deep learning techniques have shined new light on SISR.
This survey focuses mainly on deep learning-based methods and aims to provide a comprehensive introduction to the field of SISR.
The remainder of this paper is organized as follows: Section 2 provides the background and covers different types of example-based SISR algorithms, followed by recent advances in deep learning related models in Section 3. Section 4 compares convolutional neural networks (CNN)-based SISR algorithms. Section 5 presents in-depth discussions, followed by open questions for future research in Section 6. Finally, the paper is concluded in Section 7.
-
Example-based algorithms aim to enhance the resolution of LR images by learning from other LR-HR patch pair examples. The relationship between LR and HR was applied to an unobserved LR image to recover the most likely HR version. Example-based methods can be classified into two types: internal learning and external learning-based methods.
-
The natural image has a self-similarity property, which tends to recur many times within both the same scale or across different scales inside the image.
To determine the similarity, Glasner et al.[1] made a test by comparing the original image and multiple cascades of images of decreasing resolutions. A scale-space pyramid was constructed to exploit the self-similarity in given LR image, which was then used to impose a set of constraints on the unknown HR image, as shown in Fig. 1[1]. Since the dictionary is limited on the given LR-HR patch pairs, Huang et al.[2] extended the search space to both planar perspectives and affine transforms of patches to exploit abundant feature similarity. However, the most important limitation lies in the fact that self-similarity based methods lead to high complexity of computation due to huge numbers of searching and the accuracy of algorithms is highly variant according to natural properties of images.
Figure 1. Pyramid model[1] for SISR. From the bottom, when a similar patch found in a down-scale patch (yellow at level I–2), its parent (yellow at level I0) is copied to an unknown HR image with an appropriate gap in scale and support of different kernels. Color versions of the figures in this paper are available online.
-
The external learning-based methods attempt to search the similar information from other images or patches instead. It was first introduced to estimate an underlying scene X with the given image data Y [3]. The algorithm aimed to learn the posterior probability
$ P(X|Y) = \dfrac{1}{P(Y)}P(X, Y) $ , by adding image patches X and its corresponding scenes Y as nodes in a Markov network. It was then applied for generating super-resolution images, where the input image is LR and the scene to be estimated is replaced by an HR image[4].Locally linear embedding (LLE) is one of the manifold learning algorithms, based on the idea that the high dimensionality may be represented as a function of a few underlying parameters. LLE begins by finding a set of nearest neighbors of each point that can best describe that point as a linear combination of its neighbors. It is then determined to find the low-dimensional embedding of points, such that each point is still represented by the same linear combination of its neighbors. However, one of the disadvantages is that LLE handles non-uniform sample density poorly because the feature represented by the weights varied according to regions in sample densities. The concept of LLE was also applied in SISR neighbor embedding[5], where the features are learned in the LR space before being applied to estimate HR images. There were several other studies based on local linear regression such as: ridge regression[6], anchored neighborhood regression[7, 8], random forest[9], and manifold embedding[10].
Another group of algorithms that has received attention is sparsity-based methods. In the sparse representation theory, the data or images can be described as a linear combination of sparse elements chosen from an appropriately over-complete dictionary. Let
$ D \in {\bf R}^{n \times K} $ be an over-complete dictionary ($ K\gg n $ ), we can build a dictionary for most scenarios of inputs and then any new image (patch)$ X \in {\bf R}^n $ can be represented as$ X = D \times \alpha $ , where$ \alpha $ is a set of sparse coefficients. Hence, there were dictionary learning problems and sparse coding problems to optimize D and$ \alpha $ , respectively. The objective function for standard sparse coding is$ \arg\min\limits_{\substack{D}}\sum\limits_{i = 1}^{N} \arg\min\limits_{\substack{\alpha_i}} \frac{1}{2}\parallel{x_i - D \alpha_i\parallel^2} + \lambda\Vert \alpha_i\Vert . $
(1) Unlike standard sparse coding, the SISR sparsity-based method works with two dictionaries to learn the compact representation for these patch pairs. Assuming that the observed low-resolution image Y is blurred and a down-sampled version of the high-resolution X:
$ Y = S\cdot H \cdot X $
(2) where H represents a blurring filter and S the down-sampling operation. Under mild conditions, the sparest
$ \alpha_0 $ can be unique for both dictionaries because the dictionary is over-complete or very large. Hence, the joint sparse coding can be represented as$\begin{split}& \arg\min\limits_{\substack{D_x, D_y}} \sum\limits_{i = 1}^{N} \arg\min\limits_{\substack{\alpha_i}} \frac{1}{2}\parallel{x_i - D_x \alpha_i\parallel^2}+\\ &\quad\quad\quad\quad \frac{1}{2}\parallel{y_i - D_y \alpha_i\parallel^2} + \lambda\Vert \alpha_i\Vert . \end{split}$
(3) The two dictionaries of high-resolution
$ D_h $ and low-resolution$ D_l $ are co-trained to find the compact coefficients$ \alpha_h = \alpha_l = \alpha $ [11], such that sparse representation of a high-resolution patch is the same as the sparse representation of the corresponding low-resolution patch. A dictionary$ D_l $ was first trained to best fit the LR patches, then the$ D_h $ dictionary was trained that worked best with$ \alpha_l $ . When these steps were completed,$ \alpha_l $ was then used to recover a high-resolution image based on the high-resolution dictionary$ D_h $ .One of the major drawbacks of this method is that the two dictionaries are not always linearly connected. Another problem is that HR images are unknown in the testing phase, hence the equivalence constraint on the HR sparse representation does not guarantee as it has been done in the training phase. Yang et al.[12] suggested a coupled dictionary learning process to pose constraints for two spaces of LR and HR. The main disadvantage of this method is that both dictionaries are assumed to be strictly aligned to achieve alignment between
$ \alpha_h $ and$ \alpha_l $ or the simplifying assumption of$ \alpha_h $ =$ \alpha_l $ . To avoid this invariance assumption, Peleg and Elad[13] connect$ \alpha_h $ ,$ \alpha_l $ via a statistical parametric model. Wang et al.[14] proposed semi-couple dictionary learning, in which two dictionaries are not fully coupled. It was based on an assumption that there exists a mapping in sparse domain$ f(\cdot) $ :$ \alpha_l $ $ \to $ $ \alpha_h $ or$ \alpha_h = f(\alpha_l) $ . Therefore, the objective function has one additional error term$ \Vert \alpha_h - f(\alpha_l) \Vert^2 $ and other regularization terms. Beta process joint dictionary learning was proposed in [15], which enables the decomposition of these sparse coefficients to the element multiplication of dictionary atom indicators and coefficient values, providing the much needed flexibility to fit each feature space. Finally, sparsity-based algorithms have remaining limitations in feature extraction and mapping, which are not always adaptive or optimal for generating HR images. -
The convolutional neural networks (CNNs) have been developed rapidly in the last two decades. The first CNN model to solve the SISR problems is introduced by Dong et al.[16, 17], named super-resolution convolutional neural network (SRCNN). Given a training set of LR and corresponding HR images
$ {x^i,\;y^i} $ ,$i=1\cdots N $ , the objective is to find an optimal model f, which will then be applied to accurately predict Y = f(X) on unobserved examples X. The SRCNN consists of the following steps, as shown in Fig. 2[16]:1) Preprocessing: Upscale the LR image to desired HR image using bicubic interpolation.
2) Feature extraction: Extract a set of feature maps from the upscaled LR image.
3) Non-linear mapping: Maps the features between LR and HR patches.
4) Reconstruction: Produce the HR image from HR patches.
Interestingly, although only three layers have been used, the result significantly outperforms those non-deep learning algorithms discussed previously. However, it seems possible that the accuracy cannot be improved further based on this simple model. This led to the question of whether “the deeper the better” is or is not the case in super resolution (SR). Inspired by the success of very deep networks, Kim et al.[18, 19] proposed two models named very deep convolutional networks (VDSR)[18] and deeply recursive convolutional network (DRCN)[19], which both stack 20 convolutional layers, as shown in Figs. 3 (a) and 3 (b). The VDSR is trained with a very high learning rate (
$ 10^{-1} $ instead of$ 10^{-4} $ in SRCNN) in order to accelerate the convergence speed and whilst gradient clipping was used to control the explosion problem.Figure 3. VDSR, DRCN, DRRN model for SISR. The same color of yellow or orange indicates the sharing parameters.
Instead of predicting the whole image as was done in SRCNN, residual connection was used to force the model to learn the difference between inputs and outputs. The zeros were padding at borders to avoid the problem of quickly reducing feature maps through deep networks. In order to gain more benefits from residual learning, Tai et al.[20] used both global residual connections and local residual connections in deeply recursive residual networks (DRRN). The global residual learning is used in the identity branch and recursive learning in the local residual branch, as illustrated in Fig. 3(c). Mao et al.[21] proposed a 30-layer convolutional auto-encoder network, namely the residual encoder-decoder network (RED30). The convolutional layers work as a feature extractor and encode image content, while the de-convolutional layers decode and recover image details. Unlike other methods as mentioned above, the encoder reduces the feature map to encode the most important features. By doing it in this way, noise/corruption can be efficiently eliminated. Hence, this model has completed extended tests on several tasks of image restoration such as image de-noising, JPEG de-blocking, non-blind de-blurring and image in-painting[21].
Recent advances in CNN architecture such as DenseNet, Network in Network, and Residual Network have been exploited for SISR applications[22, 23]. Among them, Residual Channel Attention Network (RCAN) and SRCliqueNet have recently been the-state-of-the-art (up to 2018) in terms of pixel-wise measurement, as shown in Table 2, Section 4.
Channel attention. Each of the learned filters operates with a local receptive field and the interdependence between channels is entangled with spatial correlation. Therefore, the transformation output is unable to exploit information such as the interrelationship between channels outside the region. The RCAN[24] has been the deepest model (about 400 layers) for the SISR task. It integrated a channel attention mechanism inside the residual block, as shown in Fig. 4[24]: The input with shape of a H×W×C is squeezed into the channel descriptor by averaging through a spatial dimension of H×W to generate the output shape of 1×1×C. This channel descriptor is put through gate activation of sigmoid f and element-wise product with the input in order to control how much information from each channel is passed up to the next layer in the hierarchy.
Figure 4. Channel attention block[24]
Joint sub-band learning with clique structure – SRCliqueNet[25]. CliqueNet is newly proposed convolutional network architecture where any pair of layers in the same block are connected bilaterally, as shown in Fig. 5.
Figure 5. Clique block with two stages updated. Four layers 1, 2, 3, 4 in blocks are stacked in the order of 1, 2, 3, 4, 1, 2, 3, 4 and bilaterally connected by the residual shortcut. It has more skip connection compared with the Densenet block.
The Clique block encourages the features to be refined, which provides more discrimination and leads to a better performance. Zhong et al.[25] proposed Super-Resolution CliqueNet, which applied this architecture to jointly learned wavelet sub-band in both the feature extraction stage and sub-band refinement stage.
Concatenation for feature fusion rather than summation – RDN[26]. As the model goes deeper, the feature in each layer would be hierarchical with different receptive fields. The information from each layer may not be fully used by recent methods. Zhang et al.[26] proposed concatenated operations on the DenseNet to build hierarchical features from all layers, as shown in Fig. 6.
Figure 6. Residual dense block[26]. All previous feature are concatenated to build hierarchical features.
Wide activation in residual block – Wide-activated deep super-resolution network (WDSR)[27]. The efficiency and higher accuracy image resolution can be achieved with fewer parameters than that of enhanced deep super-resolution network (EDSR) by expanding the number of channels by a factor of
$ \sqrt{r} $ before rectified linear unit (RELU) activation in residual blocks. As such, the residual identity mapping path slimmed as a factor of$ \sqrt{r} $ to maintain constant output channels.Cascading residuals to incorporate the features from multiple layers – Cascading Residual Network (CARN)[28]. The most interesting finding was that there are similar mechanisms in MemNet (Section 3.2), RDN and CARN models. In addition to the ResNet architecture, they all use 1 × 1 convolution as a fusion module to incorporate multiple features from previous layers. Their results boost the performance effectively and can be considered in model design.
Information distillation network – IDN[29]. The IDN model uses the distillation block, which combines an enhancement unit with a compression unit. In this block, the information is distilled inside the block before it passes to the next level.
When we use neural networks to generate images, it usually involves up-sampling from low resolution to high resolution. One of the problems with the use of interpolation-based methods is that it is predefined and there is nothing that the network can learn about. This method is also being criticized for high computational complexity while computing in HR space without additional information. On the other hand, transposed convolution and PixelShuffle concepts have learnable parameters for optimally up-sampling the input. It provides flexible up-sampling and can be inserted at any place in the architecture. Lai et al.[30] proposed Laplacian Pyramid super-resolution networks (Lap-SRN) to reconstruct the image progressively. In general, the Laplacian Pyramid scheme decomposes an image as a series of high-pass bands and low-pass bands. At each level of reconstruction, a transposed convolution was used to up-sample the image in both the high-pass branch and low-pass branch. Beside the Laplace decomposition, Wavelet transform (WT) has been shown to be an efficient and highly intuitive tool to represent and store images in a multi-resolution way. WT can describe the contextual and textural information of an image at different scales. WT for super-resolution has been applied successfully to the multi-frame SR problem. However, conventional discrete wavelet transformation reduces the image size by a factor of 2n, which is inconvenient when testing images are of a certain size. It is proposed by Asamwar et al.[31] to reduce the image to any (variable scale) size, using discrete wavelet transformation.
For comparison, most SISR algorithms have been performed on the LR image, which was downsampled with scaling factors of 2x, 3x, 4x from the HR image. Otherwise, features available in the LR space have not sufficed for learning. It is suggested that a training model for high upscaling factors can benefit from the pre-trained model on lower upscaling factors[32]. In other words, it can be described as transfer learning. Wang et al.[33] proposed a progressive asymmetric pyramidal structure to adapt with multiple upscaling factors and up to a large scaling factor of 8x. Also, a deep back projection network[34] using mutually connected up-sampling and down-sampling stages has been used for reaching such high up-scaling factors. These experiments support recommendations to use progressive up-sampling or iterative up and down-sampling when reconstructing SR images under larger scaling factors.
When assuming a low-resolution image is downsampled from the corresponding high-resolution image, CNN-based methods ignored the true degradation such as noise in real world applications. Zhang et al.[35] proposed super-resolution multiple degradation (SRMD) training on LR images, synthesizing with three kinds of degradations: a blur kernel, bicubicly downsampling followed by additive white Gaussian noise (AWGN). Obviously, to learn invariant features, this model had to use large training datasets of approximate 6 000 images. Shocher et al.[36] observed strong internal data repetition in the natural images, which is similar to that in [1]. The information for tiny objects, for example, is better to be found inside the image, other than in any external database of examples. A "Zero Shot" SR (ZSSR)[36] was then proposed without relying on any prior image examples or prior training. It exploits cross-scale internal recurrence of image-specific information, where the test image itself is trained before being fed again to the resulting trained network. Because little research has been focused on variant degradations of SISR, more evaluations and comparisons are required and further investigations would be of great help.
-
A ResNet with weight sharing can be interpreted as an unrolled single-state recurrent neural network (RNN)[37]. A dual-state recurrent network (DSRN)[38] allows that both the LR path and HR path caption information at different spaces and are connected at every step in order to contribute jointly to the learning process, as shown in Fig. 7[38]. However, the average of all recovered SR images at each stage may have a deteriorated result. Another reason is that the down-sampling operation at every stage can lead to information loss at the final reconstruction layer.
Figure 7. Dual state model[38]. The top branch operates on the HR space, where the bottom branch works on the LR space. A connection from LR to HR using de-convolution operation; a delayed feedback mechanism is to connect previous predicted HR to LR at the next stage.
In the view of memory in RNNs, CNNs can be interpreted as: short-term memory. The conventional plain CNNs adopts a single path feed-forward architecture, in which a latter feature is influenced by a previous state. Limited long-term memory: When the skip connection is introduced, one state is influenced by a previous state and specific point prior state. To enable the latter state to see more prior states and decide whether the information should be kept or discarded, Tai et al.[39] proposed a memory network (MemNet), which uses recursive layers followed by a memory unit to allow the combination of short and long-term memory for image reconstruction, as shown in Fig. 8[39]. In this model, a gate unit controls information from the prior recursive units, which extracts features at different levels.
Figure 8. Memory block in MemNet[39] includes multiple recursive units and a gate unit MemNet model
Unlike convolutional operations, which capture features by repeatedly processing local neighborhoods of pixels, the non-local operation describes a pixel as a combination of weighted distance to all other pixels, regardless of their positional distance or channels. Non-local means to provide an efficient procedure for image noise reduction; however, the local and non-local based methods are treated separately, thereby not taking account of their advantages. The non-local block was introduced in [40], enabling integrate non-local operation into end-to-end training with local operation based models such as CNNs. Each pixel at point
$ i $ in an image can be described as$ y_i = \frac{1}{C(x)}\sum\limits_{j \in \varOmega}^{}f(x_i, x_j)g(x_j) $
(4) where
$ f(x_i, x_j) ={\rm e}^{\Theta(x_i)^{\rm T}\varphi(x_j)} $ is a weighted function, measuring how closely related the image at point$ i $ is to the image at point$ j $ . Thus, by choosing$ \Theta(x_i) = W_\Theta x_i $ ,$ \varphi(x_j) = W_\varphi x_j $ and$ g(x_j) = W_g x_j $ , the self-similarity can be jointly learned in embedding the space by following blocks, as shown in Fig. 9[40].Figure 9. A non-local block[40]
For SISR tasks, Liu et al.[41] incorporated this model into the RNN network by maintaining two paths: a regular path, that contains convolution operations on image, and the other path that maintains non-local information at each step as input branches in the regular RNNs structure. However, non-local means it has disadvantage that remarkable denoising results are obtained at a high expense of computational cost due to the enormous amount of weighting computations.
-
Generative adversarial network (GAN) was first introduced in [42], targeting the minimax game between a discriminative network D and a generative network G. The generative network G takes the input z
$ \sim $ p(z) as a form of random noise, then outputs new data G(z), whose distribution$ p_g $ is supposed to be close to that of the data distribution$ p_{\rm data} $ . The task of the discriminative network D is to distinguish a generated sample G(z)$ \sim $ $ p_g $ (G(z)) and the ground truth data sample x$ \sim $ $ p_{\rm data} $ (x). In other words, the discriminative network determines whether the given images are natural-looking images or they look like artificial created images. As the models are trained through alternative optimization, both networks are improved until they reach a point called Nash Equilibrium that fake images are indistinguishable from real images. The objective function is represented as$\begin{split} &\min\limits_{G} \max\limits_{D} E_{x \sim p_{data}} [\log D(x)] + E_{z \sim p_z} [\log(1-D(G(z)))]=\\ &\quad \min\limits_{G} \max\limits_{D} E_{x \sim p_{data}} [\log D(x)] + E_{x \sim p_z} [\log(1-D(x))].\end{split}$
(5) This concept is consistent with the problem solving in image super resolution. Ledig et al.[43] introduced the super-resolution generative adversarial network (SRGAN) model, of which a generative network upsamples LR images to super resolution (SR) images and the discriminative network is to distinguish the ground truth HR images and SR images. A pixel-wise quality assessment metric has been critical of showing poorly to human perception. By incorporating newly adversarial loss, the GAN-based algorithms have solved the problem and produced highly perceptive, naturalistic images, as can be seen from Fig. 10[43].
Figure 10. From left to right, image is reconstructed by bicubic interpolation, deep residual network (SRResNet) measured by MSE, SRGAN optimize more sensitive to human perception, and original image. Corresponding PSNR and SSIM are provided on top. The zoom of red rectangles are shown at right bottom.
The GAN-based SISR model has been developed further in [44, 45], which has resulted in an improved SRGAN by fusion of pixel-wise loss, perceptual loss, and newly proposed texture transfer loss. Park et al.[46] proposed SRFeat and employed an additional discriminator in the feature domain. The generator is trained through two phases: pre-training and adversarial training. In the pre-training phase, the generator is trained to obtain high PSNR by minimizing MSE loss. The training procedure focuses on improving perceptual quality using perceptual similarity loss (Section 5.2.2), GAN loss in pixel domain and GAN loss in feature domain. Perhaps the most serious disadvantage of GAN-based SISR methods is difficulties in the training models, which will be further discussed in Section 5.2.
-
In order to provide a brief overview of the current performance of deep learning-based SISR algorithms, we compare some recent work in Tables 1 and 2. Two image quality metrics have been used for performance evaluation: A peak signal-to-noise ratio (PSNR) and a structural SIMlarity (SSIM) index. The higher the PSNR and SSIM, the better quality of the image being reconstructed. The PSNR can be described as
Scale Set5 PSNR/SSIM Set14 PSNR/SSIM B100 PSNR/SSIM Urban100 PSNR/SSIM SRCNN 2 36.66/0.954 2 32.45/0.906 7 – – 3 32.75/0.909 0 29.30/0.821 5 – – 4 30.49/0.862 8 27.50/0.751 3 – – VDSR 2 37.53/0.958 7 33.03/0.912 4 31.90/0.896 0 30.76/0.914 0 3 33.66/0.921 3 29.77/0.831 4 28.82/0.797 6 27.14/0.827 9 4 31.35/0.883 8 28.01/0.767 4 27.29/0.725 1 25.18/0.752 4 DRCN 2 37.63/0.958 8 33.04/0.911 8 31.85/0.894 2 30.75/0.913 3 3 33.82/0.922 6 29.76/0.831 1 28.80/0.796 3 27.15/0.827 6 4 31.53/0.885 4 28.02/0.767 0 27.23/0.723 3 25.14/0.751 0 DRRN 2 37.74/0.959 1 33.23/0.913 6 32.05/0.897 3 31.23/0.918 8 3 34.03/0.924 4 29.96/0.834 9 28.95/0.800 4 27.53/0.837 8 4 31.68/0.888 0 28.21/0.772 0 25.44/076 34 25.44/0.763 8 RED30 2 37.66/0.959 9 32.94/0.914 4 – – 3 33.82/0.923 0 29.61/0.834 1 – – 4 31.51/0.886 9 27.86/0.771 8 – – MemNet 2 37.78/0.959 7 33.28/0.914 2 32.08/0.897 8 31.31/0.919 5 3 34.09/0.924 8 30.00/0.835 0 28.96/0.800 1 27.56/0.837 6 4 31.74/0.889 3 28.26/0.772 3 27.40/0.728 1 25.50/0.763 0 LapSRN 2 37.52/0.959 0 33.08/0.913 0 31.80/0.895 0 30.41/0.910 0 3 – – – – 4 31.54/0.885 0 28.19/0.772 0 27.32/0.728 0 25.21/0.756 0 Zero Shot 2 37.37/0.957 0 33.00/0.910 8 – – 3 33.42/0.918 8 29.800.830 4 – – 4 31.13/0.879 6 28.01/0.765 1 – – EDSR 2 38.20/0.960 6 34.02/0.920 4 32.37/0.901 8 33.10/0.936 3 3 34.77/0.929 0 30.66/0.848 1 29.32/0.810 4 29.02/0.868 5 4 32.62/0.898 4 28.94/0.790 1 27.79/0.743 7 26.86/0.808 0 IDN 2 37.83/0.960 0 33.30/0.914 8 32.08/0.898 5 31.27/0.919 6 3 34.11/0.925 3 29.99/0.835 4 28.95/0.801 3 27.42/0.835 9 4 31.82/0.890 3 28.25/0.773 0 27.41/0.729 7 25.41/0.763 2 CARN 2 37.76/0.959 0 33.52/0.916 6 32.09/0.897 8 31.92/0.925 6 3 34.29/0.925 5 30.29/0.840 7 29.06/0.803 4 28.06/0.849 3 4 32.13/0.893 7 28.60/0.780 6 27.58/0.734 9 26.07/0.783 7 RDN 2 38.30/0.961 6 34.10/0.921 8 32.40/0.902 2 33.09/0.936 8 3 34.78/0.930 0 30.67/0.848 2 29.33/0.810 5 29.00/0.868 3 4 32.61/0.900 3 28.92/0.789 3 26.82/0.806 9 26.82/0.806 9 SRCliqueNet+ 2 38.28/0.963 0 34.03/0.924 0 32.40/0.906 0 32.95/0.937 0 3 – – – – 4 32.67/0.903 0 28.95/0.797 0 27.81/0.752 0 26.80/0.810 0 RCAN+ 2 38.27/0.961 4 34.23/0.922 5 32.46/0.903 1 33.54/0.939 9 3 34.85/0.930 5 30.76/0.849 4 29.39/0.812 2 29.31/0.873 6 4 32.73/0.901 3 28.98/0.791 0 27.85/0.745 5 27.10/0.814 2 Table 2. Quantitative evaluation of the-state-of-the-art SR algorithm. Average PSNR/SSIM for scale factor 2x, 3x, 4x. Red text indicates that the best and blue text indicates the second best performance.
Models Input Type of network Number of params Mult-adds Reconstructions Train data Loss function SRCNN LR + Bicubic Supervised 8 K 52.7 G Direct Yang91 L2(MSE) VDSR LR + Bicubic Supervised 666 K 612 G Direct G200+Yang91 L2 DRCN LR + Bicubic Supervised 1, 775 K 17 974 G Direct Yang91 L2 DRRN LR + Bicubic Supervised 297 K 6 796 G Direct G200+Yang91 L2 RED30 LR + Bicubic Supervised 4, 2 M – Direct BSD300 L2 LapSRN LR Supervised 812 K 29.9 G Progressive G200+Yang91 Charbonnie MemNet LR + Bicubic Supervised 677 K 2 662 G Direct G200+Yang91 L2 Zero-Shot LR + Bicubic Unsupervised 225 K – Direct – L1(MAE) Dual State LR + Bicubic Supervised 1, 2 M – Progressive Yang91 L2 SRGAN LR Supervised – – Direct ImageNet L2 + Perceptual loss EDSR LR Supervised 43 M 2 890 G Direct DIV2K L1 IDN LR Supervised 677 K – Direct G200+Yang91 L1 CARN LR Supervised 1, 6 M 222 G Direct DIV2K+Yang91+B200 L1 RDN LR Supervised 22.6 M 1 300 G Direct DIV2K L1 SRCliqueNet+ LR Supervised – – Direct DIV2K+Flickr L1 + L2 RCAN+ LR Supervised 16 M – Direct DIV2K L1 Table 1. Comparison of different SISR models
$ {\rm PSNR} = 10\log_{10} {\frac{{\rm 255}^2}{{\rm MSE}}} $
(6) where MSE is mean squared error between two images of
$ I_1 $ and$ I_2 $ :$ {\rm MSE} = \frac{\sum_{M, N}[I_1((m, n)-I_2(m, n)]^2}{M\times N}.$
(7) Here, M and N are the number of rows and columns in the input images, respectively. Equation (6) shows that minimizing
$ L_2 $ loss tends to maximizing the PSNR value.Table 1 summarizes the detailed performance comparison of some typical deep learning based SISR models, including SRCNN[17], VDSR[18], DRCN[19], DRRN[20], RED30[21], RCAN[24], SRCliqueNet[25], RDN[26], CARN[28], IDN[29], LapSRN[30], EDSR[32], Zero Shot[36], and MemNet[39]. The detailed performance comparison of those models is presented in Table 2. The four standard benchmark datasets are used including SET5[47], SET14[48], B100[49], URBAN100[2] which are popularly used for comparison of SR algorithms. The down-sampling scale factor used include 2x, 3x, and 4x, and missing information that was not provided by the authors is marked by [-]. All quantitative results are duplicated from the original papers.
From Table 1, Table 2 and Fig. 11, CARN stand out through their high accuracy using small model. SRCliqueNet+ and RCAN+ achieved higher accuracy in comparison with EDSR in term of PSNR/SSIM measurement whilst requiring smaller model size. GAN-based models are in favour of perceptual reconstruction, which we do not include in Table 2 and Fig. 11.
-
Generally, when a random variable X has been observed, the aim is to predict the random variable Y as the output of the network. Let g(X) be the predictor, clearly we would like to choose g so that g(X) tends to be close to Y via the maximum likelihood estimation (MLE). One possible criterion for closeness is to choose g to minimize
$ E[(Y-g(X))^2] $ , thus the optimal predictor of Y becomes$ g(X) = E[Y|X] $ as the mean conditional expectation of Y given X. Most of the objective functions originally comes from MLE and we will show that the typical objective functions below are special cases of MLE. -
By using CNNs, the mapping between a pair of corresponding LR and HR images is non-linear. The classical content loss function for the regression problem are LAD (least absolutes deviations) (or
$ L_1 $ ) and LSE (least squared errors) (or$ L_2 $ ) defined as$ L_1 = \sum\limits_{i = 1}^n|\hat{y} - y| $
(8) $ L_2 = \sum\limits_{i = 1}^n(\hat{y} - y)^2 $
(9) where the estimation of y can be defined as
$ y = W^{\rm T} x $ and$ {\hat{y}}$ is the ground truth. This objective function is to minimize the cost function with regard to the weight matrix W. If we could write the regression target as$ {\hat{y}}= {y} + \xi $ and the model regression target as a Gaussian random variable y$ \sim N(\mu,\;\sigma^2) $ with$ \mu = $ y$ = W^{\rm T} x $ , the prediction model is$\begin{split} P(\hat{y}|x, W) & = N(\hat{y}|W^{\rm T} x, \sigma^2)= \\ & \frac{1}{\sqrt{2 \sigma^2 \pi}}\exp\left(-\frac{(\hat{y} - W^{\rm T} x)^2}{2\sigma^2}\right) \end{split}$
(10) then, the optimum W can be determined by using the maximum likelihood estimation (MLE):
$\begin{split}W_{\rm MLE}& = \arg\max\limits_{\substack{W}} N(\hat{y}|W^{\rm T} x, \sigma^2) =\\ & \arg\max\limits_{\substack{W}}\exp\left(-\frac{(\hat{y} - W^{\rm T} x)^2}{2\sigma^2}\right).\end{split}$
(11) Taking the logarithm of the likelihood function, and making use of the standard form (
$ \sigma = 1 $ ), we obtain the objective function:$ W_{\rm MLE} = \arg\min\limits_{\substack{W}}\frac{1}{2}(\hat{y} - W^{\rm T} x)^2 $
(12) which is equal to the minimum the loss function
$ L_2 $ in (9). In other words, least square estimate is actually the same as the maximum likelihood estimate under a Gaussian model. We have to replace the$ L_2 $ loss function with$ L_1 $ loss:$ E[(Y-g(X)] $ as mentioned previously, the solution is$ g(x) = median(Y|X) $ , which is also a solution for MLE. It is important to bear in mind that the assumption is for uni-modal distribution with a single peak, which will not work well to predict multi-modal distributions. Another problem with content loss is that a minor change in pixels, for example shifting, can lead to a dramatically decreased PSNR. This problem has been mentioned in our previous work[50] with experimental results. -
A key relationship between images and statistics is that we can interpret images as samples from a high-dimensional probability distribution. The probability distribution goes over the pixels of images and is what we use to define whether an image is natural or not. This is when a Kullback-Leibler (KL) divergence measurement comes into place. It measures the difference between two probability distributions, which is different from the Euclidean distance, i.e.,
$ L_1, L_2 $ loss. It may be tempting to think of it as a distance metric, but, we cannot use KL divergence to measure distance between two distributions because it is not symmetric. Given two distribution$ P_{data} $ and$ P_{model} $ , the forward KL Divergence can be computed as follow:$\begin{split}& D_{KL}[P_{x|data}||P_{x|model}] = E_{x \sim P_{data}}\log \frac{P_{x|data}}{P_{x|model}}=\\ &\quad E_{x \sim P_{data}}[\log P_{x|data}] - E_{x \sim P_{data}}[\log P_{x|model}].\end{split}$
(13) The left term is entropy of
$ P_{x|data} $ which is dependent on the model and thus can be ignored. If we sample N of$ x \in P_{x|data} $ when N goes to infinity, following by the law of large numbers we have$-\frac{1}{N}\sum\limits_{i}^n \log P(x_i|model) = -E_{x \sim P_{x|data}}[P(x|model)] \!\!\!$
(14) where the right term is negative log-likelihood. The minimum Kullback-Leibler divergence is also equivalent to the maximum the Log likelihood.
When
$ P_{model} = P_{data} $ the KL divergence comes to the minimum 0. It is assumed that human observers learn$ p_{data} $ as a natural distribution or a kind of prior belief. The GAN-based model is to encourage reconstructed images to have similar distributions to the ground truth images, which refer to adversarial loss as part of the perceptual loss in SRGAN[43]. Adversarial learning is actually useful when facing the complicated manifold distributions in natural images. However, training a GANs-based model is elusive due to several drawbacks:1) Hard to achieve Nash Equilibrium[51]: According to game theory, the GANs-based model converges when the discriminator and generator reach a Nash Equilibrium. However, updating each model with no respect to each other cannot guarantee the convergence. Both models can reach a state when the action of each model does not matter to each other.
2) Vanishing problem[52]: As given in (5), when the discriminator knows better we can assume that
$ D(x) = 1,\;\forall x \in p_{data} $ and$ D(x) = 0,\;\forall x \in p_{p_z} $ and the loss function falls to 0 and ends up with a vanishing gradient. As a result, the learning is super slow and even jammed. Conversely, when the discriminator behaves badly, the generator does not give accurate feedback.3) Mode collapse[53]: a generator generates a limited diversity of samples, or even the same sample regardless of the input. We have demonstrated that L1 and L2 loss are special cases of MLE and further KLD is equivalent to MLE. This finding leads to a question whether there exists another effective representation of MLE which is a better representation for image super resolution.
-
The MSE in feature space is to compare two images based on high-level representations from pre-trained convolutional neural networks (trained on image classification tasks, e.g., the ImageNet Dataset, as given in Fig. 12).
Figure 12. Model structure for calculating perceptual loss[45]
Given an input image x, Image Transform Net transforms it into the output image
$ \hat{y} $ . Rather than matching the pixels of output image to the pixels of the target image, they were encouraged to have similar feature represents as measured by loss network. The perceptual loss was defined by computing MSE between later set of activations, particularly in applied super-resolution or style transfer. In practice, we can combine different kinds of loss functions, but, each loss function mentioned has a particular property. There is not a single loss function that works for all kinds of data. -
Despite the success of deep learning for SISR tasks, there are open research questions regarding SISR model design as discussed below:
1) Need for light structure model: Although deeper is better, most recent SISR models contain no more than a hundred layers due to the overfitting problem. This is because SISR models work on pixel level, which requires many more parameters than that of image classification. As the model is getting deeper, the vanishing gradient is becoming more challenging. This suggests the preference of a light structure model with fewer parameters and less computation.
2) Adapt well to unknown degradation: Most algorithms highly depend on predetermined assumptions that LR images are simply down-sampling from HR images. They were unsuccessful in recovering SR images with big scale factors due to the lack of learnable features on LR images. If noise is present, the accuracy of reconstruction is deteriorated as a result of the increasing ill-posed problems. A good way to feasibly deal with unknown degradation is to use transfer learning or a huge number of training examples. However, there has been little research on this task hence this needs be further investigated.
3) Requirement for different assessment criteria: No methods can achieve low distortion and good perceptual quality at the same time. The traditional measurements such as L1/L2 loss can help to generate images with low distortion, but there is still considerable disagreement with regard to human perception. In contrast, the integration of perceptual assessment produces more realistic images, but it suffers from low PSNR. Therefore, it is necessary to extend more criteria of assessment for particular applications.
4) Efficiently interpret and exploit prior know- ledge to reduce ill-posed problems: Until recently, the deep architecture appears like a black box and we have limited knowledge of why it works and how it works. Meanwhile, most SISR algorithms have introduced different structures or connections based on the experiments, neglecting to explain further on why the result is improved. Another important solution for ill-posed problems is to combine different constraints as regulizers for prediction. For example, the combination of different loss functions, or the use of image segmentation information to constrain reconstructed images. That is why a semantic categorical prior[54] was introduced, attempting to achieve richer and more realistic textures. The simple ways to use more prior knowledge are that we can use MLE as a proxy to incorporate prior knowledge as conditional probability or feed directly into the network whilst forcing parameters sharing for all kinds of inputs.
-
This survey has reviewed key papers in single image super-resolution that underly example-based learning methods. Among these, we noticed that deep learning based methods have recently achieved state-of-the-art performance. Before going into more detail of each algorithm, the general background in each of the categories was introduced. We have highlighted important contributions of these algorithms, discussed their pros and cons and suggested future work possible either within categories or in designated sections. Up to now, we cannot define which SISR algorithms are the most state-of-the-art, as this is highly dependent on applications. For instance, an algorithm which is good for medical imaging or facing processing purposes is not necessarily effective for remote sensing images. The different constraints imposed in a problem indicates a need to generate a benchmark database that specifies the concerns of applications in different fields. Finally, there are outstanding challenges to exploit algorithms in practical applications since they have been mainly applied to standard benchmark datasets and poorly adapted to different scenarios. This survey paper has enhanced the understanding of deep learning based algorithms applied to single image super-resolution, which can be used as a comprehensive guide for the beginner and throws up many questions in need of further investigation.
-
The authors would like acknowledge the support from the Shanxi Hundred People Plan of China and colleagues from the Image Processing Group in Strathclyde University (UK) , Anhui University (China) and Taibah Valley (Taibah University, Saudi Arabia) respectively, for their valuable suggestions.
Deep Learning Based Single Image Super-resolution: A Survey
- Received: 2019-01-10
- Accepted: 2019-04-19
-
Key words:
- Image super-resolution /
- convolutional neural network /
- high-resolution image /
- low-resolution image /
- deep learning
Abstract: Single image super-resolution has attracted increasing attention and has a wide range of applications in satellite imaging, medical imaging, computer vision, security surveillance imaging, remote sensing, objection detection, and recognition. Recently, deep learning techniques have emerged and blossomed, producing " the state-of-the-art” in many domains. Due to their capability in feature extraction and mapping, it is very helpful to predict high-frequency details lost in low-resolution images. In this paper, we give an overview of recent advances in deep learning-based models and methods that have been applied to single image super-resolution tasks. We also summarize, compare and discuss various models from the past and present for comprehensive understanding and finally provide open problems and possible directions for future research.
Citation: | Viet Khanh Ha, Jin-Chang Ren, Xin-Ying Xu, Sophia Zhao, Gang Xie, Valentin Masero and Amir Hussain. Deep Learning Based Single Image Super-resolution: A Survey. International Journal of Automation and Computing, vol. 16, no. 4, pp. 413-426, 2019. doi: 10.1007/s11633-019-1183-x |