Intrinsic image decomposition based on scale space transformation

Summary

We introduce a new network structure for decomposing an image into its intrinsic reflectance and shading images. We treat the task as an image-to-image translation problem and decompose both the input and the output in scale space. By expanding the output images (reflectance and shading) into their Laplacian pyramid components, we develop a multi-channel network structure that learns an image-to-image transfer function in parallel within each channel; within each channel, the transfer function is represented by a convolutional neural network with skip connections. The network structure is general and scalable, and shows excellent performance on the intrinsic image decomposition problem. We evaluate the network on two benchmark datasets: the MPI-Sintel dataset and the MIT Intrinsic Images dataset. Both quantitative and qualitative results show that our model achieves a clear improvement over the previous state of the art.

1: Introduction

An emerging trend in representation learning is to decompose images into individual components that account for factors of variation such as illumination, pose, and attributes. However, one of the elementary forms of this problem, decomposing an image into its intrinsic reflectance and shading images, has not received enough attention. A solution to the intrinsic image decomposition problem would enable material editing, provide cues for depth estimation, and offer a computational account of the long-standing problem of brightness constancy in human perception. Despite the exciting progress made so far, the problem remains challenging.

Part of the difficulty comes from the ill-posed nature of the problem. Using prior knowledge about reflectance and shading images, the Retinex algorithm reduces the decomposition to a thresholding problem in the gradient domain. This model is effective in practice, but it cannot handle complex materials, sharp geometric edges, or shadows cast by strong point lights. Another part of the challenge is the complexity of image formation: scene materials, geometry, and lighting are converted into a 2D image through an intricate optical process, and intrinsic image decomposition attempts to partially invert that process.
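To make the Retinex heuristic mentioned above concrete, here is a toy sketch in Python/NumPy (not from the paper): log-domain gradients above a threshold are attributed to reflectance and the rest to shading. Recovering the actual reflectance image would further require re-integrating the thresholded gradient field, e.g. by solving a Poisson equation; the threshold value is illustrative.

```python
import numpy as np

def retinex_gradient_threshold(image, threshold=0.05, eps=1e-6):
    """Toy sketch of the classical Retinex heuristic for a single-channel
    image in [0, 1] (not the paper's method)."""
    log_i = np.log(image.astype(np.float64) + eps)

    # Forward differences of the log-image along columns (x) and rows (y).
    gx = np.diff(log_i, axis=1, append=log_i[:, -1:])
    gy = np.diff(log_i, axis=0, append=log_i[-1:, :])

    # Large gradients -> reflectance edges; small gradients -> smooth shading.
    refl_gx = np.where(np.abs(gx) > threshold, gx, 0.0)
    refl_gy = np.where(np.abs(gy) > threshold, gy, 0.0)
    shad_gx, shad_gy = gx - refl_gx, gy - refl_gy
    return (refl_gx, refl_gy), (shad_gx, shad_gy)
```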

In our work, we use a deep neural network as a function approximator to learn the mapping for intrinsic image decomposition within an image-to-image translation framework. Although models with similar ideas have been proposed (e.g. [1, 2]), our model explores the scale space of the network inputs and outputs, and learns sub-band transformations by extending the function approximation pipeline horizontally into a set of parallel branches.

The main contribution of our work is a generative network based on scale-space decomposition for intrinsic image decomposition. We achieve this by using learnable up/down-samplers to build classical Gaussian and Laplacian pyramid structures. Our final model is a composite network that generates a scale-by-scale decomposition of the output reflectance and shading images; each scale is predicted by a sub-network, and the results of these sub-networks are combined into the final output.

We also propose a new loss function that effectively preserves image details while respecting the smoothness of the reflectance image and the distinct properties of illumination. We further propose a data augmentation method to combat the scarcity of labeled data: inspired by Breeder learning, we use a pre-trained network to generate labels for unlabeled images, apply perturbations to these images to create a new dataset, and finally use the generated dataset to improve the performance of our network.

We evaluate our model on the MPI-Sintel dataset and the MIT Intrinsic Images dataset. Experimental results demonstrate the effectiveness of the proposed model: our final model significantly outperforms previous methods on various evaluation metrics.

2: Our approach

Let us first consider the transformation of an input image I into an output image A as a complex, highly nonlinear and non-local pixel-level mapping function I → f(I). It is well established that deep convolutional neural networks are a general and practical parameterization and optimization framework for a wide range of mappings, from image classification to image-to-language translation. Now consider how to adapt the network architecture to the image-to-image translation setting, where both the input and the output are images with a natural level-of-detail (LOD) pyramid structure, and where the mapping function may also link input to output along this pyramid hierarchy through a multi-channel decomposition. In Section 2.1 we describe the evolution of the model, starting from the ResNet architecture and exploiting this property to arrive at our final multi-channel, layered network architecture.

We write the Gaussian pyramid of an image I as [I_0, I_1, ..., I_K], where I_0 = I and K is the number of decomposition levels. We denote the k-th Laplacian pyramid layer by L_k(I) = I_k − u(I_{k+1}), where u is the upsampling operator. By definition, the Laplacian pyramid expansion of an image is I = [L_0(I), L_1(I), ..., L_{K−1}(I), I_K], where L_0(I) is the finest-scale component and I_K is the lowest-scale (coarsest) layer of the Gaussian pyramid.
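The sketch below illustrates this notation with fixed (non-learned) operators: average pooling stands in for the Gaussian downsampling, and bilinear interpolation for the upsampling operator u. The paper instead learns these operators, so this is only an illustration of the decomposition and its reconstruction.

```python
import torch.nn.functional as F

def gaussian_pyramid(img, num_levels):
    """Gaussian pyramid [I_0, ..., I_K] of an (N, C, H, W) tensor, using
    fixed average-pool downsampling (the paper learns these operators)."""
    levels = [img]
    for _ in range(num_levels):
        levels.append(F.avg_pool2d(levels[-1], kernel_size=2))
    return levels

def laplacian_pyramid(img, num_levels):
    """Laplacian expansion [L_0(I), ..., L_{K-1}(I), I_K], with
    L_k(I) = I_k - u(I_{k+1}) and u = bilinear upsampling here."""
    gauss = gaussian_pyramid(img, num_levels)
    laps = []
    for k in range(num_levels):
        up = F.interpolate(gauss[k + 1], size=gauss[k].shape[-2:],
                           mode='bilinear', align_corners=False)
        laps.append(gauss[k] - up)
    laps.append(gauss[-1])           # coarsest Gaussian layer I_K
    return laps

def reconstruct(laps):
    """Invert the expansion: upsample and add components, coarse to fine."""
    img = laps[-1]
    for lap in reversed(laps[:-1]):
        img = lap + F.interpolate(img, size=lap.shape[-2:],
                                  mode='bilinear', align_corners=False)
    return img
```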

2.1 Evolution of network structure

First, let us model the mapping I → f(I) using a simplified network of two stacked blocks, L and H: L models the mapping in the low-frequency band, while H handles the high-frequency band as well as any residuals (Fig. 2-a). The detailed definitions of L and H are omitted here. By skip-connecting and summing the output of L with the output of H, the network becomes an instance of the ResNet architecture.

Next, by applying the Laplacian pyramid expansion to the output, we can split the loss of (a) into two components: the output of L is constrained to fit the low-frequency Gaussian component, while the output of H is fitted separately to the Laplacian detail component (Fig. 2-b). This reformed network is equivalent to (a) but with stricter constraints.

A key transition is from (b) to (c): by routing the output of L to the output of H, the two stacked blocks can be rewired into two parallel branches, with the loss on H adjusted accordingly. The resulting network structure (c) is equivalent to (b); the two represent equivalent forms of the Laplacian decomposition equation, obtained by moving the residual component from the left-hand side to the right-hand side and changing its sign. The loss on L in (c) remains as a regularization term, and our experiments find that it is optional. The network structure (d) is the basis of our final extended model.

Finally, in (d), generalizing this structure to a full Laplacian pyramid decomposition, we introduce multiple sub-network blocks H0, H1, ..., HK−1 for the high-frequency bands and one sub-network block LK for the low-frequency band: the network inputs are produced by a cascade of downsampling operators, and the outputs of the network blocks are upsampled and aggregated from left to right to form the target output. All parameters of the downsampling and upsampling operators (grey shaded trapezoids in Fig. 2) are learned within the network. All network blocks share the same architectural topology, which we call a "residual block" and describe in detail in Section 2.2.
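A minimal PyTorch sketch of this multi-branch layout is given below. The residual block of Section 2.2 is passed in as a factory (make_block), and the learned down/up-samplers are modelled as strided and transposed convolutions, which is our assumption rather than the paper's exact operators; input sizes are assumed to be divisible by 2^K.

```python
import torch.nn as nn

class PyramidDecompositionNet(nn.Module):
    """Minimal sketch of the multi-branch structure in Fig. 2(d)."""
    def __init__(self, make_block, num_levels=4, channels=3):
        super().__init__()
        self.num_levels = num_levels
        self.blocks = nn.ModuleList([make_block() for _ in range(num_levels + 1)])
        self.down = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels)])
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
            for _ in range(num_levels)])

    def forward(self, x):
        # Cascade of learned downsamplers produces the multi-scale inputs.
        inputs = [x]
        for down in self.down:
            inputs.append(down(inputs[-1]))

        # The coarsest block predicts the low-frequency image; each finer
        # block predicts a Laplacian detail component that is added to the
        # upsampled partial result (aggregation from coarse to fine).
        out = self.blocks[-1](inputs[-1])
        for k in reversed(range(self.num_levels)):
            out = self.blocks[k](inputs[k]) + self.up[k](out)
        return out
```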

2.2 Residual block

Residual blocks are end-to-end convolutional sub-networks that share the same topology and transform inputs at different scales into the corresponding Laplacian pyramid components. Each residual block consists of 6 consecutively stacked Conv(3x3)-ELU-Conv(3x3)-ELU sub-structures. Since we predict per-pixel values at the resolution of the input image, no fully connected layers are used. We adopt the skip-connection scheme popular in recent research, including variants of the DenseNet architecture of Huang et al. Specifically, within each sub-structure, the output of the last Conv is added element-wise to the input via a skip connection, and the sum is fed into the last ELU unit. The intermediate layers have 32 feature channels, and the output is a 3-channel image or residual image. A 1x1 Conv is inserted on the skip-connection paths of the first and last sub-structures to expand/reduce the channel dimension so that it matches the output of the main path.
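The sketch below is one reading of this description in PyTorch: six Conv(3x3)-ELU-Conv(3x3)-ELU sub-structures, an additive skip connection feeding the final ELU of each sub-structure, 32 intermediate channels, and 1x1 convolutions on the skip paths where channel counts change. The exact layer counts and wiring in the paper may differ.

```python
import torch.nn as nn

class SubStructure(nn.Module):
    """Conv(3x3)-ELU-Conv(3x3)-ELU with an additive skip connection feeding
    the final ELU; a 1x1 conv matches channel counts when the input and
    output widths differ (first and last sub-structure)."""
    def __init__(self, in_ch, out_ch, mid_ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.elu1 = nn.ELU()
        self.conv2 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)
        self.elu2 = nn.ELU()
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        y = self.conv2(self.elu1(self.conv1(x)))
        return self.elu2(y + self.skip(x))      # skip added before last ELU

class ResidualBlock(nn.Module):
    """Six stacked sub-structures, 32 feature channels in the middle and a
    3-channel (residual) image at the output."""
    def __init__(self, in_ch=3, mid_ch=32, out_ch=3, num_sub=6):
        super().__init__()
        widths = [in_ch] + [mid_ch] * (num_sub - 1) + [out_ch]
        self.subs = nn.Sequential(*[
            SubStructure(widths[i], widths[i + 1], mid_ch)
            for i in range(num_sub)])

    def forward(self, x):
        return self.subs(x)
```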

We use the Exponential Linear Unit (ELU) as our activation function instead of ReLU with batch normalization, because ELU produces negative activation values when x < 0 and yields approximately zero-mean activations, both of which improve robustness to noise and ease training as the network gets deeper. In addition, we remove the BN layers because their effect can be partially replaced by ELU, which makes the network roughly 2x faster and more memory-efficient.

2.3 Loss function

Data loss: The data loss measures the pixel-level similarity between the predicted image and the ground truth. Instead of a plain pixel-wise mean squared error, we adopt a term based on joint bilateral filtering (also known as cross bilateral filtering [3, 4]), combined with the constraint that the product of the predicted albedo and shading should match the input.
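As a simplified stand-in for this data term, the sketch below combines a per-pixel fidelity term with the albedo-times-shading reconstruction constraint; the joint bilateral filtering used in the paper is omitted here, and the L1 distance and weighting are assumptions.

```python
import torch.nn.functional as F

def data_loss(pred_albedo, pred_shading, gt_albedo, gt_shading, image,
              recon_weight=1.0):
    """Simplified stand-in for the data term: fidelity to the ground truth
    plus the constraint that albedo * shading reproduces the input image."""
    fidelity = (F.l1_loss(pred_albedo, gt_albedo)
                + F.l1_loss(pred_shading, gt_shading))
    reconstruction = F.l1_loss(pred_albedo * pred_shading, image)
    return fidelity + recon_weight * reconstruction
```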

Perceptual loss: High-level semantic structure should also be preserved during the translation, so we use a CNN-feature-based perceptual loss. We exploit the standard VGG-19 [5] network and define the perceptual loss as a distance between the VGG-19 feature activations of the prediction and those of the ground truth.
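A generic VGG-19 feature-matching loss of this kind might look like the sketch below, using torchvision's pretrained VGG-19; the specific layers, distance, and weights used in the paper are not given here, so these choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class VGGPerceptualLoss(nn.Module):
    """Generic VGG-19 feature-matching loss (layer choice and distance are
    assumptions; inputs are expected to be ImageNet-normalized RGB)."""
    def __init__(self, layer_idx=(3, 8, 17, 26)):
        super().__init__()
        # Requires torchvision >= 0.13 for the `weights` argument.
        vgg = torchvision.models.vgg19(weights='IMAGENET1K_V1').features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_idx = set(layer_idx)

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_idx:
                loss = loss + F.l1_loss(x, y)
        return loss
```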

Variational loss: Finally, we smooth the output using a variational term computed from differences between neighboring pixels along the image rows and columns, indexed by i and j respectively.
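One common instantiation of such a smoothness term is an anisotropic total-variation penalty over row and column differences, sketched below; whether the paper uses exactly this form is not stated, so treat it as illustrative.

```python
import torch

def total_variation_loss(img: torch.Tensor) -> torch.Tensor:
    """Anisotropic total-variation smoothness term over rows (i) and
    columns (j) of a batched image tensor of shape (N, C, H, W)."""
    di = img[:, :, 1:, :] - img[:, :, :-1, :]   # differences along rows i
    dj = img[:, :, :, 1:] - img[:, :, :, :-1]   # differences along columns j
    return di.abs().mean() + dj.abs().mean()
```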

2.4 Data augmentation training

In this section, we describe a data augmentation strategy for incorporating unlabeled images into the network training process through self-augmentation. We draw inspiration from Breeder learning. The idea is to employ a forward generative model to create new training pairs by perturbing the parameters predicted by the model. This mechanism carries the spirit of bootstrapping to a certain extent and has proven quite effective. For example, Li et al. recently applied this strategy to appearance modeling networks by generating training images based on the model's predicted reflectance for unlabeled images.

We start with a naive network, trained on a moderately sized dataset with ground-truth reflectance and shading images. We then apply the network to a new set of images and obtain estimated albedo A and shading S. Through a simple synthesis procedure, we can generate new images from these estimates. Note that, according to our loss definition, A and S are not strictly constrained to reproduce the input image exactly, so the newly synthesized image will deviate from the original.

To introduce further perturbations into the augmented dataset, we additionally apply an adaptive manifold filter (AMF) to A and S, and use the filtered results to synthesize new data. The AMF operator suppresses noise and unwanted details in A and S, which may originate from the input image or be produced by the still-immature network, and serves to adjust the diversity of the newly synthesized images and their ground truth.
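The overall self-augmentation loop could be sketched as follows; the two-headed model interface and the perturb callable (standing in for the adaptive manifold filter) are hypothetical placeholders.

```python
import torch

def synthesize_augmented_pairs(model, unlabeled_loader, perturb=None):
    """Sketch of the self-augmentation loop.

    `model` is assumed to map an image batch to (albedo, shading); `perturb`
    is an optional callable standing in for the adaptive manifold filter.
    New inputs are re-synthesized as albedo * shading, so labels and inputs
    stay consistent by construction.
    """
    model.eval()
    augmented = []
    with torch.no_grad():
        for images in unlabeled_loader:
            albedo, shading = model(images)
            if perturb is not None:
                albedo, shading = perturb(albedo), perturb(shading)
            new_images = albedo * shading          # synthesized inputs
            augmented.append((new_images, albedo, shading))
    return augmented
```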

3: Experiments

In this section, we describe the evaluation of our model on the MPI-Sintel dataset and the MIT Intrinsic Images dataset.

3.1 Datasets

The MPI-Sintel dataset consists of 18 scene-level computer-generated image sequences; 17 of them contain 50 images each and one contains 40 images. The ResynthSintel version is used in our experiments because its data satisfy the constraint A×S = I. Two types of train/test splits (scene split and image split) are used for one-to-one comparison with previous work. The scene split divides the dataset at the scene level, with half of the scenes used for training and the other half for testing. The image split randomly picks half of the images for training and half for testing, regardless of their scene category. The original version of the MIT Intrinsic Images dataset contains 20 different objects captured in a laboratory setting, each under 11 different lighting conditions.

3.2 Experimental results on the MPI-Sintel dataset

The evaluation results on the MPI-Sintel dataset are shown below. Our model improves over previous methods, especially in the scene-split tests, where the network is less able to "overfit" to the test data.

Comparison with previous work: We first compare our model with a series of previous methods, including two simple baselines (Constant Shading and Constant Albedo), several traditional methods, and the latest state-of-the-art neural-network-based models. The results show that our model, with or without data augmentation training, performs best under all three evaluation metrics.

We wish to point out that the quantitative results of all methods under the Sintel image split can be somewhat misleading. Image sequences from the same scene in the Sintel dataset are very similar to each other, so when the data are split at the image level (images of the same scene may appear in both the training and test sets), a network trained on the full training set will still achieve "good performance" on the image-split test set. The scene split does not have this problem. An interesting observation from our experiments is that our margin over previous results is larger under the scene split than under the image split. In the table, while we retain a fairly modest margin on the image split, our margin on the scene split is as high as 25% in si-MSE and 43% in DSSIM, suggesting that our network structure performs well even in the more challenging setting.

From sequential to parallel structure: An important architectural reform described in Section 2.1 is the move from a sequential structure to a multi-branch parallel structure. This reformation unfolds a deeply stacked network into a set of parallel channels, thus alleviating the gradient back-propagation problem. The row (Ours Sequential) shows the results of the sequential architecture (a) in Figure 2. It yields performance comparable to previous works, but falls short of our final model, especially on the DSSIM metric.

Hierarchical vs. joint optimization: Another optimization choice in our work is whether to impose a constraint (loss) at each Laplacian pyramid level, or to remove these per-level constraints and train all network channels simultaneously with a single loss. We call the latter scheme joint optimization and the former hierarchical optimization. Some numerical results are included in the supplementary material, which describes hierarchical optimization in more detail. In Tables 1-2, the experiments show a 10%-15% improvement across all metrics for the joint optimization scheme.

Self-comparisons on other factors: We also conduct a set of controlled self-comparisons on other factors, including the pyramid structure, the loss function, alternative network inputs, and data augmentation.

Pyramid structure: The experiment (Ours w/o Pyramid) shows results using a single-channel network, i.e., a single residual block that generates the output directly from the input without the multi-channel decomposition structure. The results in Tables 1 and 2 show that our pyramid-structured model improves by more than 30% over this control setting. Note that as the number of pyramid levels increases, the network complexity grows only linearly, up to a constant factor.

Loss function: The experiment (Ours w/ MSE loss) shows results with our loss function replaced by the classic MSE loss. The quantitative error under the scale-invariant MSE metric does not degrade by a large factor. However, the qualitative results in the supplementary material reveal that the MSE loss produces blurry edges. The structure-based metric (DSSIM) also shows a clearer margin (from 10.23 down to 8.86 on the scene split) between the MSE loss and our loss.

CNN features as input: We further investigate the effect of feeding Gaussian pyramid image components to our network, since most existing multi-scale deep networks instead use multi-scale features produced by a CNN. The experiment (Ours w/ 'FPN' input) shows the results of using CNN features as input, following the FPN approach. The comparison shows a slight but not decisive advantage for our final model, implying that the CNN's high-level features still preserve most of the semantic information needed by our pixel-to-pixel translation network.

3.3 Experimental results on the MIT dataset

We also evaluate the performance of our model on the MIT Intrinsic Images dataset and compare with previous methods. The results are shown below. In this set of experiments, we perform data augmentation in two different settings, Ours+DA and Ours+DA+; the difference lies in the data used for augmentation. Ours+DA is the common setting, in which the augmentation data are collected from a dataset by querying a set of similar object category names. In Ours+DA+, the augmentation images are taken under new lighting conditions, which creates a dataset that closely resembles the original one and would normally be virtually impossible to obtain; in other words, it serves as an upper bound on the quality of the augmented data. Our results show that both augmentation settings are effective, while the latter provides clues about the limits of what can be gained from the data augmentation scheme we introduce for this task.

4: Conclusion

We introduce a neural network architecture inspired by the Laplacian pyramid for intrinsic image decomposition. Our network models the problem as image-to-image translation carried out at multiple scales. We conducted experiments on the MPI-Sintel and MIT Intrinsic Images datasets, and the results show that our algorithm achieves strong numerical results and image quality. For future work, we expect the proposed network architecture to be tested and refined on other image-to-image translation problems, such as pixel labeling or depth regression.
