Image Generation Algorithm Based on Stylized Adversarial Autoencoders


In this paper, we propose an autoencoder-based generative adversarial network (GAN) for automatic image generation, which we call the "stylized adversarial autoencoder". Unlike existing generative autoencoders, which usually impose a prior distribution on the latent vectors, our method splits the latent variables into two components, style features and content features, both of which are encoded from real images. This division of the latent vector allows us to adjust the content and style of the generated images arbitrarily by choosing different example images.

In addition, a multi-class classifier is used as the discriminator in this GAN, which makes the generated images more realistic. We conduct experiments on handwritten digit, scene character, and face datasets, and show that stylized adversarial autoencoders achieve excellent image generation results and significantly improve performance on the corresponding supervised recognition tasks.


Generative modeling of natural images is a fundamental research problem in computer vision and machine learning. Early studies paid more attention to the statistical principles of generative modeling, but for lack of effective feature representations, their results were limited to certain specific patterns. Deep neural networks have shown significant advantages in representation learning and have proven effective for discriminative vision tasks such as image classification and object detection, which has led to a series of deep generative models.

Studies have shown that regularized neural networks often outperform unconstrained networks in practice. Commonly used forms of regularization include the L1 norm (LASSO), the L2 norm (ridge regression), and more recent techniques such as dropout. For autoencoder networks in particular, researchers have recently proposed quite a few regularization methods. However, all of these autoencoder regularization methods impose a prior distribution, often Gaussian, on the latent variables (also known as hidden nodes).

This approach works well for relatively simple generative tasks, such as modeling grayscale digit images, but is not suitable for generating complex images such as color alphanumeric character images or human faces. Such images involve a large number of latent variables whose true distribution is unknown and cannot be captured by a simple model.

As shown in Figure 1, in this paper we propose a new generative model named Stylized Adversarial Autoencoder (SAAE), which uses an adversarial approach to train stylized autoencoders.

Unlike existing autoencoders, we divide the latent vector into two parts, one related to image content and the other related to image style. Both content and style features are encoded from example images without using any prior assumptions on the distribution of latent variables. The target image with given content and style can be decoded according to the combined latent variables, which means that we can adjust the output image by selecting different example content and/or style images.

Furthermore, inspired by the methods in [1, 2, 3], we employ an adversarial approach in the model training phase. Instead of using a typical binary classifier as the discriminator, our GAN uses a multi-class classifier that better models the variation in generated images when distinguishing real from fake images. In addition, because GAN training is a min-max game, it is difficult to make it converge, so we empirically develop an efficient three-step training algorithm that improves the convergence of the proposed network.

Figure 1: Illustration of our model, which extracts features from content and style images separately, and then fuses these features to decode the target image. A multi-class discriminator forces the generated images to be more realistic.

The main contributions of this work can be summarized as:

We propose a novel deep autoencoder network, which can separately encode content features and style features from two example images and decode new images from these two features.

A multi-class classifier is used as the discriminator, which better models the variability of the generated images and effectively forces the generative network to produce more realistic results.

We develop a three-step training strategy to ensure the convergence of our proposed stylized adversarial autoencoder.

2: Stylized Adversarial Autoencoder

For convenience, we use character image generation (such as scene text generation) as the background application when introducing our algorithm, and show further applications (such as face generation) in the experimental section. Our goal is to define and train a neural network that generates an image from two example images: a content image c and a style image s. For character image generation, the content image is a synthetic character image without any style, texture, or background (for example, A to Z or 0 to 9), and the style image is an example image, such as a real word image.

As mentioned before, we split the latent variables, which capture the distribution of the real data, into two parts: style features and content features. Content features are derived from content images (via a convolutional network), while style features are derived from style images.

2.1 Generator

The generative network consists of two encoders (Enc_c and Enc_s) and a decoder (Dec). Enc_c encodes content images into content latent representations (features) z_c, and Enc_s encodes style images into style latent representations (features) z_s. Dec decodes the combined latent representation and produces an output image. For convenience, we use the generator G to denote the combination of Enc_c, Enc_s, and Dec.
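The composition of Enc_c, Enc_s, and Dec into G can be sketched as follows. This is a minimal illustration, not the actual network: the helper `make_generator` and the three toy functions are hypothetical stand-ins (the real encoders and decoder are CNNs), and plain list concatenation stands in for the fusion of the two feature sets.

```python
from typing import Callable, List

Vector = List[float]


def make_generator(
    enc_c: Callable[[Vector], Vector],
    enc_s: Callable[[Vector], Vector],
    dec: Callable[[Vector], Vector],
) -> Callable[[Vector, Vector], Vector]:
    """Compose G(c, s) = Dec([Enc_c(c), Enc_s(s)])."""

    def generator(content: Vector, style: Vector) -> Vector:
        z_c = enc_c(content)  # content features
        z_s = enc_s(style)    # style features
        # List concatenation is a stand-in for fusing the two feature sets.
        return dec(z_c + z_s)

    return generator


# Hypothetical toy stand-ins; real encoders and the decoder would be CNNs.
def toy_enc_c(image: Vector) -> Vector:
    return [2.0 * v for v in image]


def toy_enc_s(image: Vector) -> Vector:
    return [sum(image) / len(image)]  # a single "style" value


def toy_dec(z: Vector) -> Vector:
    style_value = z[-1]
    return [v + style_value for v in z[:-1]]


G = make_generator(toy_enc_c, toy_enc_s, toy_dec)

# Same content image, two different style images.
out_style_a = G([1.0, 0.0], [0.5, 0.5])
out_style_b = G([1.0, 0.0], [1.0, 3.0])
```

Keeping the content input fixed and swapping only the style input changes the output, which mirrors how the output image is adjusted by choosing different example images.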

2.2 Discriminator

In existing GANs, the output of the discriminator is the probability y = Dis(x) ∈ [0,1] that the input x is a real image, and the discriminator D is trained to minimize the binary cross-entropy: L_{dis} = −log(Dis(x)) − log(1 − Dis(G(z))).

The goal of G is to generate images that D cannot distinguish from real images, i.e., to maximize L_{dis}. As noted above, existing GANs use a binary classifier in D to determine whether an image is real or generated. However, putting all real images into one large positive class fails to exploit the intrinsic semantic structure of the training images. Therefore, we propose to use a multi-class classifier as the discriminator, which determines whether the input is a generated image or belongs to a particular class of real images (such as a specific character).
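The difference between the binary and the multi-class discriminator objective can be sketched as below. The binary loss follows the L_{dis} formula given above. The multi-class version is an assumption made for illustration: it uses a softmax cross-entropy over k+1 classes, reserves index k for generated images, and trains the generator to have its output classified as the content image's real class. The function names are hypothetical, and the exact loss used by the authors is not stated here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def binary_dis_loss(d_real: float, d_fake: float) -> float:
    """L_dis = -log(Dis(x)) - log(1 - Dis(G(z))) for one real/fake pair."""
    return -math.log(d_real) - math.log(1.0 - d_fake)


def multiclass_dis_loss(
    real_logits: List[float],
    real_class: int,
    fake_logits: List[float],
    num_real_classes: int,
) -> float:
    """Assumed (k+1)-way loss: real images keep their class label,
    generated images are assigned the extra class with index k."""
    fake_class = num_real_classes
    return cross_entropy(real_logits, real_class) + cross_entropy(
        fake_logits, fake_class
    )


def multiclass_gen_loss(fake_logits: List[float], content_class: int) -> float:
    """Assumed generator objective: the generated image should be
    classified as the real class of its content image."""
    return cross_entropy(fake_logits, content_class)


# k = 2 real classes, so the discriminator emits 3 logits.
undecided = multiclass_dis_loss([0.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0], 2)
confident = multiclass_dis_loss([5.0, 0.0, 0.0], 0, [0.0, 0.0, 5.0], 2)
```

With the multi-class form, the discriminator's feedback to the generator depends on which real class the generated image resembles, not only on whether it looks real.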

2.3 Network Architecture

Convolutional neural networks (CNNs) have shown great advantages in feature representation and image generation, and our proposed SAAE network is based on a CNN architecture, as shown in Figure 2.

In fact, our proposed generative network consists of two feature extraction network pipelines followed by a generative network. Both the content feature extractor and the style feature extractor have three convolutional layers without downsampling, so as to preserve as much detail information of the example image as possible. The input style image and content image may have different dimensions.

For example, when generating scene text character images, the content image contains one character, and the style image contains one word or multiple characters. After three convolutional layers, the style feature map is mapped by a fully connected layer into a style feature vector. So that it can be concatenated with the content feature map encoded from the content image, the style feature vector is then mapped back into a feature map with the same size as the content feature map.

The content feature extraction network does not have any fully-connected layers because the two-dimensional spatial information of content images needs to be preserved. We merge content feature maps and style feature maps in the channel dimension, which means that half of the channels of the combined feature map are from content features and the other half are from style features. Afterwards, the generative network uses three convolutional layers to decode the combined feature maps into a target character image.
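The channel-wise merge described above can be sketched with plain nested lists. This is an illustration under stated assumptions: the style vector is assumed to already have channels × height × width entries (for example, as the output of a fully connected layer), feature maps are stored as `[channel][row][col]`, and the helper names `reshape_to_map` and `merge_channels` are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def reshape_to_map(
    vector: List[float], channels: int, height: int, width: int
) -> FeatureMap:
    """Reshape a flat style vector into a feature map (row-major order)."""
    if len(vector) != channels * height * width:
        raise ValueError("vector length must equal channels * height * width")
    it = iter(vector)
    return [
        [[next(it) for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]


def merge_channels(content_map: FeatureMap, style_map: FeatureMap) -> FeatureMap:
    """Concatenate along the channel dimension: content first, then style."""
    same_channels = len(content_map) == len(style_map)
    same_height = len(content_map[0]) == len(style_map[0])
    same_width = len(content_map[0][0]) == len(style_map[0][0])
    if not (same_channels and same_height and same_width):
        raise ValueError("content and style feature maps must have equal shapes")
    return content_map + style_map


# Toy example: 2 content channels of size 2x2, and a style vector of 8 values.
content_map = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
style_vector = [float(i) for i in range(8)]

style_map = reshape_to_map(style_vector, channels=2, height=2, width=2)
merged = merge_channels(content_map, style_map)
```

Here `merged` has four channels: the first two come from the content features and the last two from the style features, matching the half-and-half split of the combined feature map.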

The discriminative network is a common CNN classifier that consists of three convolutional layers, where the first convolutional layer is followed by a 2×2 max pooling layer, and the last convolutional layer is followed by two fully connected layers. The output layer of the discriminator is a (k+1) dimensional vector representing the probability that the input image belongs to each class (real images have k classes, fake images account for 1 class).

We apply batch normalization to each convolutional layer, which speeds up convergence during training. Every layer except the last uses Leaky ReLU as its activation; the last layer uses a sigmoid to project each output into the [0,1] interval (as a probability).

2.4 Training Strategy

Inspired by the step-by-step training used in [4], we propose a three-step training strategy to optimize our model. This three-step optimization strategy can help us get stable training results.

3: Experiment

We evaluated our method in four ways: computing the log-likelihood on the MNIST dataset to measure how well the SAAE model fits the data distribution; demonstrating visual attribute transfer on the face generation task; evaluating samples from SAAE models on the scene text and license plate datasets of Section 3.3; and generating training data for supervised recognition tasks.

3.1 Log likelihood analysis

Inspired by the evaluation procedure in [1,3], we evaluate how well SAAE fits the data distribution as a generative model by computing the log-likelihood of the MNIST test set under the distribution estimated from generated images.

Table 1 compares the log-likelihood results of SAAE with six state-of-the-art methods. Our method achieves the best result on this criterion, outperforming AAE by about 89.

Table 1: Log-likelihoods of the test data on the MNIST dataset. Higher values are better. The last two rows of results are from our method using a binary discriminator and a multiclass discriminator, respectively. The values reported here are the average log-likelihood of the samples on the test set and the standard error of the mean calculated over multiple experiments.
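How such a log-likelihood and its standard error might be computed can be sketched as below. The estimator is not named in this text, so the sketch assumes a Gaussian Parzen window fitted to generated samples, which is a common choice for evaluating GAN-style models on MNIST. The bandwidth `sigma` and the function names are hypothetical; in practice the bandwidth is usually selected on a validation set.

```python
import math
from typing import List, Sequence, Tuple


def parzen_log_likelihood(
    x: Sequence[float], samples: Sequence[Sequence[float]], sigma: float
) -> float:
    """Log-density of x under an isotropic Gaussian Parzen window
    centered on the generated samples (assumed estimator)."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    exponents = [
        -sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    # log-mean-exp for numerical stability
    m = max(exponents)
    log_mean_exp = m + math.log(
        sum(math.exp(e - m) for e in exponents) / len(exponents)
    )
    return log_norm + log_mean_exp


def mean_and_sem(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (needs at least two values)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy usage: 2-D "images", three generated samples, two test points.
generated = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
test_set = [[0.4, 0.6], [0.9, 1.1]]
scores = [parzen_log_likelihood(x, generated, sigma=0.5) for x in test_set]
average_ll, sem = mean_and_sem(scores)
```

Higher average values mean the test data lies closer to regions where generated samples are dense.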

Following the previous approach, we show some samples from the trained SAAE generator in Figure 3. The last column is the closest training image (measured by pixel-level Euclidean distance) to the generated image in the penultimate column, to demonstrate that the SAAE model does not memorize the training set.
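The nearest-neighbor check described above can be sketched as follows, assuming images are flattened into equal-length lists of pixel values. The function name `nearest_training_image` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def nearest_training_image(
    generated: Sequence[float], training_set: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index, distance) of the training image closest to the
    generated image under pixel-level Euclidean distance."""
    if not training_set:
        raise ValueError("training_set must not be empty")
    best_index, best_sq = -1, math.inf
    for i, image in enumerate(training_set):
        if len(image) != len(generated):
            raise ValueError("all images must have the same number of pixels")
        sq = sum((g - t) ** 2 for g, t in zip(generated, image))
        if sq < best_sq:  # strict comparison keeps the first of any ties
            best_index, best_sq = i, sq
    return best_index, math.sqrt(best_sq)


# Toy usage with 2-pixel "images".
training = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
index, distance = nearest_training_image([0.9, 1.2], training)
```

A clearly nonzero distance to the closest training image suggests that the generated sample is not a copy of a training example.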

3.2 Face generation based on attribute conditions

We evaluate the performance of our model on the face image generation task on the Labeled Faces in the Wild (LFW) dataset.

As shown in Figure 4, the generated samples are visually consistent with the attribute transfer. For example, if you change an attribute such as "glasses", the overall appearance will still be preserved, but the eye area will be different.

3.3 Model samples

We evaluate our SAAE model on the IIIT 5k-word (IIIT5K) dataset and the Chinese car license plate (PLATE) dataset.

Figure 5 shows a random sample of images generated by our model and the DCGAN model, along with the training data for comparison. The samples generated by SAAE look more character-like and have sharper edges and backgrounds.

To visualize the stylization properties of our stylized adversarial autoencoder, we show several sets of generated samples in Figure 6, both on the IIIT5K and PLATE datasets. In each dataset, we selected an example style image and iterated over all content images and labels. The results show that the SAAE model can transfer the character style of the example style image to the content image.

Figure 6: Samples generated given a style image. Upper row: IIIT5K dataset. Bottom row: PLATE dataset. For each set of generated samples, the style image is given in the upper left corner, marked with a red box. For the PLATE dataset, we hide the first Chinese character of the car license plate for privacy reasons.

3.4 Data Generation for Supervised Learning

Deep neural networks (DNNs) have shown remarkable superiority in supervised learning, but they rely on large-scale annotated training data, and deep models are prone to overfitting when the training set is small. We therefore also use the SAAE model to generate training data for the task of recognizing Chinese license plates.

We evaluate the quality of the generated data by measuring recognition accuracy on the DR-PLATE dataset. Figure 7 shows that as more generated data is added to the training set, the model converges more slowly but the classification accuracy steadily improves. This result shows that our SAAE model can improve the performance of supervised learning by generating data.

Future work will focus on optimizing the network structure to achieve higher generation quality. Extending this framework to other application domains, such as semi-supervised feature learning, would also be an interesting research direction.
