◎ JADH2016

Sep 12-14, 2016 The University of Tokyo

Comparisons of Different Configurations for Image Colorization of Cultural Images Using a Pre-trained Convolutional Neural Network
Tung Nguyen, Ruck Thawonmas, Keiko Suzuki, Masaaki Kidachi (Ritsumeikan University)

This paper describes the colorization of cultural images, such as ukiyo-e, in which colors are added to grayscale images to make them more aesthetically appealing, culturally meaningful, or even inspiring. The importance of this task is illustrated by the relatively large share of grayscale images in the archive portal of the Art Research Center (ARC), Ritsumeikan University: for example, 1600 of the 4588 publicly accessible images of the type Yakusha-e (actor painting) are grayscale.

In this work, we followed the approach of Gatys et al. [1], which uses a pre-trained convolutional neural network (CNN) called VGG-19 [2] to transfer the style of one image to another image while maintaining the content of the latter. In particular, using ukiyo-e images from the aforementioned archive, we investigated a number of configurations for selecting VGG-19’s layers, weighting the style loss against the content loss, and optimizing the parameters. We also discuss the results to provide insights for future work.


The content of a grayscale image is combined with the style of a color image, which colorizes the grayscale image. For a layer \(l\) in the network, we denote the number of feature maps and the size of each feature map in that layer by \(N_l\) and \(M_l\), respectively. The content loss is then calculated by \begin{align*} L_{content}(p,x)=\frac{1}{N_lM_l}\sum_{i,j}\left(P_{ij}^{l}-F_{ij}^{l}\right)^2, \end{align*} where \(P^l\in\mathbb{R}^{N_l\times M_l}\) and \(F^l\in\mathbb{R}^{N_l\times M_l}\) are the content representations, i.e., the features, of the content image \(p\) and the output image \(x\), respectively.

The style representation at layer \(l\), on the other hand, is given by the Gram matrix \(G^l\in\mathbb{R}^{N_l\times N_l}\): \begin{align*} G^l=F^l(F^l)^T, \end{align*} and the style loss at layer \(l\) is calculated by \begin{align*} E_l=\frac{1}{N_l^2}\sum_{i,j}\left(\frac{A_{ij}^l}{M_l}-\frac{G_{ij}^l}{M_l}\right)^2, \end{align*} where \(A^l\) and \(G^l\) are the style representations of the style image \(a\) and the output image \(x\), respectively. The style loss function is then defined over the style losses at multiple layers: \begin{align*} L_{style}(a,x)=\sum_{l}w_lE_l, \end{align*} where \(w_l\) is the weighting factor of \(E_l\) and is equal to one divided by the number of layers.

In addition, to smooth the output image, we use the total variation regularizer given below: \begin{align*} L_{tv}(x)=\sum_{i,j}\left(\left(x_{i+1,j}-x_{i,j}\right)^2+\left(x_{i,j+1}-x_{i,j}\right)^2\right). \end{align*} Finally, the total loss is calculated as the weighted sum of the aforementioned losses: \begin{align*} L(p,a,x)=\alpha L_{content}(p,x)+\beta L_{style}(a,x)+\gamma L_{tv}(x). \end{align*}
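To make the loss terms concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. Feature matrices are nested lists with one row per feature map (\(N_l\) rows of length \(M_l\)) standing in for VGG-19 activations, and all function names are hypothetical. The style term squares the difference between the normalized Gram entries and the smoothness term adds the squared vertical and horizontal differences, following the usual forms of these losses. The sketch also assumes that the style and output features have the same \(M_l\), that the image is a single-channel 2-D array, and that border pixels without a forward neighbor are skipped; the text does not specify these details.

```python
def content_loss(P, F):
    """Mean squared difference between two N_l x M_l feature matrices."""
    n, m = len(P), len(P[0])
    total = sum((P[i][j] - F[i][j]) ** 2 for i in range(n) for j in range(m))
    return total / (n * m)


def gram(F):
    """Gram matrix G = F F^T, of size N_l x N_l."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in F]
            for row_i in F]


def layer_style_loss(Fa, Fx):
    """E_l computed from the features of the style image (Fa) and output (Fx)."""
    A, G = gram(Fa), gram(Fx)
    n, m = len(Fx), len(Fx[0])  # assumption: Fa and Fx share the same M_l
    total = sum((A[i][j] / m - G[i][j] / m) ** 2
                for i in range(n) for j in range(n))
    return total / n ** 2


def style_loss(style_feats, output_feats):
    """Sum of E_l over layers with w_l = 1 / (number of layers)."""
    w = 1.0 / len(style_feats)
    return sum(w * layer_style_loss(Fa, Fx)
               for Fa, Fx in zip(style_feats, output_feats))


def tv_loss(x):
    """Total variation of a 2-D image using squared forward differences."""
    h, w = len(x), len(x[0])
    total = 0.0
    # Assumption: the last row and column are skipped (no forward neighbor).
    for i in range(h - 1):
        for j in range(w - 1):
            total += (x[i + 1][j] - x[i][j]) ** 2 + (x[i][j + 1] - x[i][j]) ** 2
    return total


def total_loss(l_content, l_style, l_tv, alpha, beta, gamma):
    """Weighted sum of the three loss terms."""
    return alpha * l_content + beta * l_style + gamma * l_tv
```

In an actual run, these quantities would be computed on network activations and differentiated with respect to the output image \(x\); the sketch only evaluates the losses.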


We conducted various experiments to compare different configurations. Two layer settings by Gatys et al. [1, 3] and one by Yin [4] were considered. We also compared stochastic gradient descent (SGD) [1] with L-BFGS [3] as the optimization algorithm for finding a minimum of \(L\). Moreover, we investigated the effect of decreasing \(\frac{\beta}{\alpha}\) by 0.25% after each iteration [5].
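The decay of \(\frac{\beta}{\alpha}\) can be illustrated with a short sketch. It assumes that "decreasing by 0.25%" means multiplying the current ratio by \(1-0.0025\) after every iteration (a compounding decay); a reduction by a fixed 0.25% of the initial value would be an alternative reading. The function name, the initial ratio, and the number of iterations are hypothetical.

```python
def ratio_schedule(initial_ratio, num_iterations, decay=0.0025):
    """Return the beta/alpha ratio used at each iteration.

    Assumption: the ratio shrinks by `decay` (0.25%) of its current
    value after every iteration.
    """
    ratios = []
    ratio = initial_ratio
    for _ in range(num_iterations):
        ratios.append(ratio)
        ratio *= 1.0 - decay
    return ratios


# Hypothetical starting ratio and iteration count, for illustration only.
schedule = ratio_schedule(1000.0, 3)
```

Under this reading, the ratio after \(k\) iterations is the initial ratio times \(0.9975^k\), so the style term gradually loses weight relative to the content term as the optimization proceeds.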

Table 1. Description of configuration names.

Figure 1. Relative error of each configuration at the last iteration.

Figure 2. Sample images generated with different configurations: config3_1_4 (the smallest relative content error), config2_3_1 (the smallest relative style error), config3_3_1 (the smallest relative error), config3_3_4 (the visual best).

Combining the aforementioned layer settings, optimization methods, and different values of \(\frac{\beta}{\alpha}\) leads to 36 different configurations in total. We use the format configX_Y_Z for naming each configuration; the value and meaning of each index X, Y, Z are provided in Table 1. For each configuration, we performed colorization 100 times, pairing each of 10 grayscale content images with each of 10 color style images. \(\gamma\) was set to 0.001, and the output image was initialized with the content image, as done in [5].

Because the range of the total loss varies considerably depending on the set of layers, we instead used the relative error defined below as a metric to compare different configurations: \begin{align*} Err(p,a,x)=Err_{content}(p,x)+Err_{style}(a,x), \end{align*} where \(Err_{content}\) and \(Err_{style}\) are the relative content error and the relative style error, respectively: \begin{align*} Err_{content}(p,x)=\sum_{i,j}\frac{\left|P_{ij}^l-F_{ij}^l\right|}{\left(\left|P_{ij}^l\right|+\left|F_{ij}^l\right|\right)/2}, \end{align*} \begin{align*} Err_{style}(a,x)=\sum_lw_l\sum_{i,j}\frac{\left|A_{ij}^l-G_{ij}^l\right|}{\left(\left|A_{ij}^l\right|+\left|G_{ij}^l\right|\right)/2}. \end{align*}
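The relative error can be sketched in a few lines of plain Python. Each term divides an absolute difference by the mean magnitude of the two entries, so it lies between 0 and 2 regardless of the scale of the features. The function names are hypothetical, matrices are nested lists, and entries where both values are zero are skipped, which is an assumption because the definition leaves that case open.

```python
def relative_error(U, V, eps=1e-12):
    """Sum over entries of |u - v| / ((|u| + |v|) / 2)."""
    total = 0.0
    for row_u, row_v in zip(U, V):
        for u, v in zip(row_u, row_v):
            denom = (abs(u) + abs(v)) / 2
            if denom > eps:  # assumption: skip entries where both are zero
                total += abs(u - v) / denom
    return total


def relative_style_error(style_grams, output_grams):
    """Layer-weighted relative error with w_l = 1 / (number of layers)."""
    w = 1.0 / len(style_grams)
    return sum(w * relative_error(A, G)
               for A, G in zip(style_grams, output_grams))


def total_relative_error(P, F, style_grams, output_grams):
    """Err = Err_content + Err_style."""
    return relative_error(P, F) + relative_style_error(style_grams, output_grams)
```

For example, the entries 1 and 3 give a term of \(|1-3|/((1+3)/2)=1\), while an entry that is zero in one matrix and non-zero in the other gives the maximum value of 2.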

Figure 1 shows the relative error of each configuration at the last iteration of the colorization process. The best configurations in terms of relative content error, relative style error, and relative error are config3_1_4, config2_3_1, and config3_3_1, respectively. However, because maintaining the original content is important from a cultural viewpoint, we visually compared the images generated by the best five configurations in terms of relative content error and finally selected config3_3_4 as the visual best.

Figure 2 shows results for three typical content-style pairs. The outputs of both config2_3_1 and config3_3_1 are colorful, but texture information in the content image, in particular kimono patterns and Japanese letters, is distorted. Visual comparisons reveal that, while not as colorful as config2_3_1 or config3_3_1, config3_3_4 best preserves the original content while retaining the color essence of the style image. As this kind of colorization is not only aesthetically appealing but also culturally meaningful, it may inspire the CNN user's artistic creativity.


In this paper, we compared a number of configurations for colorizing grayscale images by utilizing a pre-trained CNN, VGG-19. Our finding is that config3_3_4 leads to the best visual results regarding both original content preservation and color essence introduction. This work was based on existing work using VGG-19 for style transfer, not specifically for color transfer. In the future, we plan to develop loss functions and other settings that can ignore texture information and transfer only color information from the style image while preserving the content in the original image.


[1] Gatys, L.A., Ecker, A.S. and Bethge, M., 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.

[2] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. Presented at International Conference on Learning Representations 2015.

[3] Gatys, L., Ecker, A.S. and Bethge, M., 2015. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 262-270).

[4] Yin, R., 2016. Content aware neural style transfer. arXiv preprint arXiv:1601.04568.

[5] Nguyen, T., Mori, K., and Thawonmas, R., 2016. Image Colorization Using a Deep Convolutional Neural Network. In Proc. of ASIAGRAPH 2016, Toyama, Japan (pp. 49-50).