Introduction & Problem Statement

In the modern digital era, humanity is estimated to take as many photographs every two minutes as were taken during the entire 19th century. Nevertheless, these photographs often suffer from low quality and limited dynamic range, with under- or over-exposed lighting conditions. Additionally, in professional photography the preferred output format is RAW rather than JPEG or PNG, as it retains the full dynamic information of the photograph at the cost of typically darker images that require additional processing. Consequently, image enhancement and refinement techniques have become increasingly prominent as a means of improving the visual aesthetics of photos.

Naturally, many methods have been proposed over the years to address image refinement, making considerable progress in that regard. Contemporary research follows two main approaches: the encoder-decoder structure (Chen et al. (2018), Yan et al. (2016), Yang et al. (2020), Kim et al. (2020)) and global enhancement through intensity transformations (Deng et al. (2018), Kim et al. (2020), Park et al. (2018), Hu et al. (2018), Kosugi et al. (2020), Guo et al. (2020)), shown in Figures 1a and 1b respectively. However, encoder-decoder approaches have limitations: details of the input image are not preserved, and the input is restricted to fixed sizes. Global approaches, on the other hand, do not consider all channels simultaneously and rely on pre-defined color spaces and operations, which may be insufficient for estimating arbitrary (and highly non-linear) mappings between low- and high-quality images.

In contrast, the recent work of Kim et al. (2021) addresses most of these limitations by utilizing Representative Color Transforms (RCT). The proposed method demonstrates an increased capacity for color transformations by deriving adaptive representative colors from the input image, and it is applied independently to each pixel, thus allowing the enhancement of images of arbitrary size without resizing. These advantages motivated us to reproduce their state-of-the-art architecture in the context of this project. An additional incentive was the lack of an official code implementation, which allowed us to gain hands-on experience by building our own unofficial implementation.

Figure 1: Outlines of image enhancement approaches: (a) encoder-decoder, (b) intensity transformation, and (c) representative color transform models, adapted from Kim et al. (2021).

In Kim et al. (2021) a novel image enhancement approach is introduced, namely Representative Color Transforms, yielding a large capacity for color transformations. The proposed network comprises four components: encoder, feature fusion, global RCT, and local RCT, and is depicted in Figure 1c. First, the encoder extracts high-level context information, which is in turn leveraged to determine representative colors and their transformed (RGB) counterparts for the input image. Subsequently, an attention mechanism maps each pixel color in the input image to the representative colors by computing their similarity. The last step applies the representative color transforms, using both coarse- and fine-scale features from the feature fusion component, to obtain enhanced images from the global and local RCT modules, which are combined to produce the final image.

Implementation

RCTNet consists of four main components, namely encoder, feature fusion, global RCT, and local RCT; its overall architecture is depicted in Figure 2.

Figure 2: An overview of the proposed RCTNet, adapted from Kim et al. (2021).

Encoder

In computer vision, encoders are generally used to extract high-level feature maps from an input image using convolutional neural networks. The image is passed through multiple convolutional layers of the encoder, with each consecutive layer extracting higher-level features through its increased receptive field. In the case of RCTNet, however, instead of using only the highest-level feature maps, multi-scale features are extracted from the last four layers of the encoder. The encoder comprises a stack of six conv-bn-swish blocks, where conv-bn-swish denotes a block consisting of a convolution followed by batch normalization and a swish activation layer. The convolutional layers of the first five blocks use a `3 \times 3` kernel, while the last block uses a `1 \times 1` kernel followed by a global average pooling layer.
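The following is a minimal PyTorch sketch of a conv-bn-swish block and the encoder stack described above. The channel width (32) and the use of stride-2 convolutions for downsampling are our own assumptions, as the exact values are not restated here.

```python
# Sketch of the RCTNet encoder: six conv-bn-swish blocks, multi-scale outputs.
# Channel width and strides are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConvBNSwish(nn.Module):
    """Convolution -> batch normalization -> swish (SiLU) activation."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # swish(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Encoder(nn.Module):
    """Six conv-bn-swish blocks; the last four feature maps are kept for fusion."""
    def __init__(self, channels=32):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ConvBNSwish(3, channels)]                                    # block 1, 3x3
            + [ConvBNSwish(channels, channels, stride=2) for _ in range(4)]  # blocks 2-5, 3x3
            + [ConvBNSwish(channels, channels, kernel_size=1)]            # block 6, 1x1
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling after block 6

    def forward(self, x):
        features = []
        for block in self.blocks:
            x = block(x)
            features.append(x)
        features[-1] = self.pool(features[-1])
        return features[-4:]  # multi-scale features from the last four layers
```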

Feature Fusion

The feature fusion module aggregates multi-scale feature maps, and by extension information from different contexts. More specifically, feature maps from the coarsest encoder layers exploit their larger receptive fields to encapsulate global context, while features from lower levels preserve detailed local context. RCTNet's feature fusion component is constructed from bidirectional cross-scale connections, as in Tan et al. (2020), with each single-input node in Figure 2 corresponding to one conv-bn-swish block. For nodes with multiple inputs, a feature fusion layer precedes the conv-bn-swish block, with its output defined as:

` O = \sum_{i=1}^{M} \frac{w_i}{\epsilon + \sum_{j} w_j} I_i `

where `w_i` are learnable weights for each input and `\epsilon` is a small constant for numerical stability. All nodes have 128 convolutional filters with a `3 \times 3` kernel, except for the coarsest-level nodes (red nodes), which use a `1 \times 1` kernel instead.
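Below is a sketch of this weighted fusion layer. Keeping the weights non-negative with a ReLU before normalization follows the fast normalized fusion of Tan et al. (2020); applying it here is our assumption.

```python
# Sketch of the weighted feature fusion layer from the equation above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFeatureFusion(nn.Module):
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one learnable w_i per input
        self.eps = eps

    def forward(self, inputs):
        # inputs: list of feature maps with identical shapes
        w = F.relu(self.weights)          # keep the fusion weights non-negative
        w = w / (self.eps + w.sum())      # normalize as in the equation above
        return sum(w[i] * inputs[i] for i in range(len(inputs)))
```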

Image Feature Map

An additional, independent conv-bn-swish block is applied to the input image to extract the image feature map `F \in \mathbb{R}^{H \times W \times C}`, with the feature dimension `C` set to 16.

Global RCT

The global RCT component takes as input the feature map (spatial resolution `1 \times 1`) of the feature fusion's coarsest level (last red node), utilizing its global context to determine representative color features (`R_G \in \mathbb{R}^{C \times N_G}`) and transformed colors in RGB (`T_G \in \mathbb{R}^{3 \times N_G}`) through two distinct conv-bn-swish blocks. The feature dimension `C` and the number of global representative colors `N_G` are set to 16 and 64 respectively. Each of the `N_G` column vectors `t_i` of `T_G` corresponds to the transformed RGB values of the `i^{th}` representative color.
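The sketch below illustrates how two conv-bn-swish heads could map the `1 \times 1` coarsest fusion feature to `R_G` and `T_G`. The input channel count (128) follows the fusion nodes above; the use of `1 \times 1` convolutions and the reshaping are our own assumptions.

```python
# Sketch of the two heads producing R_G and T_G from the 1x1 coarsest feature.
import torch
import torch.nn as nn

class GlobalRCTHeads(nn.Module):
    def __init__(self, in_ch=128, feat_dim=16, num_colors=64):  # C = 16, N_G = 64
        super().__init__()
        # two distinct conv-bn-swish style heads operating on a 1x1 feature map
        self.to_R = nn.Sequential(nn.Conv2d(in_ch, feat_dim * num_colors, 1),
                                  nn.BatchNorm2d(feat_dim * num_colors), nn.SiLU())
        self.to_T = nn.Sequential(nn.Conv2d(in_ch, 3 * num_colors, 1),
                                  nn.BatchNorm2d(3 * num_colors), nn.SiLU())
        self.feat_dim, self.num_colors = feat_dim, num_colors

    def forward(self, z):                                                 # z: (B, in_ch, 1, 1)
        R_G = self.to_R(z).view(-1, self.feat_dim, self.num_colors)       # (B, C, N_G)
        T_G = self.to_T(z).view(-1, 3, self.num_colors)                   # (B, 3, N_G)
        return R_G, T_G
```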

The next step applies the RCT transform, which takes as inputs the reshaped image features `F_r \in \mathbb{R}^{HW \times C}`, the representative color features `R_G`, and the transformed colors `T_G`, and produces an enhanced image `Y_G`. Since `T_G` contains only `N_G` representative colors, the first step of RCT maps the pixel colors of the original image to the representative colors, which requires computing the similarity between pixel and representative colors. This similarity is computed with scaled dot-product attention:

` A = \mathrm{softmax}\left(\frac{F_r R_G}{\sqrt{C}}\right) \in \mathbb{R}^{HW \times N_G} `

where each attention weight `a_{ij}` corresponds to the similarity between the `j^{th}` representative color and the `i^{th}` pixel. Subsequently, the enhanced image `Y_G` is produced as:

` Y_G = A T_G^T `

i.e., for the `i^{th}` pixel, the attention weights form a weighted sum of the transformed representative colors, which determines the pixel's enhanced RGB values.
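A minimal sketch of the RCT transform as described by the two equations above, for a single image (batch handling omitted):

```python
# Sketch of the RCT transform: attention between pixel features and representative
# color features, followed by a weighted sum of the transformed colors.
import math
import torch

def rct_transform(F_r, R, T):
    """
    F_r: (HW, C)  reshaped image features
    R:   (C, N)   representative color features
    T:   (3, N)   transformed RGB values of the representative colors
    Returns Y: (HW, 3) enhanced pixel colors.
    """
    C = F_r.shape[1]
    logits = F_r @ R / math.sqrt(C)       # (HW, N) scaled dot-product similarity
    A = torch.softmax(logits, dim=-1)     # attention weights over representative colors
    Y = A @ T.T                           # (HW, 3) weighted sum of transformed colors
    return Y
```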

Local RCT

The local RCT component takes as input the feature map (spatial resolution `32 \times 32`) of the feature fusion's finest level (last blue node), this time utilizing local context to determine representative color features (`R_L \in \mathbb{R}^{32 \times 32 \times C \times N_L}`) and transformed colors in RGB (`T_L \in \mathbb{R}^{32 \times 32 \times 3 \times N_L}`) through two distinct conv-bn-swish blocks. The feature dimension `C` and the number of local representative colors `N_L` are both set to 16.

Subsequently, the local RCT module takes `R_L` and `T_L` as inputs and produces different sets of representative features and transformed colors for different areas of the input image. To achieve this, a `31 \times 31` uniform mesh grid is placed on the input image, producing `32 \times 32` corner points (each corresponding to one of the `32 \times 32` spatial positions of `R_L` and `T_L`), as shown in Figure 3 for a `5 \times 5` mesh grid example. Each grid cell `B_k` is associated with four corner points, and thus four sets of representative features and transformed colors, which are concatenated to produce `R_k` and `T_k`. A grid image feature `F_k` is also extracted from `F` (described in the Image Feature Map section) by cropping the corresponding grid region. Finally, `F_k`, `R_k`, and `T_k` are fed to the RCT transform described in the Global RCT section to yield the locally enhanced region `Y_k`. This process is repeated for all grid positions to produce the final enhanced image `Y_L`, as sketched after Figure 3.

Figure 3: An illustration of local RCT, adapted from Kim et al. (2021).
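The following sketch applies local RCT under the `31 \times 31` grid described above, reusing `rct_transform` from the previous snippet. The integer cell boundaries and the zero-initialized output buffer are implementation assumptions.

```python
# Sketch of the local RCT application over a 31x31 grid (32x32 corner points).
import torch

def local_rct(F, R_L, T_L, grid=31):
    """
    F:   (H, W, C)         image feature map
    R_L: (32, 32, C, N_L)  per-corner representative color features
    T_L: (32, 32, 3, N_L)  per-corner transformed colors
    Returns Y_L: (H, W, 3) locally enhanced image.
    """
    H, W, C = F.shape
    ys = torch.linspace(0, H, grid + 1).long().tolist()   # 32 row boundaries
    xs = torch.linspace(0, W, grid + 1).long().tolist()   # 32 column boundaries
    Y_L = torch.zeros(H, W, 3)
    for i in range(grid):
        for j in range(grid):
            # four corner points surrounding grid cell B_k = (i, j)
            corners = [(i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1)]
            R_k = torch.cat([R_L[a, b] for a, b in corners], dim=-1)   # (C, 4*N_L)
            T_k = torch.cat([T_L[a, b] for a, b in corners], dim=-1)   # (3, 4*N_L)
            F_k = F[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].reshape(-1, C)   # cropped grid feature
            Y_k = rct_transform(F_k, R_k, T_k)                         # enhanced cell pixels
            Y_L[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = Y_k.view(
                ys[i + 1] - ys[i], xs[j + 1] - xs[j], 3)
    return Y_L
```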

Global-Local RCT Fusion

Finally, the enhanced images obtained from the global `Y_G` and local `Y_L` RCT components are combined to produce the final enhanced image `\tilde{Y}` as:

` \tilde{Y} = \alpha Y_G + \beta Y_L`

where `\alpha` and `\beta` are non-negative learnable weights.
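A minimal sketch of this fusion with learnable weights; clamping `\alpha` and `\beta` to non-negative values with a ReLU is our own assumption.

```python
# Sketch of the global-local fusion with learnable non-negative weights.
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # weight for the global output
        self.beta = nn.Parameter(torch.tensor(0.5))    # weight for the local output

    def forward(self, Y_G, Y_L):
        return torch.relu(self.alpha) * Y_G + torch.relu(self.beta) * Y_L
```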

Loss Function

The loss function comprises two terms. The first term is the mean absolute error (L1 loss) between the predicted and ground-truth enhanced images. The second term is the sum of the L1 losses between the feature representations of the predicted and ground-truth images, extracted from the `2^{nd}`, `4^{th}`, and `6^{th}` layers of a VGG-16 network [Simonyan et al. (2014)] pretrained on ImageNet [Russakovsky et al. (2015)]. Given the high-quality image prediction `\tilde{Y}` and the ground-truth high-quality image `Y`, the loss function is:

` \mathcal{L} = || \tilde{Y} - Y ||_1 + \lambda \sum_{k=2,4,6} || \phi^k(\tilde{Y}) - \phi^k(Y) ||_1`

where the hyperparameter `\lambda` was set to 0.04 to balance the two terms.
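A sketch of this loss, assuming a recent torchvision with a pretrained VGG-16. The `2^{nd}`, `4^{th}`, and `6^{th}` convolutional layers of `vgg16.features` sit at indices 2, 7, and 12; interpreting "layer" as "convolutional layer" is our assumption.

```python
# Sketch of the combined L1 + VGG-16 perceptual loss with lambda = 0.04.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class RCTLoss(nn.Module):
    def __init__(self, lam=0.04):
        super().__init__()
        vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad = False            # the VGG feature extractor stays frozen
        self.vgg = vgg
        self.taps = [2, 7, 12]                 # assumed indices of the 2nd, 4th, 6th conv layers
        self.lam = lam
        self.l1 = nn.L1Loss()

    def _features(self, x):
        # inputs are assumed to already be in the range VGG expects; normalization omitted
        feats, h = [], x
        for i, layer in enumerate(self.vgg):
            h = layer(h)
            if i in self.taps:
                feats.append(h)
        return feats

    def forward(self, pred, target):
        loss = self.l1(pred, target)                       # pixel-wise L1 term
        for fp, ft in zip(self._features(pred), self._features(target)):
            loss = loss + self.lam * self.l1(fp, ft)       # perceptual terms
        return loss
```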

Experiments

Dataset

The LOw-Light (LOL) dataset [Wei et al. (2018)] for image enhancement in low-light scenarios was used for our experiments. It consists of a training partition of 485 low-/normal-light image pairs and a test partition of 15 such pairs. All images have a resolution of `400 \times 600`. During training, all images were randomly cropped and rotated by a random multiple of 90 degrees.
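A sketch of this augmentation for a LOL image pair; the crop size of 256 is an assumption, and the same crop and rotation are applied to both the low- and normal-light image so the pair stays aligned.

```python
# Sketch of the training augmentation: shared random crop + 90-degree rotation.
import random
import torch
import torchvision.transforms.functional as TF

def augment_pair(low, high, crop=256):
    """Apply the same random crop and 90-degree rotation to a (low, high) LOL pair."""
    _, H, W = low.shape                            # tensors of shape (3, H, W)
    top = random.randint(0, H - crop)
    left = random.randint(0, W - crop)
    low = TF.crop(low, top, left, crop, crop)
    high = TF.crop(high, top, left, crop, crop)
    k = random.randint(0, 3)                       # rotate by a random multiple of 90 degrees
    low = torch.rot90(low, k, dims=(1, 2))
    high = torch.rot90(high, k, dims=(1, 2))
    return low, high
```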

Evaluation Metrics

The perceived enhancement of an image can be subjective. Therefore, it is important to establish metrics that allow different image enhancement algorithms to be compared on the quality of the images they produce. For the quantitative evaluation of RCTNet we used two evaluation metrics that are well established for assessing image enhancement models, namely peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).

PSNR is the ratio between the maximum possible power of a signal and the power of the noise distorting it, expressed on a logarithmic decibel scale. In the image domain, it compares the ground-truth enhanced image `Y` with the enhanced image prediction `\tilde{Y}` produced by the network as:

` PSNR = 20\log_{10}\left(\frac{\max(Y)}{\sqrt{MSE(Y,\tilde{Y})}}\right) `

where MSE is the mean squared error between the ground-truth and predicted images, averaged over the color channels. Higher PSNR values therefore correspond to a better reconstruction of the degraded image. Nevertheless, PSNR is limited in that it relies solely on numerical pixel-value comparisons, disregarding properties of the human visual system, which brings us to SSIM.
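A direct implementation of the formula above, as a sketch:

```python
# Sketch of PSNR following the formula above.
import torch

def psnr(y, y_pred):
    """PSNR (in dB) between a ground-truth image y and a prediction y_pred of the same shape."""
    mse = torch.mean((y - y_pred) ** 2)   # averaged over all pixels and channels
    peak = y.max()                        # in practice, the maximum possible value (e.g. 1.0 or 255)
    return 20 * torch.log10(peak / torch.sqrt(mse))
```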

SSIM, introduced by Wang et al. (2004), attempts to replicate the behaviour of the human visual system, which is highly capable of identifying structural information in a scene, and by extension differences between the predicted and ground-truth enhanced versions of an image. Its value ranges from `-1` to `1`, where `1` corresponds to identical images. SSIM extracts three key features from an image, namely luminance, contrast, and structure, applies a comparison function to each of these features, and finally combines the three comparisons into a single score.
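Rather than re-deriving the luminance, contrast, and structure comparisons, both metrics can be computed with scikit-image; a minimal sketch (the `channel_axis` argument assumes scikit-image 0.19 or newer):

```python
# Sketch of PSNR/SSIM evaluation with scikit-image.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(y, y_pred):
    # y, y_pred: float arrays of shape (H, W, 3) with values in [0, 1]
    return {
        "PSNR": peak_signal_noise_ratio(y, y_pred, data_range=1.0),
        "SSIM": structural_similarity(y, y_pred, data_range=1.0, channel_axis=-1),
    }
```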

Quantitative Evaluation

The results of our RCTNet implementation, in terms of the PSNR and SSIM evaluation metrics, are presented in Table 1, along with the results of competing image enhancement methods and of the official implementation of RCTNet, as reported in Kim et al. (2021). Evidently, our results fall short of those reported for the official implementation on both metrics.

Table 1: Quantitative comparison on the LoL dataset [Wei et al. (2018)]. The best results are boldfaced and the second best ones are underlined. Our results correspond to the mean value over 100 random-seed executions (*).
Method PSNR SSIM
NPE [Wang et al. (2013)] 16.97 0.589
LIME [Guo et al. (2016)] 15.24 0.470
SRIE [Fu et al. (2016)] 17.34 0.686
RRM [Li et al. (2016)] 17.34 0.686
SICE [Cai et al. (2018)] 19.40 0.690
DRD [Wei et al. (2018)] 16.77 0.559
KinD [Zhang et al. (2019)] 20.87 0.802
DRBN [Yang et al. (2020)] 20.13 0.830
ZeroDCE [Guo et al. (2020)] 14.86 0.559
EnlightenGAN [Jiang et al. (2021)] 15.34 0.528
RCTNet [Kim et al. (2021)] 22.67 0.788
RCTNet (ours)* 19.96 0.768
RCTNet + BF [Kim et al. (2021)] 22.81 0.827

Interestingly, the results of Table 1 deviate significantly if the augmentations proposed by the authors (random cropping and random rotation by a multiple of 90 degrees) are also applied during evaluation. This finding indicates that the model favours augmented images, since during training we applied the augmentation operations to all input images in every epoch. While the authors mention the same augmentations, they do not specify how frequently they were applied. The effect becomes more evident in the quantitative results obtained when the test images are augmented, shown in Table 2. Furthermore, the inherent randomness of the augmentation operations leads to a high variance in both metrics, and thus to a less robust model. To account for this variance, we repeated the evaluation with 100 randomly selected seeds. Table 2 reports the mean, standard deviation, maximum, and minimum values of both evaluation metrics when augmentations are included in the test set. Additionally, Figures 4a and 4b plot the density distributions of PSNR and SSIM, respectively, illustrating the observed high variance of both metrics.

Table 2: Mean, standard deviation, maximum, and minimum values for PSNR and SSIM, for 100 executions with different random seeds, when augmentations are also included in the test set.
Evaluation Metric Mean Standard Deviation Max Min
PSNR 20.522 0.594 22.003 18.973
SSIM 0.816 0.009 0.839 0.787
Figure 4: Density distributions of the measured values for (a) PSNR and (b) SSIM after 100 executions with different random seeds, when augmentations are also included in the test set.

Qualitative Evaluation

Table 3 shows some image enhancement results of our RCTNet implementation, compared to the low-light input images and the ground-truth normal-light images. From these examples it is evident that RCTNet has successfully learned to enhance low-light images, achieving results comparable to the ground truth in terms of exposure and color tones. Nevertheless, the produced images are slightly less saturated and noise is more prominent. We conjecture that training the network for more epochs could alleviate some of these limitations. It is also observed that RCTNet fails to extract certain representative colors that appear only in small regions of the input image (e.g., the green color in the `4^{th}` image).

Table 3: Qualitative comparison on the LoL dataset for an RCTNet trained for 500 epochs.
Input RCTNet Ground-Truth

Conclusions

In conclusion, our analysis did not show results comparable to those presented in the original paper. The qualitative evaluation on the LOL dataset demonstrated our implementation's ability to learn to enhance low-light images with color tones matching those of the ground-truth enhanced images. The observed differences in color saturation could possibly be addressed by tuning certain hyperparameters of the model or by training for more epochs. Regarding our quantitative findings, the measured values of both PSNR and SSIM were lower for our implementation than those reported for the original implementation. These discrepancies could be attributed to the frequency with which image augmentations were applied during training in our implementation.

References

[1] Kim, H., Choi, S. M., Kim, C. S., & Koh, Y. J. (2021). Representative Color Transform for Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4459-4468).

[2] Chen, Y. S., Wang, Y. C., Kao, M. H., & Chuang, Y. Y. (2018). Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6306-6314).

[3] Yan, Z., Zhang, H., Wang, B., Paris, S., & Yu, Y. (2016). Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG), 35(2), 1-15.

[4] Yang, W., Wang, S., Fang, Y., Wang, Y., & Liu, J. (2020). From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3063-3072)

[5] Kim, H. U., Koh, Y. J., & Kim, C. S. (2020, August). PieNet: Personalized image enhancement network. In European Conference on Computer Vision (pp. 374-390). Springer, Cham.

[6] Deng, Y., Loy, C. C., & Tang, X. (2018, October). Aesthetic-driven image enhancement by adversarial learning. In Proceedings of the 26th ACM international conference on Multimedia (pp. 870-878).

[7] Kim, H. U., Koh, Y. J., & Kim, C. S. (2020, August). Global and local enhancement networks for paired and unpaired image enhancement. In European Conference on Computer Vision (pp. 339-354). Springer, Cham.

[8] Park, J., Lee, J. Y., Yoo, D., & Kweon, I. S. (2018). Distort-and-recover: Color enhancement using deep reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5928-5936).

[9] Hu, Y., He, H., Xu, C., Wang, B., & Lin, S. (2018). Exposure: A white-box photo post-processing framework. ACM Transactions on Graphics (TOG), 37(2), 1-17.

[10] Kosugi, S., & Yamasaki, T. (2020, April). Unpaired image enhancement featuring reinforcement-learning-controlled image editing software. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 11296-11303).

[11] Guo, C., Li, C., Guo, J., Loy, C. C., Hou, J., Kwong, S., & Cong, R. (2020). Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1780-1789).

[12] Tan, M., Pang, R., & Le, Q. V. (2020). Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10781-10790).

[13] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[14] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3), 211-252.

[15] Wei, C., Wang, W., Yang, W., & Liu, J. (2018). Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560.

[16] Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4), 600-612.

[17] Wang, S., Zheng, J., Hu, H. M., & Li, B. (2013). Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE transactions on image processing, 22(9), 3538-3548.

[18] Guo, X., Li, Y., & Ling, H. (2016). LIME: Low-light image enhancement via illumination map estimation. IEEE Transactions on image processing, 26(2), 982-993.

[19] Fu, X., Zeng, D., Huang, Y., Zhang, X. P., & Ding, X. (2016). A weighted variational model for simultaneous reflectance and illumination estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2782-2790).

[20] Li, C. Y., Guo, J. C., Cong, R. M., Pang, Y. W., & Wang, B. (2016). Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE Transactions on Image Processing, 25(12), 5664-5677.

[21] Cai, J., Gu, S., & Zhang, L. (2018). Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing, 27(4), 2049-2062.

[22] Zhang, Y., Zhang, J., & Guo, X. (2019, October). Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM international conference on multimedia (pp. 1632-1640).

[23] Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., ... & Wang, Z. (2021). Enlightengan: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30, 2340-2349.