A CUDA-enabled parallel algorithm for accelerating retinex

      Yuan-Kai Wang , Wen-Bin Huang

Department of Electrical Engineering Fu Jen Catholic University

        Retinex is an image restoration approach used to restore the original appearance of an image. Among various methods, the center/surround retinex algorithm is well suited to parallelization because it relies on convolution operations with large kernel sizes to achieve dynamic range compression and color/lightness rendition. This paper presents GPURetinex, a data-parallel algorithm that accelerates a modified center/surround retinex with GPGPU/CUDA. GPURetinex exploits the massively parallel threading and heterogeneous memory hierarchy of a GPGPU to improve efficiency. Two challenging problems, irregular memory access and block-size selection for data partitioning, are analyzed mathematically. The proposed mathematical models help choose memory spaces and block sizes for maximal parallelization performance. The analyses are applied to three classes of operations in the retinex problem: block-wise, pixel-wise, and serial operations. Experimental results obtained on a GT200 GPU with CUDA 3.2 show that GPURetinex attains a 74-fold speedup over an SSE-optimized single-threaded implementation on a Core 2 Duo for images of 4096 x 4096 resolution. The proposed method also outperforms a parallel retinex implemented with the NVIDIA Performance Primitives library. Our experimental results indicate that careful design of memory access and multithreading patterns on CUDA devices can achieve substantial acceleration for real-time image restoration.

 Experimental Results
        An illustration of the algorithm is shown in Fig. 1. The example is a low-key image of an indoor scene with non-uniform illumination at night; it contains a high dynamic range and a considerably dark region. A person standing in the dark region is difficult to perceive. Figs. 1(c) and 1(e) show the enhanced images produced by MSRCR and by the modified MSRCR with alpha-trimmed contrast stretching, respectively. The person in the dark region becomes visible, and the person's details are discernible in Fig. 1(e). 

Fig. 1. Image enhancement using the two algorithms with Wz = 1/3, 1/3, 1/3; vz = 14, 69, 210; and a = 0.015 for the image size of 4096 x 4096. (a)(b) Source image and its histogram; (c)(d) MSRCR result and its histogram; (e)(f) GPURetinex result and its histogram; (g) transfer function of the alpha-trimmed contrast stretching.  

      The experimental results in Fig. 2 show that the most favorable memory spaces for I and r in row filtering are the texture and constant caches, which matches the mathematical analysis presented in Table 2. The worst choice is global memory for both I and r; the best configuration is 5.45 times faster than the worst. This means that careful selection of memory spaces alone can yield more than a fivefold improvement on the same GPGPU platform. The configurations that place I in texture memory outperform those that place I in global memory. This result is reasonable because source image pixels are reused temporally, and texture memory provides the most effective support for locality of access. Storing both I and r in texture memory would be a simple scheme with low execution time, but the experiment showed that execution time improves further when the convolution coefficients r are stored in constant memory. This is mainly because the image stored in texture memory is heavily accessed; offloading the reads of the convolution coefficients to constant memory relieves a bottleneck of the system.


Fig. 2. Execution times of eight combinations of memory selections for row filtering.


      Fig. 3 shows the acceleration of GPURetinex over CPURetinex. Acceleration is measured from the total execution time, excluding the cost of memory management and data transfer between the GPU and CPU. The acceleration grows roughly linearly with image dimensions, indicating that the GPU implementation is memory bound rather than compute bound. For an image of 4096 x 4096 resolution, GPURetinex achieves a 74x speedup over CPURetinex and is twice as fast as GPURetinex_NPP. The experimental results demonstrate an excellent speed boost.

Fig. 3. The accelerations of the GPU algorithms over CPU versions.


      We further investigate the performance of GPURetinex in terms of Gflops. The Gflops of GPURetinex is defined as (O / T) x 10^-9, where T is the total execution time of GPURetinex in seconds and O is the number of floating-point operations in the modified retinex algorithm. As shown in Fig. 4, the performance converges to 95 Gflops as image sizes increase. This is about one tenth of the theoretical peak Gflops of the GPGPU device, because of the parallel overhead in the convolution stage. Furthermore, the access ratio and bandwidth of global memory also affect the achieved Gflops. We next illustrate the execution time of each operation in the modified retinex algorithm. The execution times of the four steps, namely Gaussian blur, log-domain processing, histogramming, and stretching, are compared across image sizes.

Fig. 4. The Gflops of GPURetinex.

      We first show the execution times for CPURetinex in Fig. 5(a). The execution times of all steps increase in proportion to image size, and the Gaussian blur always consumes the most processing time. The computational load is 80.1% for Gaussian blur, 18.4% for log-domain processing, and 1.5% for histogramming and stretching. Fig. 5(b) shows the execution times for GPURetinex. The Gaussian blur still consumes the most processing time. For small images the execution times do not increase proportionally, possibly because the transfer of small amounts of image data relative to the high bandwidth of global memory exhibits a nonlinear effect.


Fig. 5. The execution times of various steps for (a) CPURetinex, (b) GPURetinex.
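Of the four steps compared above, the log-domain processing admits a particularly compact sketch, assuming the single-scale center/surround form r = log(I) - log(G*I); the paper's modified MSRCR additionally weights multiple scales and applies alpha-trimmed contrast stretching.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Log-domain step of a center/surround retinex: the single-scale response
// is r = log(I + eps) - log(blur + eps), where blur is the Gaussian-
// filtered image produced by the previous step and eps guards against
// log(0).  (Illustrative sketch, not the paper's exact formulation.)
std::vector<float> logDomain(const std::vector<float>& I,
                             const std::vector<float>& blur,
                             float eps = 1e-6f) {
    std::vector<float> r(I.size());
    for (std::size_t i = 0; i < I.size(); ++i)
        r[i] = std::log(I[i] + eps) - std::log(blur[i] + eps);
    return r;
}
```

Because every pixel is independent, this step maps onto the GPU as a trivially pixel-wise kernel, which is why it contributes far less to the runtime than the convolution.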


[1] Wang, Y. K., Huang, W. B.: Acceleration of an improved Retinex algorithm. IS&T/SPIE Electronic Imaging, Parallel Processing for Image Applications, Proc. SPIE 7872, California, USA (2011)
[2] Wang, Y. K., Huang, W. B.: Acceleration of the Retinex algorithm for image restoration by GPGPU/CUDA. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 72-77, Colorado, USA (2011)

[3] Wang, Y. K., Huang, W. B.: A CUDA-enabled parallel algorithm for accelerating retinex. Journal of Real-Time Image Processing (2012)


[4] Jang, B., Schaa, D., Mistry, P., Kaeli, D.: Exploiting memory access patterns to improve memory performance in data-parallel architectures. IEEE Trans. Parallel and Distributed Systems 22(1), 105-118 (2011)