Implementation and Analysis of a Parallel Retinex Algorithm
 
 
 
Wen-Bin Huang (黃紋彬)
June 2011
Fu Jen Catholic University --- Intelligent Systems Laboratory --- Embedded Vision Group
 
 
Abstract
    Retinex is an image restoration method that can restore an image’s original appearance. Among the various approaches, the center/surround Retinex algorithm is well suited to parallelization because it relies on a convolution with a large kernel to achieve dynamic range compression and color/lightness rendition. Its strong image enhancement capability, however, comes at the cost of intensive computation, with a time complexity that grows with both the image size and the surround kernel size. This paper presents GPURetinex, a data-parallel algorithm devised by parallelizing Retinex on GPGPU/CUDA. The GPURetinex algorithm exploits the GPGPU’s massively parallel architecture and hierarchical memory to improve efficiency. In addition, a computational analysis of Retinex is discussed in detail, including serial time complexity, parallel time complexity, and ideal speedup. In our experiments, a GT200 GPU and CUDA 3.2 are employed. The experimental results show that GPURetinex achieves a 74x speedup over an optimized single-threaded CPU implementation based on OpenCV for images at 4096 x 4096 resolution, with a parallel efficiency of 71.37%. The proposed method also outperforms a parallel Retinex implementation based on the NVIDIA Performance Primitives (NPP) library. Our results indicate that GPGPU/CUDA can deliver substantial acceleration and achieve real-time performance.
 
 
The GPURetinex Method
    The center/surround Retinex algorithms are an inherently parallel problem. Each spectral band independently performs the computation of center/surround information at different scales, log-domain processing, normalization, and histogram stretching. Each of these steps can be parallelized and run as kernels on the GPGPU. GPURetinex uses the heterogeneous programming model provided by CUDA, in which the serial code segments are executed on the host (CPU) and only the parallel code segments are executed on the device (GPU). The host loads an original image and transfers it from host to device. Then four steps, namely Gaussian blur, log-domain processing, normalization, and histogram stretching, are executed in parallel under the Single Program Multiple Data (SPMD) model on the GPGPU. The final step transfers the result from device back to host.
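    The sketch below illustrates this host-side flow for a single-channel float image. It is a minimal, hypothetical skeleton (the buffer names, layout, and helper function gpuRetinexHostFlow are assumptions for illustration, not the exact interface of GPURetinex); the four device steps are only marked by comments here and are sketched as individual kernels in the next section.

// Minimal host-side skeleton of the GPURetinex flow (illustrative only).
#include <cuda_runtime.h>

void gpuRetinexHostFlow(const float* hostSrc, float* hostDst,
                        int width, int height)
{
    const size_t bytes = (size_t)width * height * sizeof(float);
    float *dSrc, *dTmp, *dDst;
    cudaMalloc((void**)&dSrc, bytes);
    cudaMalloc((void**)&dTmp, bytes);
    cudaMalloc((void**)&dDst, bytes);

    // 1. Transfer the original image from host to device.
    cudaMemcpy(dSrc, hostSrc, bytes, cudaMemcpyHostToDevice);

    // 2. Execute the four data-parallel steps as kernels (SPMD):
    //    Gaussian blur -> log-domain processing -> normalization
    //    -> histogram stretching.
    //    (Launches omitted; see the kernel sketches in the next section.)

    // 3. Transfer the result from device back to host.
    cudaMemcpy(hostDst, dDst, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dSrc);
    cudaFree(dTmp);
    cudaFree(dDst);
}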

 

 
 
Overview of Memory Usage
    In the parallel Gaussian blur step, texture memory and constant memory are used to improve efficiency because the same data are read repeatedly. In the parallel log-domain processing step, the source image, the Gaussian-blurred image, and the result image all reside in global memory. After the log-domain processing step, a parallel reduction finds the maximum and minimum values of the log-domain image; threads within the grid communicate through global memory and shared memory. All threads then read the maximum and minimum values from constant memory and perform the normalization. The final parallel histogram stretching step further enhances the result for more challenging images, using global memory, shared memory, and constant memory.
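    A hedged sketch of the Gaussian blur step is shown below, assuming a separable filter applied one row pass at a time. The filter taps are placed in constant memory, which is cached and efficiently broadcast when all threads of a warp read the same coefficient; the described design additionally binds the source image to texture memory for cached 2D reads, which this sketch only notes in a comment. The kernel and symbol names are illustrative.

// Horizontal pass of a separable Gaussian blur (illustrative sketch).
// Filter coefficients reside in constant memory; in the described design the
// source image would be bound to texture memory for cached reads with
// hardware border clamping.
#define MAX_RADIUS 64
__constant__ float c_gaussKernel[2 * MAX_RADIUS + 1];

__global__ void gaussianBlurRowKernel(const float* src, float* dst,
                                      int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int xi = min(max(x + k, 0), width - 1);   // explicit clamp at borders
        sum += c_gaussKernel[k + radius] * src[y * width + xi];
    }
    dst[y * width + x] = sum;
}

    The coefficients would be uploaded once before the launch, for example with cudaMemcpyToSymbol(c_gaussKernel, hostKernel, (2 * radius + 1) * sizeof(float)).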
 
 
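    The log-domain step for a single scale can be sketched as the elementwise difference between the log of the source pixel and the log of its Gaussian-blurred surround, with all three images kept in global memory as described above. The small offset added before taking the logarithm is an assumption to keep the logarithm defined, not a value taken from the implementation.

// Single-scale log-domain processing (illustrative sketch):
//   R(x, y) = log(I(x, y)) - log((F * I)(x, y))
__global__ void logDomainKernel(const float* src, const float* blurred,
                                float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float eps = 1.0f;   // assumed offset to avoid log(0)
    out[i] = logf(src[i] + eps) - logf(blurred[i] + eps);
}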

 

 
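    The min/max search and the subsequent normalization might look like the sketch below: each block reduces its portion of the log-domain image in shared memory and writes one partial minimum and maximum to global memory, the partial results are reduced once more (by a second launch or on the host), and the final extrema are copied to constant memory with cudaMemcpyToSymbol so every thread of the normalization kernel can read them cheaply. The kernel names, the 0-255 output range, and the power-of-two block size are assumptions for illustration.

// Block-level min/max reduction in shared memory (illustrative sketch).
// Launch with sharedBytes = 2 * blockDim.x * sizeof(float).
__global__ void minMaxReduceKernel(const float* data, int n,
                                   float* blockMin, float* blockMax)
{
    extern __shared__ float sdata[];
    float* smin = sdata;                  // first blockDim.x floats: minima
    float* smax = sdata + blockDim.x;     // next blockDim.x floats: maxima

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? data[i] : data[0];   // pad out-of-range threads
    smin[tid] = v;
    smax[tid] = v;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            smin[tid] = fminf(smin[tid], smin[tid + s]);
            smax[tid] = fmaxf(smax[tid], smax[tid + s]);
        }
        __syncthreads();
    }
    if (tid == 0) {
        blockMin[blockIdx.x] = smin[0];
        blockMax[blockIdx.x] = smax[0];
    }
}

// Normalization: every thread reads the global extrema from constant memory.
__constant__ float c_minMax[2];   // filled via cudaMemcpyToSymbol

__global__ void normalizeKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float range = c_minMax[1] - c_minMax[0];
    out[i] = 255.0f * (in[i] - c_minMax[0]) / range;
}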

 

 
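    For the final histogram stretching step, a common shared-memory pattern is sketched below: each block accumulates a private 256-bin histogram with shared-memory atomics and merges it into a global histogram, after which the clipping points derived from that histogram drive an elementwise linear stretch (not shown). The 256-bin quantization of the normalized image and the grid-stride loop are assumptions for illustration; shared-memory atomics require compute capability 1.2 or higher, which the GT200 used in the experiments satisfies.

// Per-block shared-memory histogram for the stretching step (illustrative).
__global__ void histogram256Kernel(const unsigned char* img, int n,
                                   unsigned int* hist)
{
    __shared__ unsigned int sHist[256];
    int tid = threadIdx.x;

    for (int b = tid; b < 256; b += blockDim.x)   // clear the private bins
        sHist[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[img[i]], 1u);            // accumulate locally
    __syncthreads();

    for (int b = tid; b < 256; b += blockDim.x)   // merge into global memory
        atomicAdd(&hist[b], sHist[b]);
}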

 

Conclusions
    This paper presents GPURetinex, a GPU-accelerated data-parallel algorithm that parallelizes the Retinex algorithm. A parallel histogram stretching step is devised to improve color rendition and contrast. The computational complexity of Retinex is very high, especially in computing the center/surround information, so a computational analysis, covering serial time complexity, parallel time complexity, and ideal speedup, is essential for parallelization. Retinex is an inherently parallel problem, and parallelizing it with GPGPU/CUDA greatly improves its performance. However, a thorough understanding of the memory hierarchy and programming model is required, and high processor occupancy is critical for maximum performance on the GPGPU. Attention must also be paid to how data are distributed across thread blocks and allocated in the memory hierarchy. Our experimental results show that GPGPU/CUDA greatly accelerates the Retinex algorithm: GPURetinex achieves a 74x speedup over the CPU-based implementation on images at 4096 x 4096 resolution, with a parallel efficiency of 71.37%.
 
 
Demo Video
 
 
 