Hi,
I need help from someone experienced in CUDA programming. I am working on a Jetson TX2.
I have a 48 MP image (6004*7920) and a map that indicates the bad pixels. About 2% of the pixels are bad, which is around 951033 pixels.
I access the image by linear indexing. The image is just a 1D array of about 48000000 float elements (6004*7920 = 47551680), and I handle rows and columns like this:
for(int i = 0 ; i<height ; i++){
    for(int j = 0 ; j<width ; j++){
        index = i*width + j ; // row i and column j
    }
}
Thanks to the map, I know which pixels are bad. Here I print all the bad pixels:
for(int j = 0 ; j<numerBadPixel ; j++){
    cout << img[map[j]] << endl ; // value of bad pixel j
}
For each bad pixel, I apply a 3*3 median filter. Note that I am not interested in the edges and the corners of the image.
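To make the operation concrete, here is a minimal single-threaded C++ sketch of the correction described above. It is not the multithreaded CPU code mentioned in this post. The names `median9` and `correctBadPixels` are hypothetical, `std::nth_element` stands in for `quickSelect`, and reading from an unmodified copy of the image is an assumption made so that the result does not depend on the order in which bad pixels are processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of exactly nine values: partially reorder so that v[4] is the median.
// (Hypothetical helper; the post uses its own quickSelect instead.)
float median9(float v[9]) {
    std::nth_element(v, v + 4, v + 9);
    return v[4];
}

// Replace each listed bad pixel by the median of its 3x3 window.
// badIdx holds linear indices (row * width + col), like the map in the post.
// Pixels on the first/last row or column are skipped (edges and corners).
void correctBadPixels(std::vector<float>& img, const std::vector<int>& badIdx,
                      int width, int height) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    // Assumption: read from a copy so corrected values never feed other windows.
    const std::vector<float> src = img;
    for (int idx : badIdx) {
        const int row = idx / width;
        const int col = idx % width;
        if (row < 1 || row >= height - 1 || col < 1 || col >= width - 1) continue;
        float window[9];
        int k = 0;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc)
                window[k++] = src[idx + dr * width + dc];
        img[idx] = median9(window);
    }
}
```

Calling `correctBadPixels(img, badIdx, width, height)` with the bad-pixel indices modifies `img` in place. The full-image copy is only there to keep this reference simple and would probably be avoided in a performance-oriented version.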
I’ve implemented this on the CPU, and the execution time is around 80 ms (using multithreading). Now I’m trying to get better performance using the GPU. First, I wrote this kernel:
[…]
//Host side
int threadsPerBlock = 256 ;
int blocksPerGrid_bpc =(bad_pixel + threadsPerBlock - 1) / threadsPerBlock ;
badPixelCorrection<<<blocksPerGrid_bpc, threadsPerBlock>>>(d_img_mean, d_map_pixel, bad_pixel, width) ;
[…]
//device side
__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){
    int index = blockDim.x * blockIdx.x + threadIdx.x ; // one thread per bad pixel
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ; // read the map only for valid threads
        float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;
        img[img_index] = quickSelect(neighborPixel, 9) ; // median of the 3*3 window
    }
}
The execution time is around 200 ms, so the CPU is faster. I think the problem is with how memory is used, because here I only use global memory. I then tried this version, using shared memory:
__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){
    int index = blockDim.x * blockIdx.x + threadIdx.x ; // one thread per bad pixel
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ; // read the map only for valid threads
        // a single 9-element array in shared memory, common to all threads of the block
        __shared__ float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;
        // per-thread copy of the window, passed to quickSelect
        float windowMedian[9] = {neighborPixel[0], neighborPixel[1], neighborPixel[2], neighborPixel[3], neighborPixel[4], neighborPixel[5], neighborPixel[6], neighborPixel[7], neighborPixel[8]} ;
        img[pixel_map[index]] = quickSelect(windowMedian, 9) ;
    }
}
That version is even slower.
Does anyone have advice on how to speed up this process?
Many thanks !