CUDA programming

Hi,

I need expert CUDA programming advice. I am working on a Jetson TX2.

I have a 48 Mp image (6004*7920) and a map which indicates the bad pixels. The number of bad pixels is around 2% of the image, i.e. around 951,033 pixels.

I access the image by index: it is just a 1D array of about 48,000,000 floats, and I address row i and column j like this:

for(int i = 0 ; i < height ; i++){
      for(int j = 0 ; j < width ; j++){
            int index = i*width + j ; // row i, column j
      }
}

Thanks to the map, I know which pixels are bad. Here I go through all the bad pixels:

      for(int j = 0 ; j < numberBadPixel ; j++){
            cout << img[map[j]] << endl ; // all bad pixels
      }

For each bad pixel, I apply a 3*3 median filter. Note that I am not interested in the edges and corners.
I have implemented this on the CPU, and the execution time is around 80 ms (using multithreading). Now I am trying to get better performance on the GPU; the CPU baseline is sketched just below, and the first kernel I wrote for the GPU follows after it.
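
A minimal single-threaded sketch of that CPU baseline (illustrative only: the function name is mine, and the real code splits the bad-pixel list across threads):

#include <algorithm>

// Simplified CPU baseline: 3*3 median applied to the bad pixels only (edges/corners excluded).
void badPixelCorrectionCPU(float* img, const int* map, int numberBadPixel, int width)
{
    for (int j = 0; j < numberBadPixel; j++) {
        const int idx = map[j];
        float w[9] = { img[idx - width - 1], img[idx - width], img[idx - width + 1],
                       img[idx - 1],         img[idx],         img[idx + 1],
                       img[idx + width - 1], img[idx + width], img[idx + width + 1] };
        std::nth_element(w, w + 4, w + 9);  // w[4] becomes the median of the 3*3 window
        img[idx] = w[4];
    }
}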

[…]
//Host side
int threadsPerBlock = 256 ;
int blocksPerGrid_bpc =(bad_pixel + threadsPerBlock - 1) / threadsPerBlock ;
badPixelCorrection<<<blocksPerGrid_bpc, threadsPerBlock>>>(d_img_mean, d_map_pixel, bad_pixel, width) ;
[…]

//device side
__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){

    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ; // read the map only for valid threads
        float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;

        img[img_index] = quickSelect(neighborPixel, 9) ;
    }
}
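
quickSelect is just a small device helper that returns the median of the 9-element window. A minimal sketch of such a helper (illustrative only, using an insertion sort since there are only 9 values; not necessarily the exact code used for the timings):

// Illustrative device helper with the same signature as quickSelect above.
__device__ float quickSelect(float* a, int n)
{
    for (int i = 1; i < n; i++) {        // insertion sort: cheap for 9 elements
        float key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
    return a[n / 2];                     // median for odd n (n = 9 gives a[4])
}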

The execution time is around 200 ms, so the CPU version is faster. I think it is a memory problem, because here I only use global memory. I tried this version, using shared memory:

__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){

    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ;
        __shared__ float neighborPixel[9] ; // one 9-element array shared by the whole block
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;

        float windowMedian[9] = {neighborPixel[0], neighborPixel[1], neighborPixel[2],
                                 neighborPixel[3], neighborPixel[4], neighborPixel[5],
                                 neighborPixel[6], neighborPixel[7], neighborPixel[8]} ;

        img[pixel_map[index]] = quickSelect(windowMedian, 9) ;
    }
}

That version is even slower.

Does anyone have advice on how to speed up this process?

Many thanks !

Hi,

Sorry for the late reply.
May I know what memory type you use for the img and map buffers?

Are you using unified memory?
Thanks.

Hi,

Thank you for your answer.
Yes, I tested both unified memory and zero-copy memory. The computation time is the same.
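
A simplified sketch of how the two kinds of allocation are typically set up (not my exact code; error checking omitted):

#include <cuda_runtime.h>

void allocateManaged(float** img, int** map, int width, int height, int bad_pixel)
{
    // Unified (managed) memory: one pointer usable from both CPU and GPU.
    cudaMallocManaged((void**)img, (size_t)width * height * sizeof(float));
    cudaMallocManaged((void**)map, (size_t)bad_pixel * sizeof(int));
}

void allocateZeroCopy(float** h_img, float** d_img, int width, int height)
{
    // Zero-copy: pinned, mapped host allocation plus a device alias for the kernel.
    cudaHostAlloc((void**)h_img, (size_t)width * height * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)d_img, *h_img, 0);
}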

Here is my new code, which allows some testing:

__global__
void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){
    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel){
        const int img_index = pixel_map[index] ;
        float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;
        // partial bubble sort: 5 passes are enough to place the median at index 4
        float temp ;
        for (int i = 0; i < 5; i++)
        {
            for(int j = 0; j < 8 - i ; j++)
            {
                if (neighborPixel[j] > neighborPixel[j+1])
                {
                    temp = neighborPixel[j];
                    neighborPixel[j] = neighborPixel[j+1];
                    neighborPixel[j+1] = temp;
                }
            }
        }
        img[img_index] = neighborPixel[4] ;
    }
}

The problem comes from the last line:

img[img_index] = neighborPixel[4] ;

Writing to img takes a long time, but reading from it is very fast. Filling the neighborPixel array and selecting the median value do not take much time.

I stopped using shared memory because nothing justifies it here.

If I comment out the last line, the kernel execution time is only a few ms. I am surprised because it is very fast, but I do obtain the right median value.

I tried filling img[pixel_map[index]] with a constant value, 55 for example. The execution time is then around 40 ms.
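
For reference, a minimal sketch of how the kernel time alone can be measured with CUDA events, placed around the launch in the host code above (illustrative, not necessarily my exact timing code; the variable names are the ones from the host code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
badPixelCorrection<<<blocksPerGrid_bpc, threadsPerBlock>>>(d_img_mean, d_map_pixel, bad_pixel, width);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel has really finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl ;

cudaEventDestroy(start);
cudaEventDestroy(stop);

The cudaEventSynchronize call is what makes the number meaningful: the kernel launch itself returns immediately, so without a synchronization the measurement mostly reflects launch overhead.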

Thank you for your help

Hi,

Sorry for the late update.

To give a further suggestion, could you profile your application with nvprof first?
The profiling data can help us figure out the bottleneck of this issue.

Also, it looks like memory access is the key part of your implementation.
Please note that the memory access pattern can also lead to very different performance (a minimal illustration follows below the link).
https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-fortran-kernels/
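
To illustrate that point with a purely illustrative kernel pair (not from the blog post itself): neighbouring threads reading neighbouring addresses are coalesced into a few memory transactions, whereas map-driven reads such as pixel_map can scatter across the whole image.

// Illustrative only: coalesced versus scattered (gather) global reads.
__global__ void copyCoalesced(float* out, const float* in, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // consecutive threads -> consecutive addresses
}

__global__ void copyGather(float* out, const float* in, const int* map, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[map[i]];     // addresses depend on map -> possibly scattered
}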

Thanks.