CUDA programming

Hi,

I need expert CUDA programming advice. I am working on a Jetson TX2.

I have a 48 Mp image (6004*7920) and a map which indicates the bad pixels. The number of bad pixels is around 2% of the image, i.e. around 951,033 pixels.

I access the image by index: it is just a 1D array of about 48,000,000 floats, and I address row i and column j like this:

for(int i = 0 ; i < height ; i++){
      for(int j = 0 ; j < width ; j++){
            int index = i*width + j ; // row i, column j
      }
}

Thanks to the map, I know which pixels are bad. Here I go through all the bad pixels:

      for(int j = 0 ; j < numberBadPixel ; j++){
            cout << img[map[j]] << endl ; // all bad pixels
      }

For each bad pixel, I apply a 3*3 median filter. Note that I am not interested in the edges and corners.
I have implemented this on the CPU, and the execution time is around 80 ms (using multithreading). Now I am trying to get better performance on the GPU; the CPU baseline is sketched just below, and the first kernel I wrote for the GPU follows after it.
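
A minimal single-threaded sketch of that CPU baseline (illustrative only: the function name is mine, and the real code splits the bad-pixel list across threads):

#include <algorithm>

// Simplified CPU baseline: 3*3 median applied to the bad pixels only (edges/corners excluded).
void badPixelCorrectionCPU(float* img, const int* map, int numberBadPixel, int width)
{
    for (int j = 0; j < numberBadPixel; j++) {
        const int idx = map[j];
        float w[9] = { img[idx - width - 1], img[idx - width], img[idx - width + 1],
                       img[idx - 1],         img[idx],         img[idx + 1],
                       img[idx + width - 1], img[idx + width], img[idx + width + 1] };
        std::nth_element(w, w + 4, w + 9);  // w[4] becomes the median of the 3*3 window
        img[idx] = w[4];
    }
}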

[…]
//Host side
int threadsPerBlock = 256 ;
int blocksPerGrid_bpc =(bad_pixel + threadsPerBlock - 1) / threadsPerBlock ;
badPixelCorrection<<<blocksPerGrid_bpc, threadsPerBlock>>>(d_img_mean, d_map_pixel, bad_pixel, width) ;
[…]

//device side
__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){

    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ; // read the map only for valid threads
        float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;

        img[img_index] = quickSelect(neighborPixel, 9) ;
    }
}
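
quickSelect is just a small device helper that returns the median of the 9-element window. A minimal sketch of such a helper (illustrative only, using an insertion sort since there are only 9 values; not necessarily the exact code used for the timings):

// Illustrative device helper with the same signature as quickSelect above.
__device__ float quickSelect(float* a, int n)
{
    for (int i = 1; i < n; i++) {        // insertion sort: cheap for 9 elements
        float key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
    return a[n / 2];                     // median for odd n (n = 9 gives a[4])
}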

The execution time is around 200 ms, so the CPU version is faster. I think it is a memory problem, because here I only use global memory. I tried this version, using shared memory:

__global__ void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){

    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel)
    {
        int img_index = pixel_map[index] ;
        __shared__ float neighborPixel[9] ; // one 9-element array shared by the whole block
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;

        float windowMedian[9] = {neighborPixel[0], neighborPixel[1], neighborPixel[2],
                                 neighborPixel[3], neighborPixel[4], neighborPixel[5],
                                 neighborPixel[6], neighborPixel[7], neighborPixel[8]} ;

        img[pixel_map[index]] = quickSelect(windowMedian, 9) ;
    }
}

That version is even slower.

Does anyone have advice on how to speed up this process?

Many thanks !

Hi,

Sorry for the late reply.
May I know what memory type you use for the img and map buffers?

Are you using unified memory?
Thanks.

Hi,

Thank you for your answer.
Yes, I tested both unified memory and zero-copy memory. The computation time is the same.
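
A simplified sketch of how the two kinds of allocation are typically set up (not my exact code; error checking omitted):

#include <cuda_runtime.h>

void allocateManaged(float** img, int** map, int width, int height, int bad_pixel)
{
    // Unified (managed) memory: one pointer usable from both CPU and GPU.
    cudaMallocManaged((void**)img, (size_t)width * height * sizeof(float));
    cudaMallocManaged((void**)map, (size_t)bad_pixel * sizeof(int));
}

void allocateZeroCopy(float** h_img, float** d_img, int width, int height)
{
    // Zero-copy: pinned, mapped host allocation plus a device alias for the kernel.
    cudaHostAlloc((void**)h_img, (size_t)width * height * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)d_img, *h_img, 0);
}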

Here is my new code, which allows some testing:

__global__
void badPixelCorrection(float* img, const int* pixel_map, const int bad_pixel, const int width){
    int index = blockDim.x * blockIdx.x + threadIdx.x ;
    if(index < bad_pixel){
        const int img_index = pixel_map[index] ;
        float neighborPixel[9] ;
        neighborPixel[0] = img[img_index-width] ;
        neighborPixel[1] = img[img_index-width-1] ;
        neighborPixel[2] = img[img_index-width+1] ;
        neighborPixel[3] = img[img_index+width] ;
        neighborPixel[4] = img[img_index+width-1] ;
        neighborPixel[5] = img[img_index+width+1] ;
        neighborPixel[6] = img[img_index] ;
        neighborPixel[7] = img[img_index-1] ;
        neighborPixel[8] = img[img_index+1] ;
        // partial bubble sort: 5 passes are enough to place the median at index 4
        float temp ;
        for (int i = 0; i < 5; i++)
        {
            for(int j = 0; j < 8 - i ; j++)
            {
                if (neighborPixel[j] > neighborPixel[j+1])
                {
                    temp = neighborPixel[j];
                    neighborPixel[j] = neighborPixel[j+1];
                    neighborPixel[j+1] = temp;
                }
            }
        }
        img[img_index] = neighborPixel[4] ;
    }
}

The problem comes from the last line:

img[img_index] = neighborPixel[4] ;

Writing to img takes a long time, but reading from it is very fast. Filling the neighborPixel array and selecting the median value do not take much time.

I stopped using shared memory because nothing justifies it here.

If I comment out the last line, the kernel execution time is only a few ms. I am surprised because it is very fast, but I do obtain the right median value.

I tried filling img[pixel_map[index]] with a constant value, 55 for example. The execution time is then around 40 ms.
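
For reference, a minimal sketch of how the kernel time alone can be measured with CUDA events, placed around the launch in the host code above (illustrative, not necessarily my exact timing code; the variable names are the ones from the host code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
badPixelCorrection<<<blocksPerGrid_bpc, threadsPerBlock>>>(d_img_mean, d_map_pixel, bad_pixel, width);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel has really finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl ;

cudaEventDestroy(start);
cudaEventDestroy(stop);

The cudaEventSynchronize call is what makes the number meaningful: the kernel launch itself returns immediately, so without a synchronization the measurement mostly reflects launch overhead.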

Thank you for your help

Hi,

Sorry for the late update.

To give a further suggestion, could you profile your application with nvprof first?
The profiling data can help us figure out the bottleneck of this issue.

Also, it looks like memory access is the key part of your implementation.
Please note that the memory access pattern can also lead to very different performance (a minimal illustration follows below the link).
https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-fortran-kernels/
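
To illustrate that point with a purely illustrative kernel pair (not from the blog post itself): neighbouring threads reading neighbouring addresses are coalesced into a few memory transactions, whereas map-driven reads such as pixel_map can scatter across the whole image.

// Illustrative only: coalesced versus scattered (gather) global reads.
__global__ void copyCoalesced(float* out, const float* in, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // consecutive threads -> consecutive addresses
}

__global__ void copyGather(float* out, const float* in, const int* map, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[map[i]];     // addresses depend on map -> possibly scattered
}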

Thanks.