horizontal lines after RGBtoRGBA and RGBAtoRGB in place algorigthm

JohnSmithX · December 22, 2010, 8:33am

Hi,

I’m using CUDA for image proccessing. The images are usually in RGB or BGR format (uchar3), which is useless for effective memory access (coalescing). At first I was converting images into RGBA format on CPU before sending them on graphic card. Baut this way is realy slow for big images (50-100ms). So I decided to use GPU for this. But the GPU have limit size of global memory and I want this algorithm to be in place.

I’m using cudaMemcpy2D function for copy RGB image on RGBA. RGB image is copied into RGBA memory size as RGBRGBRGBRGB…RGB000…000. I’m posting my code and the result image I’m geting. There you can see horizontal lines. I don;t understand why this is happening. I think that my process of converting RGB2RGBA and RGBA2RGB is ok. Why is this happening? THX

__global__ void kernel_rgb2rgbaI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[256];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x < width)

  {

    unsigned char * row = (unsigned char *)&image[y*width]; 

x = width - 1 - x; // odzadu

    int idx = 3*x;

//uchar4 rgba = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

    shared[threadIdx.x] = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

    __syncthreads();

image[y*width + x] = shared[threadIdx.x];

  }

} // kernel_rgb2rgbaI()

__global__ void kernel_rgba2rgbI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[1024];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x < width)

  {

    unsigned char * row = (unsigned char *)&image[y*width];

int idx = 3*x;

//uchar4 rgba = image[y*width + x];  

    shared[threadIdx.x] = image[y*width + x];

    __syncthreads();

//row[idx]   = rgba.x;

    //row[idx+1] = rgba.y;

    //row[idx+2] = rgba.z;

    row[idx]   = shared[threadIdx.x].x;

    row[idx+1] = shared[threadIdx.x].y;

    row[idx+2] = shared[threadIdx.x].z;

  }

} // kernel_rgba2rgbI()

extern "C" void rgb2rgbaI(uchar4 * image, const int width, const int height)

{

  //cudaFuncAttributes attr;

  //cudaError error = cudaFuncGetAttributes(&attr, kernel_rgb2rgba);

  int num_threads_per_block = 256;//attr.maxThreadsPerBlock;

dim3 block_size = dim3(num_threads_per_block, 1, 1);

  dim3 num_blocks = dim3((width + num_threads_per_block - 1) / num_threads_per_block, height, 1);

kernel_rgb2rgbaI<<<num_blocks, block_size>>>(image, width, height);

  cudaThreadSynchronize();

} // rgb2rgbaI()

extern "C" void rgba2rgbI(uchar4 * image, const int width, const int height)

{

  //cudaFuncAttributes attr;

  //cudaError error = cudaFuncGetAttributes(&attr, kernel_rgba2rgb);

  int num_threads_per_block = 256;//attr.maxThreadsPerBlock;

dim3 block_size = dim3(num_threads_per_block, 1, 1);

  dim3 num_blocks = dim3((width + num_threads_per_block - 1) / num_threads_per_block, height, 1);

kernel_rgba2rgbI<<<num_blocks, block_size>>>(image, width, height);

  cudaThreadSynchronize();

} // rgb2rgbaI()

uchar4 * input_d; 

cudaMalloc(&input_d, imagesize);

cudaMemcpy2D(input_d, 4*width, image->data(), 3*width, 3*width, height, cudaMemcpyHostToDevice); 

rgb2rgbaI(input_d, width, height);

// some image processing

rgba2rgbI(input_d, width, height);

cudaMemcpy2D(image->data(), 3*width, input_d, 4*width, 3*width, height, cudaMemcpyDeviceToHost);

External Media

tera · December 22, 2010, 11:42am

I haven’t really read your code, but __syncthreads() inside conditional code should be avoided, as it (officially) only works if the conditional evaluates the same for all threads. Try this:

__global__ void kernel_rgb2rgbaI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[256];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x < width)

  {

    unsigned char * row = (unsigned char *)&image[y*width]; 

x = width - 1 - x; // odzadu

    int idx = 3*x;

//uchar4 rgba = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

    shared[threadIdx.x] = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

  }

__syncthreads();

if (x < width)

  {

    image[y*width + x] = shared[threadIdx.x];

  }

} // kernel_rgb2rgbaI()

__global__ void kernel_rgba2rgbI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[1024];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

  unsigned char * row = (unsigned char *)&image[y*width];

  int idx = 3*x;

if (x < width)

  {

    //uchar4 rgba = image[y*width + x];  

    shared[threadIdx.x] = image[y*width + x];

  }

__syncthreads();

if (x < width)

  {

    //row[idx]   = rgba.x;

    //row[idx+1] = rgba.y;

    //row[idx+2] = rgba.z;

    row[idx]   = shared[threadIdx.x].x;

    row[idx+1] = shared[threadIdx.x].y;

    row[idx+2] = shared[threadIdx.x].z;

  }

} // kernel_rgba2rgbI()

JohnSmithX · December 22, 2010, 12:00pm

I haven’t really read your code, but __syncthreads() inside conditional code should be avoided, as it (officially) only works if the conditional evaluates the same for all threads. Try this:

__global__ void kernel_rgb2rgbaI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[256];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x < width)

  {

    unsigned char * row = (unsigned char *)&image[y*width]; 

x = width - 1 - x; // odzadu

    int idx = 3*x;

//uchar4 rgba = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

    shared[threadIdx.x] = make_uchar4(row[idx], row[idx+1], row[idx+2], 0);

  }

__syncthreads();

if (x < width)

  {

    image[y*width + x] = shared[threadIdx.x];

  }

} // kernel_rgb2rgbaI()

__global__ void kernel_rgba2rgbI(uchar4 * image, const int width, const int height)

{

  __shared__ uchar4 shared[1024];

int x = blockDim.x * blockIdx.x + threadIdx.x;

  int y = blockDim.y * blockIdx.y + threadIdx.y;

  unsigned char * row = (unsigned char *)&image[y*width];

  int idx = 3*x;

if (x < width)

  {

    //uchar4 rgba = image[y*width + x];  

    shared[threadIdx.x] = image[y*width + x];

  }

__syncthreads();

if (x < width)

  {

    //row[idx]   = rgba.x;

    //row[idx+1] = rgba.y;

    //row[idx+2] = rgba.z;

    row[idx]   = shared[threadIdx.x].x;

    row[idx+1] = shared[threadIdx.x].y;

    row[idx+2] = shared[threadIdx.x].z;

  }

} // kernel_rgba2rgbI()

THX for synchronization error, but it didn’t change the result. Still having horizontal lines. The lines seems to be cause by using same memory bank in theard-blocks.

Dittoaway · December 22, 2010, 3:02pm

I haven’t gone through all your code either but I do note that you check for the validity of x in your kernels but not y.

tera · December 22, 2010, 3:29pm

On a closer look I notice that you overwrite the image in-place. This is certainly going to fail as the order in which blocks execute is not predictable. Use a second buffer for the output of the conversion.

EDIT:

I’ve missed that sentence before. If you want to avoid the memory overhead of two buffers, you can copy you image between host and GPU with a kernel that uses zerocopy to access memory on the host and converts the format on-the-fly. This also eliminates the extra memory accesses on the device side, although their performance impact probably is negligible.

JohnSmithX · December 23, 2010, 7:29am

THX, zerocopy is pretty interesting idea

Topic		Replies	Views
Low performance and high CPU usage CUDA Programming and Performance	13	19458	August 7, 2007
Cuda demosaicing running slowly CUDA Programming and Performance	7	5692	November 16, 2011
Re_arranging Cuda array CUDA Programming and Performance	8	49	September 23, 2024
How would you do this? CUDA Programming and Performance	12	4467	August 5, 2008
CUDA performance How to improve my code? CUDA Programming and Performance	16	13574	June 29, 2010
Global Memory Reading Problem CUDA Programming and Performance	4	1406	October 10, 2009
Unable to do image processing algorithms on GPU CUDA Programming and Performance cuda , kernel , gpu	5	757	November 20, 2022
Performance issues on memory transfer CUDA Programming and Performance	13	12982	November 26, 2010
Memory access and threading problems? CUDA Programming and Performance	0	701	July 20, 2009
cudaMemcpy() returns success and copy incorrect data CUDA Programming and Performance	3	2160	March 4, 2017

horizontal lines after RGBtoRGBA and RGBAtoRGB in place algorigthm

Related topics