Help with thread divergence

Hi all,

I’m working on a very simple image processing application involving image binarization.

I would like to know if anyone can help me with warp divergence. The problem is that the algorithm is very, very simple: it just tests each pixel and checks whether its value is greater than a threshold (0 - 255). I don't know how to solve this warp divergence.

[codebox]__global__ void binarizeImageGPUKernel( uint1 *image, unsigned width,
                                        unsigned height, size_t pitch, unsigned char threshold )
{
//thread index
unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

//pixel access
uint1 imagePixel = image[idx];

if( imagePixel.x < threshold )
	imagePixel.x = 0;
else
	imagePixel.x = 255;

//write back to matrix
image[idx] = imagePixel;
}[/codebox]

What can I do to eliminate the divergence and to improve the kernel?

I'm also aware of the low ratio of floating-point operations to global memory accesses.

PS: I have already taken care of memory coalescing.

Thanks in advance!

I think that you can try prefetching

[codebox]// 256 is not a magic number, you can choose any number you want,
// depending on how you organize your thread blocks and grids
uint1 imagePixel_1 = image[idx];
uint1 imagePixel_2 = image[idx+256];
uint1 imagePixel_3 = image[idx+512];
uint1 imagePixel_4 = image[idx+768];

imagePixel_1.x = ( imagePixel_1.x < threshold ) ? 0 : 255;
imagePixel_2.x = ( imagePixel_2.x < threshold ) ? 0 : 255;
imagePixel_3.x = ( imagePixel_3.x < threshold ) ? 0 : 255;
imagePixel_4.x = ( imagePixel_4.x < threshold ) ? 0 : 255;

image[idx]     = imagePixel_1;
image[idx+256] = imagePixel_2;
image[idx+512] = imagePixel_3;
image[idx+768] = imagePixel_4;[/codebox]

[quote name='LSChien' post='1093144' date='Jul 25 2010, 02:35 AM']
I think that you can try prefetching
[/quote]

[codebox]__global__ void binarizeImageGPUKernel( uint1 *image, unsigned width,
                                        unsigned height, size_t pitch, unsigned char threshold )
{
unsigned tx = threadIdx.x;
unsigned idx = blockDim.x * blockIdx.x + tx;

if( tx < BLOCK_SIZE/4 )
{
	uint1 imagePixel_1 = image[idx];
	uint1 imagePixel_2 = image[idx+32];
	uint1 imagePixel_3 = image[idx+64];
	uint1 imagePixel_4 = image[idx+96];

	imagePixel_1.x = ( imagePixel_1.x < threshold ) ? 0 : 255;
	imagePixel_2.x = ( imagePixel_2.x < threshold ) ? 0 : 255;
	imagePixel_3.x = ( imagePixel_3.x < threshold ) ? 0 : 255;
	imagePixel_4.x = ( imagePixel_4.x < threshold ) ? 0 : 255;

	image[idx]    = imagePixel_1;
	image[idx+32] = imagePixel_2;
	image[idx+64] = imagePixel_3;
	image[idx+96] = imagePixel_4;
}
}[/codebox]

The memory transactions still coalesce, but the number of divergent branches is the same.

Any more ideas?

Thanks again!

What's your block size? What's the compute capability of your device?

How do you deduce that you have divergent threads? If you look at the resulting .ptx file or run the .cubin through decuda,

[codebox]// Disassembling _Z22binarizeImageGPUKernelP5uint1jjmh
000000: 10000205 40004780 mov.u16 $r0.hi, %ntid.y
000008: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo
000010: 60014c01 00204780 mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
000018: 30020001 c4100780 shl.u32 $r0, $r0, 0x00000002
000020: d0004005 20000780 mov.b32 $ofs1, 0x00000020
000028: 2000c805 04200780 add.u32 $r1, s[0x0010], $r0
000030: a4000009 04208780 cvt.rn.u32.u8 $r2, s[$ofs1+0x0000]
000038: d00e0201 80c00780 mov.u32 $r0, g[$r1]
000040: 30000401 64010780 set.gt.u32 $r0, $r2, $r0
000048: d0800001 04410780 norn.b32 $r0, $r0, c1[0x0000]
000050: d00e0201 a0c00781 mov.end.u32 g[$r1], $r0
// segment: const (0:0000)
0000: 0000000f[/codebox]

you see there isn’t even a branch at all.

Also, as you already noticed, increasing the speed of the thresholding would not make your kernel any faster, as it is entirely bound by memory bandwidth. To improve that, you can use wider memory transactions by reading a uint4 to access four pixels at once. Further improvement is possible if you can store the image as uchar and then read it via uchar4.
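For example, something along these lines (a minimal sketch, untested; numPixels is a parameter assumed here and should be a multiple of 4):

[codebox]// each thread loads four uint1 pixels in a single 128-bit transaction
__global__ void binarizeImageGPUKernelU4( uint4 *image, unsigned numPixels,
                                          unsigned char threshold )
{
	unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

	if( idx < numPixels/4 )
	{
		// four uint1 pixels in one 128-bit load
		uint4 p = image[idx];

		p.x = ( p.x < threshold ) ? 0 : 255;
		p.y = ( p.y < threshold ) ? 0 : 255;
		p.z = ( p.z < threshold ) ? 0 : 255;
		p.w = ( p.w < threshold ) ? 0 : 255;

		// one 128-bit store
		image[idx] = p;
	}
}[/codebox]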

LSChien's suggestion of prefetching might also improve things a bit. Your implementation however counteracts all its effects through the conditional if( tx < BLOCK_SIZE/4 ). This leaves three quarters of the threads unused, and as a result you have just as many outstanding memory transactions as in your previous code. Remove that test, and just divide BLOCK_SIZE by 4 instead (adjusting idx, of course), as in the sketch below.
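A minimal sketch of that fix (untested; the kernel is then launched with BLOCK_SIZE/4 threads per block):

[codebox]// launched with BLOCK_SIZE/4 threads per block; each block still
// covers BLOCK_SIZE pixels, so no thread sits idle
__global__ void binarizeImageGPUKernel( uint1 *image, unsigned width,
                                        unsigned height, size_t pitch, unsigned char threshold )
{
	unsigned idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;

	uint1 p1 = image[idx];
	uint1 p2 = image[idx + BLOCK_SIZE/4];
	uint1 p3 = image[idx + 2*(BLOCK_SIZE/4)];
	uint1 p4 = image[idx + 3*(BLOCK_SIZE/4)];

	p1.x = ( p1.x < threshold ) ? 0 : 255;
	p2.x = ( p2.x < threshold ) ? 0 : 255;
	p3.x = ( p3.x < threshold ) ? 0 : 255;
	p4.x = ( p4.x < threshold ) ? 0 : 255;

	image[idx]                    = p1;
	image[idx + BLOCK_SIZE/4]     = p2;
	image[idx + 2*(BLOCK_SIZE/4)] = p3;
	image[idx + 3*(BLOCK_SIZE/4)] = p4;
}[/codebox]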

Tera

My block size was 128 and the compute capability is 1.1 (GTS 250).

I profiled the code with the Visual Profiler and looked at the divergent branch column. The number was the same for both versions of the kernel, so I concluded that no improvement was achieved concerning the divergent threads.

One thing I forgot to mention: will memory accesses still coalesce if I store the image as uchar1 and read it as uchar4?

And I noticed that 3/4 of my threads become idle after the test if( tx < BLOCK_SIZE/4 ). Sorry about that, my mistake. I will try making these changes.

Thank you

Yes, they will. It only matters how you access the memory, not how it was originally declared.
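In practice that means something like the following on the host (a minimal sketch with assumed names: h_image is the image in host memory, and width*height is taken to be a multiple of 4):

[codebox]// allocate and fill the buffer as plain bytes, one byte per pixel
unsigned char *d_image = NULL;
cudaMalloc( (void **)&d_image, width * height );
cudaMemcpy( d_image, h_image, width * height, cudaMemcpyHostToDevice );

// access it as uchar4 in the kernel: each thread handles four pixels
unsigned numThreads = (width * height) / 4;
binarizeImageGPUKernel<<< numThreads/128, 128 >>>( (uchar4 *)d_image,
                                                   width, height, threshold );[/codebox]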

Hi again.

Tera

By doing what you suggested I got an improvement of about 2.5 - 3 times (0.31 ms vs. 0.11 ms).

I stored the image on the host as uchar1 and read it on the device as uchar4, right?

I’m going to try what LSChien suggested, now.

Thank you again

[codebox]__global__ void binarizeImageGPUKernel( uchar4 *image, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

uchar4 pixel4 = image[idx];

pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;

image[idx] = pixel4;
}[/codebox]

EDIT: After the prefetch:

[codebox]#define BLOCK_SIZE 128

__global__ void binarizeImageGPUKernel( uchar4 *image, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idxImage = threadIdx.x + blockIdx.x*512;

uchar4 pixel4_1 = image[idxImage];
uchar4 pixel4_2 = image[idxImage+128];
uchar4 pixel4_3 = image[idxImage+256];
uchar4 pixel4_4 = image[idxImage+384];

pixel4_1.x = ( pixel4_1.x < threshold ) ? 0 : 255;
pixel4_1.y = ( pixel4_1.y < threshold ) ? 0 : 255;
pixel4_1.z = ( pixel4_1.z < threshold ) ? 0 : 255;
pixel4_1.w = ( pixel4_1.w < threshold ) ? 0 : 255;

pixel4_2.x = ( pixel4_2.x < threshold ) ? 0 : 255;
pixel4_2.y = ( pixel4_2.y < threshold ) ? 0 : 255;
pixel4_2.z = ( pixel4_2.z < threshold ) ? 0 : 255;
pixel4_2.w = ( pixel4_2.w < threshold ) ? 0 : 255;

pixel4_3.x = ( pixel4_3.x < threshold ) ? 0 : 255;
pixel4_3.y = ( pixel4_3.y < threshold ) ? 0 : 255;
pixel4_3.z = ( pixel4_3.z < threshold ) ? 0 : 255;
pixel4_3.w = ( pixel4_3.w < threshold ) ? 0 : 255;

pixel4_4.x = ( pixel4_4.x < threshold ) ? 0 : 255;
pixel4_4.y = ( pixel4_4.y < threshold ) ? 0 : 255;
pixel4_4.z = ( pixel4_4.z < threshold ) ? 0 : 255;
pixel4_4.w = ( pixel4_4.w < threshold ) ? 0 : 255;

image[idxImage]     = pixel4_1;
image[idxImage+128] = pixel4_2;
image[idxImage+256] = pixel4_3;
image[idxImage+384] = pixel4_4;
}[/codebox]

The execution time didn't change at all. I think that is because the maximum achievable occupancy is 67% (a 14-register limitation, according to the CUDA occupancy calculator).

In order to reduce the number of registers used, I tried to wrap these calculations in a for loop, but it didn't help either; in fact, it took 1-2 ms more. Finally, I tried to unroll the loop manually, but nothing changed. Codes below:

[codebox]#define BLOCK_SIZE 128

__global__ void binarizeImageGPUKernel( uchar4 *image, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idxImage = threadIdx.x + blockIdx.x*512;
uchar4 pixel4;

for( unsigned i = 0; i < 4; i++ )
{
	pixel4 = image[idxImage+i*128];

	pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
	pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
	pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
	pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;

	image[idxImage+i*128] = pixel4;
}
}[/codebox]

[codebox]#define BLOCK_SIZE 128

__global__ void binarizeImageGPUKernel( uchar4 *image, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idxImage = threadIdx.x + blockIdx.x*512;
uchar4 pixel4 = image[idxImage];

pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;
image[idxImage] = pixel4;

pixel4 = image[idxImage+128];
pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;
image[idxImage+128] = pixel4;

pixel4 = image[idxImage+256];
pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;
image[idxImage+256] = pixel4;

pixel4 = image[idxImage+384];
pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;
image[idxImage+384] = pixel4;
}[/codebox]

I think the codes above are correct, right? (If there is any mistake, please correct me.)

Is there any other trick I can do to improve performance?

Thank you

I’d guess that you’ve just hit the optimum already. What is the effective memory bandwidth you achieve?

You mean the calculated effective bandwidth, or the bandwidth measured by a profiler?

EDIT:

By using the formula suggested by the CUDA Best Practices Guide:

Effective bandwidth = ((Br + Bw)/10^9)/time = ((1600x1200 x 2)/10^9)/(0.11827x10^-3) = 32.4567 GB/s

1600x1200 → number of pixels (image resolution); each pixel is one byte read plus one byte written, hence the factor of 2

0.11827x10^-3 → execution time in seconds (0.11827 ms)

And according to the specs the GTS 250 has a maximum theoretical bandwidth of 70.4 GB/s (65.565 GiB/s).

Is that what I should expect?

PS: the Visual Profiler on compute capability 1.1 devices does not support the global memory read throughput or global memory write throughput counters.

Thank you

Did you test your code on a GT200?

Or you can provide your code and I can test it on my machine.

If your GTS 250 has a 2000 MHz memory clock, then you only reach 50% of peak bandwidth, which is not high.

I want to double-check its performance on GT200.

If it also reaches only 50% of peak bandwidth on GT200, then we should try to dig out the reason.

Yes, I can provide my source code (attached below), but be advised that I was developing it in the path "recommended" by NVIDIA, which is C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\src. If you change the path, I don't know if you will be able to run the code without some modifications.

Also, I'm using OpenCV to load the images. Is there an easier way to do that, besides using cutil's loadPPM?

One last thing: how can I measure execution time in C code? I've found lots of ways, but I don't know if they are as accurate as CUDA events (0.5 microseconds). I'm using cutil timers until I find a better way to do that.
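One of the candidates I found is QueryPerformanceCounter on Windows; a minimal sketch (untested; binarizeImageCPU stands in for a hypothetical CPU version of the kernel):

[codebox]#include <windows.h>
#include <stdio.h>

LARGE_INTEGER freq, t0, t1;

QueryPerformanceFrequency( &freq );            // counter ticks per second

QueryPerformanceCounter( &t0 );
binarizeImageCPU( image, width, height, threshold );   // hypothetical CPU version
QueryPerformanceCounter( &t1 );

printf( "CPU time: %f ms\n",
        1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart );[/codebox]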

Thanks
image_bin_v5.rar (6.15 MB)

Has anyone been able to run the project?

You might create more resonance by not using a proprietary commercial archive format.

Hi all again.

I was able to test the kernel with prefetching on a GTX 295. After I executed the code through the Visual Profiler, I went to View → Summary Table.

[codebox]#define BLOCK_SIZE 128

__global__ void binarizeImageGPUKernel( uchar4 *image, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idxImage = threadIdx.x + blockIdx.x*512;

uchar4 pixel4_1 = image[idxImage];
uchar4 pixel4_2 = image[idxImage+128];
uchar4 pixel4_3 = image[idxImage+256];
uchar4 pixel4_4 = image[idxImage+384];

pixel4_1.x = ( pixel4_1.x < threshold ) ? 0 : 255;
pixel4_1.y = ( pixel4_1.y < threshold ) ? 0 : 255;
pixel4_1.z = ( pixel4_1.z < threshold ) ? 0 : 255;
pixel4_1.w = ( pixel4_1.w < threshold ) ? 0 : 255;

pixel4_2.x = ( pixel4_2.x < threshold ) ? 0 : 255;
pixel4_2.y = ( pixel4_2.y < threshold ) ? 0 : 255;
pixel4_2.z = ( pixel4_2.z < threshold ) ? 0 : 255;
pixel4_2.w = ( pixel4_2.w < threshold ) ? 0 : 255;

pixel4_3.x = ( pixel4_3.x < threshold ) ? 0 : 255;
pixel4_3.y = ( pixel4_3.y < threshold ) ? 0 : 255;
pixel4_3.z = ( pixel4_3.z < threshold ) ? 0 : 255;
pixel4_3.w = ( pixel4_3.w < threshold ) ? 0 : 255;

pixel4_4.x = ( pixel4_4.x < threshold ) ? 0 : 255;
pixel4_4.y = ( pixel4_4.y < threshold ) ? 0 : 255;
pixel4_4.z = ( pixel4_4.z < threshold ) ? 0 : 255;
pixel4_4.w = ( pixel4_4.w < threshold ) ? 0 : 255;

image[idxImage]     = pixel4_1;
image[idxImage+128] = pixel4_2;
image[idxImage+256] = pixel4_3;
image[idxImage+384] = pixel4_4;
}[/codebox]

The overall global memory throughput (read + write) was 87.8396 GB/s, which is far from the peak bandwidth (223.8 GB/s according to the NVIDIA specifications).

Should I try to dig some more in order to increase bandwidth usage? Or would it be a waste of time?

PS: just a performance metric: with an image of 3300x2400 resolution I got a speedup of around 66 times compared with the C version. Not as good as the 300x speedups we see in some places, but it's nice already, isn't it?

Thanks

Hello,

What exactly is prefetching here? I really don't understand why this piece of code is good!

Does CUDA have a special way to handle the ternary operator "? :"?

(Where can I find technical information about this technique?)

Thank you very much !

Try to also time the code without the profiler. The profiler might artificially slow down the code.

I ran your code on my machine, a Tesla C1060 and a GTX 295; both reach 46 GB/s (effective bandwidth).

In fact, the bandwidth of the GTX 295 is 112 GB/s, not 223.8 GB/s, because you only use one GPU of the GTX 295.

Hi LSChien

But did you get 46 GB/s for read, write, or overall effective bandwidth?

Because I got that number (87.8396 GB/s) for the overall bandwidth, which is the sum of the read bandwidth and the write bandwidth.

Thank you for taking the trouble to run my code on your machine.

hi all again,

I managed to make some modifications to my kernel in order to use textures, but I didn’t notice any improvement.

The original image is allocated as uchar1. Then I declared a CUDA array (for texture use), copied the data from image to cuArray, and finally bound the CUDA array to the texture reference texRef. image_out is the linearized output image.

My question is whether I should expect an improvement over the kernel using regular global memory accesses. Also, I am not completely sure I understood texture management correctly. Did I use it correctly?

If anyone could run this new kernel on a compute capability 1.3 device to see if the bandwidth has increased, I would really appreciate it.

Thank you.

[codebox]#define BLOCK_SIZE 16

texture< uchar4, 2, cudaReadModeElementType > texRef;

__global__ void binarizeImageGPUKernel( uchar4 *image_out, size_t devicePitch, unsigned width,
                                        unsigned height, unsigned char threshold )
{
unsigned idx = threadIdx.x + blockDim.x * blockIdx.x;
unsigned idy = threadIdx.y + blockDim.y * blockIdx.y;

if( idx < width/4 && idy < height )
{
	uchar4 pixel4 = tex2D( texRef, idx, idy );
	uchar4 *pixel4_out = (uchar4 *)((char *)image_out + devicePitch*idy);

	pixel4.x = ( pixel4.x < threshold ) ? 0 : 255;
	pixel4.y = ( pixel4.y < threshold ) ? 0 : 255;
	pixel4.z = ( pixel4.z < threshold ) ? 0 : 255;
	pixel4.w = ( pixel4.w < threshold ) ? 0 : 255;

	pixel4_out[idx] = pixel4;
}
}

void binarizeImageGPU( uchar1 *image, unsigned width,
                       unsigned height, unsigned hostPitch, unsigned char threshold )
{
unsigned widthInBytes = width*sizeof(uchar1);
float elapsedTime;
size_t devicePitch;
uchar4 *image_out = NULL;
cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

dim3 blockSize( BLOCK_SIZE, BLOCK_SIZE, 1 );
dim3 gridSize( ((width + BLOCK_SIZE - 1)/BLOCK_SIZE)/4, ((height + BLOCK_SIZE - 1)/BLOCK_SIZE), 1 );

cudaChannelFormatDesc channelDesc =
	cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned);

cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, widthInBytes, height);
cudaMemcpy2DToArray(cuArray, 0, 0, image, hostPitch,
	widthInBytes, height, cudaMemcpyHostToDevice);

texRef.addressMode[0] = cudaAddressModeClamp;
texRef.addressMode[1] = cudaAddressModeClamp;
texRef.filterMode     = cudaFilterModePoint;
texRef.normalized     = false;

cudaBindTextureToArray(texRef, cuArray, channelDesc);

cudaMallocPitch( (void **)&image_out, &devicePitch, widthInBytes, height );

cudaEventRecord(start, 0);
binarizeImageGPUKernel<<< gridSize, blockSize >>>( image_out, devicePitch, width, height, threshold );
cudaThreadSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&elapsedTime, start, stop);

cudaMemcpy2D( image, hostPitch, image_out, devicePitch,
	widthInBytes, height, cudaMemcpyDeviceToHost);

printf( "GPU execution time without CUDA compiler optimization /Od: %f ms\n", elapsedTime );

cudaFreeArray(cuArray);
cudaFree( image_out );
cudaThreadExit();
}[/codebox]