branch in kerernel or padding data

Hi all,

i would like to know your opinion about two different programming strategies of kernel.

Let’s assume that I have a 2D data structure and I need to perform some operations on each element of the matrix. Operations are independent and do not need access to other data of the structure. For instance I want to evaluate the square radix of each element of the matrix. One would allocate a 2D grid of blocks of threads, with each thread operating on a matrix cell.

My aim is to write cuda kernel code that is as much as possible executed by threads permorming the same operations and hence exploiting coalescent memory access and pipelines.

My doubt is what is the best strategy to follow? I mean…I individuated 2 different ways (but maybe there are others) to achieve the previous goal.

I) once defined the BLOCK_SIZE linear dimension (2D block has BLOCK_SIZExBLOCK_SIZE threads), pad the dimension of structures in order to be a multiple of BLOCK_SIZE and write code avoiding to use if conditions in order to get a “uniform” code that performs useless operations on pad data, but in a parallel fashion such that the overall performances are improved. Further, as the block size is kept small (8,12,16) the padding requires small amount of data to be added that with the increasing size of data structures can become negligible (and hence also the “fake” operations on them).

II) define a fixed BLOCK_SIZE as in the strategy I, and avoid padding data, by introducing if statement that control the thread flow. If data structure do not fit into a multiple of BLOCK_SIZE, I allocate a grid of threads bigger (the smallest containing the structure) and I can use IF statement to avoid segmentation fault of threads accessing to the cell that are out of my data strucute. The code should be something like that

Let’s assume that my 2d matrix is MxN sized.

int ind_col=threadIdx.x+blockIdx.xBLOCK_SIZE;
int ind_row=threadIdx.y+blockIdx.y
if (ind_col<N & ind_row<M) {
…perform operations}
else {return;}

What is the best way? Does strategy II exclude coalescent access to memory as there is a IF statement?

I will appreciate any comment about.
Thank you,


I think the most common answer (and probably the right one) would be: test both methods.

The most important thing to look is that your code is indeed coalesced. On current/high compute capability

things have been more relaxed so if a thread decides to diverge and not access gmem no harm will be caused.

Personally i’ve used padding a lot, sometimes it helped and sometimes didnt change.

As for divergance, again other than running the code and evaluating whether the performance is good for you or not,

you can run the code via the profiler. If the % of the divergant threads are low than its not an issue.

You can also try to use textures if you fear/think your code won’t/can’t be coalseced.

The most important thing is to try both methods and see if you’re happy with their performance. If both methods

are not good enough- maybe there is another approch and you’ll probably want to check that method as well

Thats what I like most in CUDA (well beside the x60 factor i get :) )

hope it helps