Hi all,
I would like to know your opinion about two different strategies for writing a kernel.
Let’s assume that I have a 2D data structure and I need to perform some operation on each element of the matrix. The operations are independent and do not need access to other elements of the structure. For instance, I want to compute the square root of each element of the matrix. One would launch a 2D grid of thread blocks, with each thread operating on one matrix cell.
My aim is to write CUDA kernel code in which, as far as possible, all threads perform the same operations, so that it exploits coalesced memory access and the pipelines.
My question is: what is the best strategy to follow? I have identified two different ways (but maybe there are others) to achieve this goal.
I) Once the linear block dimension BLOCK_SIZE is defined (a 2D block has BLOCK_SIZE x BLOCK_SIZE threads), pad the dimensions of the data structures to a multiple of BLOCK_SIZE and write the kernel without any if conditions. The code is then “uniform”: it performs useless operations on the padding data, but they run in parallel, so overall performance should improve. Further, since the block size is kept small (8, 12, 16), the padding adds only a small amount of data, which becomes negligible as the data structures grow (and so do the “fake” operations on it).
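To make strategy I concrete, here is a minimal sketch of how I imagine it for the square root example. The names padded_dim, sqrt_padded and launch_sqrt_padded are made up for illustration, and I am assuming a row-major float matrix and BLOCK_SIZE 16.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed block edge; 8 or 12 would work the same way

// Hypothetical helper: round n up to the next multiple of BLOCK_SIZE.
__host__ __device__ inline int padded_dim(int n) {
    return ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
}

// Strategy I kernel: no bounds check. This is only safe because the buffers
// are allocated with padded dimensions, so every launched thread owns a cell.
__global__ void sqrt_padded(const float* in, float* out, int padded_cols) {
    int ind_col = threadIdx.x + blockIdx.x * BLOCK_SIZE;
    int ind_row = threadIdx.y + blockIdx.y * BLOCK_SIZE;
    int idx = ind_row * padded_cols + ind_col;  // row-major, padded row length
    out[idx] = sqrtf(in[idx]);                  // results in the padding are ignored
}

// Hypothetical host wrapper. d_in and d_out must each hold
// padded_dim(M) * padded_dim(N) floats of device memory.
void launch_sqrt_padded(const float* d_in, float* d_out, int M, int N) {
    int padded_rows = padded_dim(M);
    int padded_cols = padded_dim(N);
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(padded_cols / BLOCK_SIZE, padded_rows / BLOCK_SIZE);
    sqrt_padded<<<grid, block>>>(d_in, d_out, padded_cols);
}
```

As an idea of the overhead: with BLOCK_SIZE 16, a 1000x1000 matrix is padded to 1008x1008, which is about 1.6% extra cells.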
II) Define a fixed BLOCK_SIZE as in strategy I, but avoid padding the data by introducing an if statement that controls the thread flow. If the data structure does not fit into a multiple of BLOCK_SIZE, I launch a larger grid of threads (the smallest one that contains the structure) and use the if statement to keep threads from accessing cells that are outside my data structure.
Assuming that my 2D matrix has M rows and N columns, the code would be something like this:
int ind_col = threadIdx.x + blockIdx.x * BLOCK_SIZE;
int ind_row = threadIdx.y + blockIdx.y * BLOCK_SIZE;
if (ind_col < N && ind_row < M) {
    … perform operations
}
else { return; }
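For completeness, here is a minimal sketch of how I imagine the whole kernel and its launch for strategy II, again using the square root example. The names blocks_for, sqrt_guarded and launch_sqrt_guarded are made up for illustration, and I am assuming a row-major float matrix and BLOCK_SIZE 16.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed block edge

// Hypothetical helper: smallest number of blocks that covers n elements.
__host__ __device__ inline int blocks_for(int n) {
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

// Strategy II kernel: unpadded M x N matrix, guarded by a bounds check.
__global__ void sqrt_guarded(const float* in, float* out, int M, int N) {
    int ind_col = threadIdx.x + blockIdx.x * BLOCK_SIZE;
    int ind_row = threadIdx.y + blockIdx.y * BLOCK_SIZE;
    // Only threads in the last block row or column can fail this test;
    // every thread of an interior block takes the same branch.
    if (ind_col < N && ind_row < M) {
        int idx = ind_row * N + ind_col;  // row-major, unpadded row length
        out[idx] = sqrtf(in[idx]);
    }
}

// Hypothetical host wrapper. d_in and d_out must each hold M * N floats
// of device memory.
void launch_sqrt_guarded(const float* d_in, float* d_out, int M, int N) {
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(blocks_for(N), blocks_for(M));  // x covers columns, y covers rows
    sqrt_guarded<<<grid, block>>>(d_in, d_out, M, N);
}
```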
Which is the better approach? Does the if statement in strategy II prevent coalesced memory access?
I would appreciate any comments.
Thank you,
P.