Hi, I have a kernel where each thread performs some calculation on one element of a 2D matrix, e.g. A(width, height), loaded into shared memory.
The threads working on elements on the array boundaries (first column, last column, first row, and last row) perform a different calculation from the threads working on the "inner" elements.
The current implementation loads one element per thread and then tests the thread and block IDs to check whether the element falls into one of these regions; if so, a different calculation is performed for each case.
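For reference, the kernel currently looks roughly like this (a simplified sketch; the names, block size, and calculations are placeholders for what I actually do):

```cuda
#define BLOCK_DIM_X 16
#define BLOCK_DIM_Y 16

// Placeholder calculations -- the real ones are more involved.
__device__ float innerCalc(float v)    { return v * 2.0f; }
__device__ float boundaryCalc(float v) { return v * 0.5f; }

// Assumes blockDim == (BLOCK_DIM_X, BLOCK_DIM_Y).
__global__ void boundaryKernel(const float* A, float* out, int width, int height)
{
    __shared__ float tile[BLOCK_DIM_Y][BLOCK_DIM_X];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    bool inside = (x < width) && (y < height);

    // One element per thread into shared memory (zero-fill outside the matrix).
    tile[threadIdx.y][threadIdx.x] = inside ? A[y * width + x] : 0.0f;
    __syncthreads();

    if (!inside) return;

    float v = tile[threadIdx.y][threadIdx.x];
    float result;
    // Threads on the matrix boundary take a different path -> branch divergence.
    if (x == 0 || x == width - 1 || y == 0 || y == height - 1)
        result = boundaryCalc(v);
    else
        result = innerCalc(v);

    out[y * width + x] = result;
}
```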
This works correctly but slowly, so I have a couple of optimization questions:
What is a recommended way to optimize these branching cases?
Can somebody please explain the branch-divergence example from Hwu's CUDA lectures (Lecture 10, Slide 5)? How does "if (threadIdx.x > 2) then I1 else I2" get optimized by dividing by WARP_SIZE? As far as I can tell, "if (threadIdx.x / WARP_SIZE > 2) then I1 else I2" forces the first three warps of the block (threads 0-95) to execute I2 and the rest to execute I1. But that isn't really equivalent to "if (threadIdx.x > 2) then I1 else I2", where only the first three threads execute I2, is it?
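To make this concrete, here is my reconstruction of the two variants (not the slide's exact code, just how I read it):

```cuda
#define WARP_SIZE 32

__global__ void divergenceExample(float* data)
{
    // Variant from the slide as I understand it: threads 0..2 take the else
    // branch and threads 3..31 take the if branch, so the first warp has to
    // execute both paths serially.
    if (threadIdx.x > 2)
        data[threadIdx.x] += 1.0f;   // I1
    else
        data[threadIdx.x] -= 1.0f;   // I2

    // "Optimized" variant: the condition only changes at warp granularity,
    // so warps 0..2 (threads 0..95) all take the else branch and warps 3 and
    // above all take the if branch -- no divergence within any single warp,
    // but a different partition of threads than the original condition.
    if (threadIdx.x / WARP_SIZE > 2)
        data[threadIdx.x] += 2.0f;   // I1
    else
        data[threadIdx.x] -= 2.0f;   // I2
}
```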
How can this be optimized when really only one or two threads should satisfy the condition?
Is it possible to "force" certain threads to go into a certain warp? E.g., could all boundary elements of the matrix be handled by threads belonging to the same warp (or a few warps, depending on the size of the matrix)?
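A toy 1D illustration of what I mean (placeholder calculations; assumes one block of width threads handling one row, with width >= 3):

```cuda
// Threads 0 and 1 (both in warp 0) handle the two boundary elements of the row;
// the remaining threads handle the inner elements, shifted by one. The divergent
// boundary/inner branch is then confined to warp 0, and all other warps take the
// inner path uniformly.
__global__ void remappedRow(const float* row, float* out, int width)
{
    int t = threadIdx.x;
    int elem;
    if (t == 0)      elem = 0;           // left boundary
    else if (t == 1) elem = width - 1;   // right boundary
    else             elem = t - 1;       // inner elements 1 .. width-2

    if (t < width) {
        float v = row[elem];
        // Placeholder boundary vs. inner calculation.
        out[elem] = (t < 2) ? v * 0.5f : v * 2.0f;
    }
}
```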
How can I check whether predication really takes place?
What would be a recommended method for calculations that access adjacent elements in large arrays? E.g., the array has too many elements to be handled by a single block, so the threads cannot all see it through one shared-memory tile because SMEM is partitioned per block.
Is there a simple way to exchange halo elements between blocks, like with MPI?
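What I currently imagine is the usual tile-plus-halo pattern sketched below (placeholder stencil; TILE and RADIUS are assumed sizes), where each block simply re-reads its neighbours' border elements from global memory instead of "exchanging" anything. Is that the recommended approach, or is there something closer to an MPI-style halo exchange?

```cuda
#define TILE   16
#define RADIUS 1

// Assumes blockDim == (TILE, TILE). Each block loads its TILE x TILE tile plus a
// RADIUS-wide halo into shared memory, so adjacent-element accesses stay in SMEM.
__global__ void tiledStencil(const float* A, float* out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int gx = blockIdx.x * TILE + threadIdx.x;   // this thread's global element
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the tile plus halo; reads are clamped at the array edges.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE) {
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int sx = min(max(blockIdx.x * TILE + dx - RADIUS, 0), width  - 1);
            int sy = min(max(blockIdx.y * TILE + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = A[sy * width + sx];
        }
    }
    __syncthreads();

    if (gx < width && gy < height) {
        int lx = threadIdx.x + RADIUS;
        int ly = threadIdx.y + RADIUS;
        // Placeholder 5-point calculation using neighbours from shared memory.
        out[gy * width + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                        tile[ly - 1][lx] + tile[ly + 1][lx]);
    }
}
```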
Many thanks in advance for quick replies!