Hi, I have a kernel where each thread performs some calculation on one element of a 2D matrix, e.g. A(width, height), loaded into shared memory.
The threads working on the elements located on the array boundaries (first column, last column, first row and last row) perform different calculations from the threads working on the "inner" elements.
The current implementation loads one element per thread and then tests the thread and block IDs to check whether they fall into these regions; if so, a different calculation is performed for each case.
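Roughly, the kind of structure I mean (a minimal sketch with placeholder names and per-region formulas, not the actual code):

__global__ void boundaryKernel(const float *in, float *out, int width, int height)
{
    extern __shared__ float tile[];   // blockDim.x * blockDim.y floats

    // assumes width and height are multiples of the block dimensions
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // each thread loads exactly one element into shared memory
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    tile[t] = in[row * width + col];
    __syncthreads();

    // different calculation depending on where the element lies
    float result;
    if      (row == 0)          result = tile[t] * 0.5f;  // first row     (placeholder)
    else if (row == height - 1) result = tile[t] * 0.5f;  // last row      (placeholder)
    else if (col == 0)          result = tile[t] * 0.5f;  // first column  (placeholder)
    else if (col == width - 1)  result = tile[t] * 0.5f;  // last column   (placeholder)
    else                        result = tile[t];         // inner element (placeholder)

    out[row * width + col] = result;
}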
This works correctly, but slowly, so I’d have a couple of optimization questions:
- what is a recommended way for optimizing the branching cases?
- Can somebody please explain the example with branch divergence from Hwu's CUDA lectures (Lecture 10, Slide 5)? How does "if (threadIdx.x > 2) then I1 else I2" get optimized by adding /WARP_SIZE? As far as I can understand, "if (threadIdx.x / WARP_SIZE > 2) then I1 else I2" will force the first 64 threads in the block to execute I1 and the rest of the threads to execute I2. But that isn't really equivalent to "if (threadIdx.x > 2) then I1 else I2", where only 2 threads execute I1, is it?
- How to optimize it in the case where really only 1 or 2 threads should satisfy the condition?
- Is it possible to "force" certain threads to go into a certain warp? E.g., could all boundary elements of the matrix be handled by threads belonging to the same warp (or to multiple warps, depending on the size of the matrix)?
- How to check if the predication really takes place?
- What would be a recommended method for performing calculations that involve access to the adjacent elements of large arrays? E.g., the array size is such that the number of elements doesn't permit putting the whole array into a single block and thus having it fully in shared memory for all threads, due to the SMEM partitioning.
- Is there a simple way for exchanging halo elements in different blocks like with MPI?
How do you define "slowly" in "works correctly, but slowly"? Are you running the same kernel on a dataset padded so that there are no boundary conditions?
To answer “- what is a recommended way for optimizing the branching cases?”
Other users on these forums, myself included, have reported immeasurable performance differences when trying to optimize away boundary-condition warp divergence. The hardware handles it very well on its own, and any junk you add to try and avoid it tends to add additional overhead.
“- How to check if the predication really takes place?”
Compile with the -keep option and examine the ptx. There is an option to include C code in comments in the ptx to help identify code, though the name of the option escapes me at the moment. Note that I have never seen CUDA 1.0 nvcc generate a real predicated instruction. I haven’t checked in CUDA 1.1.
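For what it's worth, a predicated instruction in the PTX carries the guard predicate on the instruction itself instead of branching around it, so something like the middle line below is what to look for (an illustrative excerpt, not from your kernel):

    setp.gt.s32  %p1, %r1, 2;       // p1 = (some index > 2)
    @%p1  add.f32  %f2, %f1, %f3;   // predicated: executed only where p1 is true
    @%p1  bra  $Lt_0_10;            // by contrast, a guarded branch = ordinary divergence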
“- Is there a simple way for exchanging halo elements in different blocks like with MPI?”
No
The kernel currently executes more slowly than our CPU reference code, even without counting the time to send the results back to the host.
A major part of the kernel is determining where the element being calculated is logically located; based on its location, different calculations are performed. That's why I thought these if statements were causing the kernel to perform slowly, especially after hearing Hwu's talk.
The padded array is the next thing to try. I suppose it should considerably reduce the kernel execution time, as there will be fewer sequential loads from global memory into shared memory. That was actually my first idea, but it looked like the indexing would become rather awkward if we use 18x18 threads per block to load and compute on 16x16 elements of the 2D array plus the boundary elements from adjacent blocks. Is there maybe already some sample code for padded 2D arrays online? :)
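Roughly the indexing I have in mind, as a minimal sketch (a four-neighbour average stands in for the real calculation, and the global array is assumed to be padded with a one-element border, so paddedWidth = width + 2):

#define TILE 16            // elements computed per block in each dimension
#define PAD  (TILE + 2)    // 18x18 threads: the tile plus a one-element halo ring

__global__ void haloKernel(const float *in, float *out, int paddedWidth, int width)
{
    __shared__ float tile[PAD][PAD];

    // each of the 18x18 threads loads exactly one element, halo included
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[gy * paddedWidth + gx];
    __syncthreads();

    // only the inner 16x16 threads produce an output element
    if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
        threadIdx.y >= 1 && threadIdx.y <= TILE)
    {
        float sum = tile[threadIdx.y - 1][threadIdx.x]
                  + tile[threadIdx.y + 1][threadIdx.x]
                  + tile[threadIdx.y][threadIdx.x - 1]
                  + tile[threadIdx.y][threadIdx.x + 1];
        out[(gy - 1) * width + (gx - 1)] = 0.25f * sum;   // placeholder stencil
    }
}

// launched as haloKernel<<<dim3(width/TILE, height/TILE), dim3(PAD, PAD)>>>(...)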
I only asked that question because I doubt that removing the boundary checks will improve your performance significantly. There must be something else that is dropping your performance. By far, the most common culprit is memory access patterns. Are your memory reads coalesced? That can change performance by a factor of 20. Given your boundary access pattern, a 2D texture may be the best option. Also, what is the register/local mem usage of your kernel? Any spillover to local memory can drop your performance significantly.
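For the texture route, something along these lines (a sketch with made-up names; the four-neighbour read is only a stand-in for your calculation, and out-of-range fetches are clamped to the edge for you):

texture<float, 2, cudaReadModeElementType> texA;

__global__ void texKernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // neighbours come through the texture cache; no shared-memory staging needed
    float up    = tex2D(texA, x, y - 1);
    float down  = tex2D(texA, x, y + 1);
    float left  = tex2D(texA, x - 1, y);
    float right = tex2D(texA, x + 1, y);
    out[y * width + x] = 0.25f * (up + down + left + right);
}

// Host side (sketch): copy the matrix into a cudaArray and bind the texture.
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaArray *arr;
//   cudaMallocArray(&arr, &desc, width, height);
//   cudaMemcpy2DToArray(arr, 0, 0, h_A, width * sizeof(float),
//                       width * sizeof(float), height, cudaMemcpyHostToDevice);
//   texA.addressMode[0] = texA.addressMode[1] = cudaAddressModeClamp;
//   cudaBindTextureToArray(texA, arr, desc);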
I always check a kernel's performance first by counting the total number of bytes read from and written to global memory and dividing by the execution time. 70 GiB/s is achievable even with a semi-random access pattern. Unless the kernel is so small (i.e. it executes in ~10 µs) that the driver overhead dominates, you should expect similar performance… unless of course you are limited by FLOPs.
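As a sketch of that arithmetic (names are mine; time the kernel with cudaEventRecord/cudaEventElapsedTime and count every byte the kernel reads and writes in global memory):

#include <cuda_runtime.h>

// effective bandwidth = bytes moved through global memory / kernel time
static float effectiveGiBPerSec(size_t bytesReadPlusWritten, float kernelMs)
{
    return (float)bytesReadPlusWritten / (kernelMs * 1.0e-3f)
                                       / (1024.0f * 1024.0f * 1024.0f);
}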
You can use the profiler with the options file to see how many of your reads & writes are coalesced and how many are uncoalesced. It is very informative!
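If I remember right, the command-line profiler in this release is driven by environment variables plus a small config file listing the counters you want, something like the following (treat the exact names as a sketch and check the profiler readme shipped with the toolkit):

export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=./profile_config

# profile_config, one counter per line:
gld_coherent
gld_incoherent    # uncoalesced global loads
gst_coherent
gst_incoherent    # uncoalesced global stores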
OK, so I got the padded version working correctly as well. I've used a larger number of threads, so each thread reads one element from global memory into shared memory and, if it is not a boundary thread, one additional element into a register (required by the calculation). Still, this either runs slowly or I'm not measuring the time correctly; e.g. the average kernel execution time is 0.266235 ms, compared to the 0.04 ms it takes the CPU reference solution to complete.
The current configuration is a 2x2 grid with 18x18 threads per block; the card is an 8800 GTX Ultra. In the card's global memory I store one array (2D flattened to 1D) of size 2x2x16x16. Each of the inner threads loads one element of a 16x16 slice of the array; each of the boundary threads loads one of the boundary elements of an adjacent 16x16 slice.
The cubin file gives me the following values: registers = 7, lmem = 0, smem = 48.
My shared memory array is declared extern and has a size of 4*18*18 = 5184.
When I enter these values into the occupancy calculator, the multiprocessor warp occupancy on all three graphs is at or near the maximum.
What could be the problem causing this code to run slower than the CPU code?
How could I use the proposed 2D texture, and what benefits does it have over using global memory?
And now we find the root of your problem. 4 blocks will never make sufficient use of the device. With the parameters you give, the device can keep 2 blocks per multiprocessor active (according to the occupancy calculator), a total of 32 blocks; you probably need at least 64 to really see the device enter a linear performance regime. The GPU gets its insane speed by keeping thousands of relatively slowly executing threads all moving at once. You can make your problem size a LOT bigger and still keep the same running time you have now.
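For instance (made-up sizes, just to show how quickly the block count grows):

// a 1024x1024 matrix with 16x16 tiles already gives 64x64 = 4096 blocks
int width = 1024, height = 1024;
dim3 block(16, 16);
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
// myKernel<<<grid, block>>>(d_in, d_out, width, height);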