All the following code does is move a window of size (2N+1) x (2N+1) about every pixel in a Left and a Right image and calculate the mean pixel value in that window. The thread block size is (32,8,1) and the grid size is (32,96,1).
If I comment out the last 2 lines I get 100% occupancy, but leaving them in gives 33%. I don’t understand why, since I have 0 incoherent loads or stores. If you have any ideas or thoughts on more efficient ways to do this, please let me know.
If you comment out the last two lines, your function essentially does nothing (as it doesn’t write anywhere), so the compiler detects that and eliminates the work, and 100% occupancy is normal.
For 100% occupancy, you must use 10 registers maximum.
With those block and grid sizes, with 17 or more registers you will be stuck at 33% occupancy.
To check it, compile passing the -v flag to ptxas, and it will report register usage.
If you fill in the information (use --ptxas-options=-v on the nvcc command line) in the occupancy calculator, you will see the limiting factor. My guess is that you are using too many registers to reach 100% occupancy.
Note that having occupancy lower than 100% does not mean you will not get maximum performance. I have kernels with 33% occupancy that reach close to 70 GB/s, so maximum throughput is already achieved.
You’re right, I’m using 20 registers. What is driving the number of registers used, and what can I do to reduce it? Is there an easy way to push data into local memory? My .cubin file states:
The registers are used for all the local variables in your kernel, as well as for the texture unit addressing. Given your kernel, there probably isn’t much of a reduction you can get: 20 registers is decent.
Calculate your effective memory bandwidth, counting every texture read and memory write. Given your access pattern, you should be maxing out the device at ~70GiB/s. If this is the case, the hardware doesn’t have anything more to give and increasing the occupancy will not gain you anything.
- Before trying to increase occupancy, you should really see how many GB/s you are getting with your kernel. If it is close to the memory bandwidth, then it is useless to increase it.
I am not exactly sure whether the compiler optimizes away the 4th component of the float4 if it is not used (you could check the generated ptx by specifying -keep on the nvcc command line).
#pragma unroll on its own does not unroll anything if N is beyond a certain minimum size (say 8); in that case you can specify an explicit count, e.g. #pragma unroll 17
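For example, on the inner loop of a kernel like the one discussed here (a fragment; x, y, dy, sum and texLeft are assumed to be in scope):

```cuda
// A bare "#pragma unroll" may leave a 17-iteration loop rolled;
// an explicit count forces full unrolling:
#pragma unroll 17
for (int dx = -8; dx <= 8; ++dx)   // 2N+1 = 17 iterations when N == 8
    sum += tex2D(texLeft, x + dx, y + dy);
```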
I think your code has an error (the normalization is inside the for loops, while I think it should be outside); below is how I guess it should look (I also tried to optimize the float4 into three floats).
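Something along these lines; since the original listing isn’t reproduced in this thread, the texture names, accumulator names, N, and the pitch handling are all guesses:

```cuda
// Hypothetical reconstruction of the windowed-mean kernel.
#define N 8   // window half-size (assumed; the thread only says N is constant)

texture<float, 2, cudaReadModeElementType> texLeft;
texture<float, 2, cudaReadModeElementType> texRight;

__global__ void windowMean(float *outL, float *outR, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Plain scalars instead of a float4 whose 4th component is unused.
    float sumL = 0.0f, sumR = 0.0f;

    for (int dy = -N; dy <= N; ++dy)
        for (int dx = -N; dx <= N; ++dx) {
            sumL += tex2D(texLeft,  x + dx, y + dy);
            sumR += tex2D(texRight, x + dx, y + dy);
        }

    // Normalize once, after the loops, not on every iteration.
    float norm = 1.0f / ((2 * N + 1) * (2 * N + 1));
    outL[y * pitch + x] = sumL * norm;
    outR[y * pitch + x] = sumR * norm;
}
```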
The compiler has always optimized away the 4th component of a float4 for me. This can even be annoying sometimes when you are trying to perform a coalesced read or write of a float4.
I think occupancy is the problem, since I have another kernel with 260000 stores that has a GPU time of 725.44, while this kernel has only 65000 stores, and removing the final two lines reduces the GPU time from 11526 to 689, a difference of ~10000.
The programming guide states that
and N is a constant
Functionally it should not matter; I’m just expanding the summation function. It may be less efficient, though.
But why does the number of registers used drop from 20 to 5 when I comment out the last 2 lines, if registers are only used for local variables and texture addressing?
But doesn’t the fact that the profiler shows a nonzero value for the number of stores indicate that it isn’t doing nothing, just not publishing what it did to global memory?
Because (as has already been said a number of times) the compiler is very good at removing dead code. When you comment out the global memory writes, the compiler sees that your large loops do nothing because the value calculated by them is never written to the output. Since the large loop is “dead code” the compiler sees no reason to include it in the final output.
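Schematically (a toy kernel, not the actual code from this thread):

```cuda
__global__ void deadCodeDemo(float *out, const float *in)
{
    float sum = 0.0f;
    for (int i = 0; i < 289; ++i)   // this whole loop becomes dead code...
        sum += in[threadIdx.x + i];
    // out[threadIdx.x] = sum;      // ...once this store is commented out,
                                    // and register usage drops accordingly
}
```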
Regarding your comparison of two kernels with different numbers of stores: You need to also count the number of memory reads in tallying the effective memory bandwidth. Due to your double loop over N with a texture read inside, the number of reads in the kernel you posted here will be quite large.
the first ValueL.x will be divided by (2N+1) a lot more times than the last ValueL.x if it is written like that, or am I missing something? (which is quite possible :P )