In a kernel that takes 40.2 milliseconds its taking 37 milliseconds when using the stack instead of constant memory. In this particular application we are squeezing every last drop of speed we can find I just would have thought the broadcast nature of constant memory would have been a huge improvement over the stack which is cached on L1.
Reading your blog post leads me to believe I misunderstood the nature of constant memory and don’t understand what the Best Practices guide means when it says “If all threads of a warp access the same location, then constant memory can be as fast as a register access.”.
The constant cache is able to broadcast data to all threads in a warp. If access to the constant cache is uniform, that is, all threads in the warp access the same location, this access is about as fast as register access, as the Best Practices guide correctly points out. Kernel arguments are typically accessed in uniform fashion. If the access to the constant cache is not uniform, the code will still work, but there is serialization in that there are as many consecutive accesses to the constant cache as there are different addresses used across the warp.
__global__ void kernel()
//millions of indices to be called
int index = blockIdx.x * blockDim.x + threadIdx.x;
//The part I am concerned about
double ia = index/a;
double ib = index/b;
double ic = index/c;
double id = index/d;
double ie = index/e;
if(index < ...)
//do more stuff
Is it possible the threads can get slightly out of sync and thus end up serializing a read to a, b, c, d, or e? (Like for example, one of the threads reads e before another thread reads d).
Constant cache accesses cause serialization if the access is not uniform across the threads in a warp. Threads in a warp execute in lockstep unless there is divergent control flow. The portion of the code that you marked as being the area of concern has uniform (thus, non-divergent) control flow, and the addresses for the constant cache accesses are not functions of the the thread index or other data. So the accesses to variables a,b,c,d,e are uniform.
Even if there were some serialization in constant cache accesses at this point in the kernel, it would likely be a minor or unmeasurable effect unless your kernel does next to nothing after this code block in which case it would likely be subject to other bottlenecks to overall performance such as launch latency, or memory bandwidth.
In general, it is a good idea to let the CUDA profiler guide optimization efforts to focus on the most important bottlenecks for each kernel. “Don’t sweat the small stuff”.
I was writing a kernel and had a lot of parameters at the input. Switched to constant memory and saw no difference in speed. Just much more readable code without the need to pass some parameters in every function.