Constant Memory vs. Kernel Input Argument Access Speed

To start with I’m working with a GK110 card so compute capability 3.5

The behavior I’ve noticed is that passing lots of input arguments into a kernel is faster than loading from a struct sitting in constant memory. Does this make sense?

To give a little background I have several paramaters in the structures sitting in constant memory

__constant__ 
struct1 
{
 double param1;
 float  param2;
 float  param3;
}

myKernel( float* positions, float *outputs )
{
 ...
 x = struct1.param1 * someCalculatedThing + struct1.param2 * someCalulatedThing2 / struct1.param3 
 ...
}

is slower than

myKernel( float* positions, float *outputs, struct1 inStruct )
{

 ...
 x = inStruct.param1 * someCalculatedThing + inStruct.param2 * someCalulatedThing2 / inStruct.param3 
 ...

}

There are many more parameters than that in both the struct and the input arguments but that abstractly represents the behavior I am seeing.

Thanks for the help!

-Merlin

How large of a speed difference are you seeing?

If you’re compiling for sm_35, then the two approaches should be producing nearly identical SASS code.

I discuss using kernel arguments vs constant variables in a blog post here.

Thanks for the response.

In a kernel that takes 40.2 milliseconds its taking 37 milliseconds when using the stack instead of constant memory. In this particular application we are squeezing every last drop of speed we can find I just would have thought the broadcast nature of constant memory would have been a huge improvement over the stack which is cached on L1.

Reading your blog post leads me to believe I misunderstood the nature of constant memory and don’t understand what the Best Practices guide means when it says “If all threads of a warp access the same location, then constant memory can be as fast as a register access.”.

I would suggest inspecting your application in one of the visual profilers. You can drill down and see arguments being marshalled and kernels launched.

3 ms. is an eternity so I’m curious where the difference is coming from.

Note that kernel parameters are passed via constant memory on sm_2x+ devices.

The constant cache is able to broadcast data to all threads in a warp. If access to the constant cache is uniform, that is, all threads in the warp access the same location, this access is about as fast as register access, as the Best Practices guide correctly points out. Kernel arguments are typically accessed in uniform fashion. If the access to the constant cache is not uniform, the code will still work, but there is serialization in that there are as many consecutive accesses to the constant cache as there are different addresses used across the warp.

Bumping this because I have a similar question

If I have the following defined

__device__ const double a, b, c, d, e;

And then I call them in the kernel

__global__ void kernel()
{
    //millions of indices to be called
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    //The part I am concerned about
    double ia = index/a;
    double ib = index/b;
    double ic = index/c;
    double id = index/d;
    double ie = index/e;

    if(index < ...)
    {
        //do more stuff
    }
}

Is it possible the threads can get slightly out of sync and thus end up serializing a read to a, b, c, d, or e? (Like for example, one of the threads reads e before another thread reads d).

Constant cache accesses cause serialization if the access is not uniform across the threads in a warp. Threads in a warp execute in lockstep unless there is divergent control flow. The portion of the code that you marked as being the area of concern has uniform (thus, non-divergent) control flow, and the addresses for the constant cache accesses are not functions of the the thread index or other data. So the accesses to variables a,b,c,d,e are uniform.

Even if there were some serialization in constant cache accesses at this point in the kernel, it would likely be a minor or unmeasurable effect unless your kernel does next to nothing after this code block in which case it would likely be subject to other bottlenecks to overall performance such as launch latency, or memory bandwidth.

In general, it is a good idea to let the CUDA profiler guide optimization efforts to focus on the most important bottlenecks for each kernel. “Don’t sweat the small stuff”.