shared mem function argument optimization

I have a number of functions which have arguments which are the same for all threads in a block. Does the nvcc compiler guarantee that if the argument is the same for all threads, it will use shared memory (and only one calculation), or do I have to use some global scope shared variable and some “if(threadIdx.x == blockDim.x - 1) { calculate argument }”? What about return values?

Thanks in advance.

Are you talking about __global__ functions or __device__ functions? The answer for __global__ functions is “Yes, all function arguments go into shared memory” (on compute capability 1.x hardware, at least).

__device__ functions are a different story. By default they are all inlined, so the question doesn’t quite apply, since the function body is merged directly into the calling __global__ function. That said, nvcc will never put things into shared memory unless you explicitly mark the variable with the __shared__ qualifier. You are right that, in principle, a value computed purely from constants and shared memory values could be placed in shared memory automatically, but the compiler can’t do that kind of analysis at the moment.

Yes, I was asking about __device__ functions; sorry I didn’t specify. If I try to declare an argument as __shared__ [in the type signature], I get “memory qualifier on parameter is not allowed”. If I do something like

__shared__ int arg;

f(arg);

will that ensure that the argument stays in shared memory while execution is “in” the function body? If extra registers are allocated instead, that could mean, at worst, nargs extra registers per thread, i.e. nargs * ~512 per block, right?

Thanks,

Nicholas

I have used this method in the past and successfully reduced my kernel’s register count, so I would assume it works. The only way to know for sure would be to output the .ptx for a version of the code that calls the __device__ function and for a version with the code inlined by hand, to see whether the compiler handles the two the same way.
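For anyone who wants to try that comparison, here is a minimal sketch of the two variants side by side. The names g, with_call, inlined and variants_agree are made up for illustration and the arithmetic is only a stand-in. Assuming the file is saved as compare.cu (a hypothetical name), “nvcc -ptx compare.cu” should emit the PTX for both kernels, and “nvcc --ptxas-options=-v compare.cu” should print the register usage per kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical device function that receives the block-wide value as a plain argument.
__device__ int g(int arg, int x)
{
    return arg * x + 1;
}

// Variant A: the shared value is passed to the device function.
__global__ void with_call(int *out, int seed)
{
    __shared__ int arg;                 // one copy per block
    if (threadIdx.x == 0) arg = seed * 2;
    __syncthreads();                    // all threads wait for the write
    out[blockIdx.x * blockDim.x + threadIdx.x] = g(arg, threadIdx.x);
}

// Variant B: the same computation with the function body inlined by hand.
__global__ void inlined(int *out, int seed)
{
    __shared__ int arg;
    if (threadIdx.x == 0) arg = seed * 2;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = arg * threadIdx.x + 1;
}

// Host helper: runs both variants on one block and checks that the results match.
bool variants_agree(int threads)
{
    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, threads * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&b, threads * sizeof(int)) != cudaSuccess) { cudaFree(a); return false; }

    with_call<<<1, threads>>>(a, 3);
    inlined<<<1, threads>>>(b, 3);

    std::vector<int> ha(threads), hb(threads);
    bool ok = cudaMemcpy(ha.data(), a, threads * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(hb.data(), b, threads * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(a);
    cudaFree(b);
    return ok && ha == hb;
}
```

The host helper only confirms that both variants compute the same thing; the interesting part is diffing the PTX and the reported register counts for with_call and inlined.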

To answer your OP: yes, you should use something like “if (threadIdx.x==1) arg = calculation” to calculate the shared value, followed by a __syncthreads() before the other threads read it.
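A minimal sketch of that pattern, for what it’s worth. The names f, scale_kernel and run_scale are hypothetical, and “blockIdx.x + 1” is only a placeholder for the real per-block calculation. I have used thread 0 rather than thread 1 here so that it still works when a block has a single thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical device function: the block-wide value arrives as an ordinary argument.
__device__ int f(int arg, int x)
{
    return arg * x;
}

// Hypothetical kernel: one thread per block computes arg, every thread then uses it.
__global__ void scale_kernel(int *out, int n)
{
    __shared__ int arg;                 // one copy per block, not one per thread

    if (threadIdx.x == 0) {
        arg = blockIdx.x + 1;           // placeholder for the per-block calculation
    }
    __syncthreads();                    // all threads wait here until arg is written

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = f(arg, i);
    }
}

// Host helper: launches the kernel and returns the results (empty on a CUDA error).
std::vector<int> run_scale(int n, int threads_per_block)
{
    std::vector<int> host(n, 0);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return {};

    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale_kernel<<<blocks, threads_per_block>>>(dev, n);

    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

Note that the __syncthreads() sits outside the “if”, so every thread in the block reaches it; putting it inside the conditional would be a bug.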

And just a warning: these methods don’t always reduce the register count. For many operations the shared value must first be copied into a register before it can be used in arithmetic, so sometimes the register count is unaffected by using the shared value.