Shared memory or kernel argument: which is faster?

Hi,

in my application I have a 240-byte structure with physical constants, pointers, and texture binding pointers. It remains unchanged and is the same for all threads and all blocks. The kernel does a lot of computation with the data from this structure and performs many operations during the call.

Currently this structure is generated on the CPU, stored in global memory, and copied to shared memory as the first step of the kernel.
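The pattern I mean looks roughly like this. This is only a sketch: the struct name `Params`, its fields, and the kernel name are made up for illustration, not my real code.

```cuda
// Hypothetical stand-in for my ~240-byte structure (names assumed).
struct Params {
    float c[56];   // physical constants
    float *data;   // device pointer used by the kernel
};

__global__ void computeKernel(const Params *gParams)
{
    __shared__ Params sParams;

    // Cooperative copy from global to shared memory, one word per thread,
    // as the first command of the kernel (assumes blockDim.x >= nWords
    // and sizeof(Params) is a multiple of sizeof(int)).
    const int nWords = sizeof(Params) / sizeof(int);
    if (threadIdx.x < nWords)
        ((int *)&sParams)[threadIdx.x] = ((const int *)gParams)[threadIdx.x];
    __syncthreads();

    // ... all further computation reads sParams ...
}
```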

The overhead of this copying (I have already measured it) is negligible compared to the total work done by this kernel.

My question is the following. If I pass this structure as a kernel argument (it is smaller than 256 bytes, I am lucky!!!), can I improve performance? I mean that the structure currently resides in shared memory, so there can be bank conflicts when different threads access it, and if I move the structure to the argument space I might expect better performance. Am I right?
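For comparison, the argument version would look something like this (again a sketch with an assumed `Params` struct, not my actual code):

```cuda
// Same hypothetical struct as before; under 256 bytes, so it fits
// the kernel argument size limit on compute 1.x hardware.
struct Params {
    float c[56];
    float *data;
};

// The struct is passed by value and lives in the argument space;
// no explicit copy and no __syncthreads() are needed in the kernel.
__global__ void computeKernel(Params p)
{
    float x = p.c[0] * p.c[1];
    // ... further computation reads p directly ...
    (void)x;
}

// Host side: launched like any other argument.
// Params h = makeParams();            // assumed host helper
// computeKernel<<<grid, block>>>(h);
```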

PS: certainly I could test it myself and get a timing result; however, I want to know the real answer. I am implementing it this way now and will come back to this topic to show whether it speeds things up or slows them down!

Sincerely

Elena

AFAIR, arguments are stored in shared memory, so you should see no difference.

If that is the case, why don’t you use constant memory? It saves you the 240 bytes of shared memory, should be about as quick as registers(?), and you don’t have any coalescing restrictions.
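Something along these lines, sketched with an assumed struct and names (not Elena's actual code):

```cuda
// Hypothetical struct, declared once in the constant memory space.
struct Params {
    float c[56];
    float *data;
};

__constant__ Params cParams;

__global__ void computeKernel(void)
{
    // Constant memory reads are cached; they are fastest when all
    // threads of a half-warp read the same address, and get serialized
    // when the threads diverge to different addresses.
    float x = cParams.c[0] * cParams.c[1];
    // ...
    (void)x;
}

// Host side: fill the symbol before launching.
// Params h = makeParams();                        // assumed helper
// cudaMemcpyToSymbol(cParams, &h, sizeof(Params));
// computeKernel<<<grid, block>>>();
```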

Vrah

Hi,

thank you for your kind reply!

I saw this somewhere on the forum but was not sure; thank you for confirming it!

Do constant and shared memory accesses have the same speed? According to chapter 5.1.2.3 of the NVIDIA CUDA Programming Guide,

I cannot guarantee that all threads of my kernel access the same data from this 240-byte structure, so it could cause a slowdown. Am I right?

Sincerely

Elena

You’re right here. I missed that restriction; most of my kernels actually do access the same constant in the same cycle. If that is not the case for your implementation, then you might see a slowdown. So maybe you should stick to the shared memory version, although you also have to mind the bank conflicts there. But those conflicts are most likely less punishing than the constant memory restriction.

Vrah