In my application I have a 240-byte structure containing physical constants, pointers, and texture binding pointers. It never changes, and it is the same for all threads and all blocks. The kernel does a lot of computation with the data from this structure and performs many operations per call.
Currently the structure is generated on the CPU, stored in global memory, and copied to shared memory as the first step in the kernel.
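A minimal sketch of this scheme, under stated assumptions: the `Params` layout (60 floats, 240 bytes), the kernel name `kernelSharedCopy`, and the host helper `runSharedCopy` are hypothetical placeholders, since the real structure and kernel are not shown here.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real structure: 60 floats = 240 bytes.
struct Params {
    float constants[60];
};

// Each block first copies the structure from global to shared memory,
// then does its work reading only the shared copy.
__global__ void kernelSharedCopy(const Params* gParams, float* out)
{
    __shared__ Params sParams;

    // Cooperative word-by-word copy: thread t handles words t, t + blockDim.x, ...
    const int nWords = sizeof(Params) / sizeof(float);
    const float* src = gParams->constants;
    float* dst = sParams.constants;
    for (int i = threadIdx.x; i < nWords; i += blockDim.x)
        dst[i] = src[i];
    __syncthreads();  // the copy must be complete before any thread reads sParams

    // Placeholder for the real computation.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = sParams.constants[0] + sParams.constants[59];
}

// Hypothetical host helper: fills the structure with 1..60, launches the
// kernel, and returns out[0], or -1.0f if any CUDA call fails.
float runSharedCopy()
{
    Params h;
    for (int i = 0; i < 60; ++i) h.constants[i] = float(i + 1);

    const int blocks = 2, threads = 64;
    Params* dParams = nullptr;
    float* dOut = nullptr;
    float result = -1.0f;

    if (cudaMalloc(&dParams, sizeof(Params)) == cudaSuccess &&
        cudaMalloc(&dOut, blocks * threads * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dParams, &h, sizeof(Params), cudaMemcpyHostToDevice) == cudaSuccess) {
        kernelSharedCopy<<<blocks, threads>>>(dParams, dOut);
        // A blocking cudaMemcpy on the default stream waits for the kernel.
        if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0f;
    }
    cudaFree(dParams);
    cudaFree(dOut);
    return result;
}
```

Calling `runSharedCopy()` from `main` and printing the result is enough to exercise the sketch.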
The overhead of this copy (I have already measured it) is negligible compared to the total work done by the kernel.
My question is the following. If I pass this structure as a kernel argument instead (it is smaller than 256 bytes, so I am lucky!), can I improve performance? What I mean is that the structure currently sits in shared memory, so accesses to it from different threads may conflict with each other, and if I move it to the kernel argument space I might expect better performance. Am I right?
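A minimal sketch of the kernel-argument variant, again with hypothetical names (`Params`, `kernelByValue`, `runByValue`) and a placeholder computation. The only change from the shared-memory scheme is that the structure is passed by value at launch, so there is no device-side copy and no `__syncthreads()`. The 256-byte bound in the `static_assert` is the limit assumed in this post; which memory space the argument actually ends up in depends on the GPU architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real structure: 60 floats = 240 bytes.
struct Params {
    float constants[60];
};

// Assumption from the post: kernel arguments must fit in 256 bytes.
static_assert(sizeof(Params) <= 256, "Params too large for the assumed argument limit");

// The structure arrives by value as a kernel argument.
__global__ void kernelByValue(Params p, float* out)
{
    // Placeholder for the real computation.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = p.constants[0] + p.constants[59];
}

// Hypothetical host helper: fills the structure with 1..60, launches the
// kernel, and returns out[0], or -1.0f if any CUDA call fails.
float runByValue()
{
    Params h;
    for (int i = 0; i < 60; ++i) h.constants[i] = float(i + 1);

    const int blocks = 2, threads = 64;
    float* dOut = nullptr;
    float result = -1.0f;

    if (cudaMalloc(&dOut, blocks * threads * sizeof(float)) == cudaSuccess) {
        kernelByValue<<<blocks, threads>>>(h, dOut);  // host struct passed directly
        // A blocking cudaMemcpy on the default stream waits for the kernel.
        if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0f;
    }
    cudaFree(dOut);
    return result;
}
```

Timing both variants on the real kernel (for example with `cudaEventRecord` around the launch) is still the only way to see which one wins on a given device.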
PS: Of course I can test this myself and get timing results, but I would like to understand what really happens. I am implementing it this way now and will come back to this topic to report whether it speeds things up or slows them down!