LDU instructions

Hi,
Can someone please give me an example of C code that produces an LDU instruction?
I have some code in which almost all running threads read from the same location. The code becomes a lot faster if I place that data in constant memory. However, I can’t actually use constant memory in production code, since I only know its size at run time.
One more thing: is there any way I can tell cuBLAS to use an LDU instruction for a certain parameter?

Thanks.

See slides 34-36 of the Fundamental Optimizations portion of the tutorial, which can be found here:
http://www.nvidia.com/object/sc10_cuda_tutorial.html
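
As far as I understand those slides, the compiler may emit LDU when the pointer is a const-qualified kernel argument, the kernel only reads through it, and the address does not depend on threadIdx (so every thread in a block loads the same location). Below is a minimal sketch under those assumptions. The kernel name `scale`, the array names and the sizes are made up for illustration. The buffer is an ordinary cudaMalloc allocation, so its size can be chosen at run time.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// coeffs is const-qualified and never written inside the kernel.
__global__ void scale(const float *coeffs, const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float a = coeffs[3];           // same address for every thread: uniform
    float b = coeffs[blockIdx.x];  // same address within a block: uniform

    if (i < n)
        out[i] = a * in[i] + b;    // in[i] depends on threadIdx.x: not uniform
}

int main()
{
    const int threads = 128, blocks = 4, n = threads * blocks;

    // Host data; coeffs needs at least `blocks` entries (and index 3).
    std::vector<float> h_coeffs(blocks), h_in(n), h_out(n);
    for (int k = 0; k < blocks; ++k) h_coeffs[k] = float(k + 1);
    for (int i = 0; i < n; ++i)      h_in[i] = float(i);

    // Plain global memory, sized at run time.
    float *d_coeffs = NULL, *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_coeffs, blocks * sizeof(float));
    cudaMalloc((void **)&d_in,     n * sizeof(float));
    cudaMalloc((void **)&d_out,    n * sizeof(float));
    cudaMemcpy(d_coeffs, h_coeffs.data(), blocks * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in,     h_in.data(),     n * sizeof(float),      cudaMemcpyHostToDevice);

    scale<<<blocks, threads>>>(d_coeffs, d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Check the result against the same arithmetic done on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float expected = h_coeffs[3] * h_in[i] + h_coeffs[i / threads];
        if (h_out[i] != expected) ++mismatches;
    }
    printf("%s\n", mismatches == 0 ? "ok" : "mismatch");

    cudaFree(d_coeffs);
    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

To see which load instructions were actually generated, you can disassemble the compiled binary with `cuobjdump -sass`. Whether LDU shows up depends on the compiler version and the target architecture, so treat the comments in the kernel as what the compiler is allowed to do, not as a guarantee.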