Can someone please give me an example of C code producing an LDU instruction?
I have some code in which almost all running threads read from the same location. The code becomes a lot faster if I place the data being read in constant memory. However, I can’t actually use constant memory in production code, since I only know its size at run time.
One more thing: is there any way I can tell cuBLAS to use an LDU instruction for a certain parameter?