Trying to launch cluster kernel failed

I’m not sure whether it is correct or not. If I were concerned about such a measurement, one of the things I would do is the SASS analysis that I walked you through in your last question. I don’t see any evidence that you have done that here. Without that kind of analysis, I personally cannot be sure what the clock64() timestamps are actually measuring, because the compiler is free to move the load relative to the two clock reads.

Regarding the comparison to loading from global memory, that result might not be surprising for distributed shared memory (DSM). Its behavior is somewhere between “ordinary” global behavior and “ordinary” shared behavior.
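For reference, here is a minimal sketch of the kind of self-contained test case that makes this sort of measurement checkable. It is not the code from this thread: the kernel name `dsm_latency`, the cluster size of 2, the block size of 32, and the `cycles`/`sink` buffers are all assumptions chosen for illustration, and it assumes a compute capability 9.0 GPU. It times a single remote shared memory load with clock64(); whether the load really sits between the two clock reads still has to be confirmed in the SASS.

```cuda
// Hypothetical test case for illustration only.
// Build (assumes an sm_90 GPU):  nvcc -arch=sm_90 -o dsm_latency dsm_latency.cu
// Inspect the machine code:      cuobjdump -sass dsm_latency
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int kThreads = 32;  // one warp per block (assumption)
constexpr int kBlocks  = 2;   // exactly one cluster of 2 blocks (assumption)

// Compile-time cluster size of 2 blocks, so the ordinary <<<>>> launch works.
__global__ void __cluster_dims__(2, 1, 1)
dsm_latency(long long *cycles, int *sink)
{
    __shared__ int smem[kThreads];
    cg::cluster_group cluster = cg::this_cluster();

    smem[threadIdx.x] = threadIdx.x + 1;
    cluster.sync();  // peer's shared memory must be initialized before we read it

    // Map our shared array into the other block of the cluster.
    unsigned int peer = cluster.block_rank() ^ 1u;
    volatile int *remote = cluster.map_shared_rank(smem, peer);

    long long t0 = clock64();
    int v = remote[threadIdx.x];  // the DSM load we hope to time
    long long t1 = clock64();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    sink[gid]   = v;              // keep the load from being optimized away
    cycles[gid] = t1 - t0;

    cluster.sync();  // no block may exit while a peer may still read its smem
}

int main()
{
    const int n = kBlocks * kThreads;
    long long *d_cycles = nullptr;
    int *d_sink = nullptr;
    cudaMalloc(&d_cycles, n * sizeof(long long));
    cudaMalloc(&d_sink, n * sizeof(int));

    dsm_latency<<<kBlocks, kThreads>>>(d_cycles, d_sink);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long h_cycles[n];
    cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    for (int b = 0; b < kBlocks; ++b)
        printf("block %d, thread 0: %lld cycles\n", b, h_cycles[b * kThreads]);

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Nothing in this sketch forces the second clock64() to wait for the load to return, so the reported number may or may not include the full load latency. That is exactly what the SASS dump is for: check where the load instruction lands relative to the two clock reads before trusting the result. A dependent pointer-chasing loop is a common way to make the measurement more robust, but that is a different technique from what is shown here.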

Also, the code you have now posted won’t compile.