I have read about multi-threaded programming across multiple GPUs, and I wonder whether a similar style can be used on a single GPU. Specifically: I first allocate a block of device memory, then hand that pointer to two host threads. Each thread launches its own kernel and does some computation on the shared device memory independently. However, I cannot get correct results; it looks as if the two threads cannot share the device pointer and access it correctly. I have tested this on a GPU with compute capability 1.3. Is this because it does not support concurrent kernel execution, or does anyone have other suggestions?
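To make the setup concrete, here is a minimal sketch of the pattern I mean (kernel and struct names are just placeholders): the main thread allocates one device buffer, and two pthreads each launch a kernel on their own half of it. My understanding is that on CUDA runtimes before 4.0, each host thread gets its own context, so the pointer allocated in the main thread may simply be invalid in the workers; please correct me if that is not the issue.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Hypothetical kernel: each thread increments one element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

struct Args { float *d_buf; int n; };

void *worker(void *p) {
    Args *a = (Args *)p;
    // Suspected problem: on pre-4.0 CUDA, this thread is not bound
    // to the context that allocated a->d_buf, so the pointer may
    // not be valid here.
    addOne<<<(a->n + 255) / 256, 256>>>(a->d_buf, a->n);
    cudaThreadSynchronize();  // pre-CUDA-4.0 name for device sync
    return NULL;
}

int main(void) {
    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    // Each host thread works on a disjoint half of the same allocation.
    Args a0 = { d_buf,         n / 2 };
    Args a1 = { d_buf + n / 2, n / 2 };
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, &a0);
    pthread_create(&t1, NULL, worker, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    cudaFree(d_buf);
    return 0;
}
```

Is this roughly the right mental model, or is the failure caused by something else (e.g. compute capability 1.3 not supporting concurrent kernels)?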