Global device memory for multiple POSIX threads? Multiple threads launching one kernel

Hello,

Is there a way to access the same piece of global device memory from identical kernels that are launched on the same device but by different POSIX threads? If so, how can this be achieved?

From my experiments with the CUDA Runtime API, I have the impression that kernels launched from different POSIX threads usually run in different virtual memory spaces on the same device. (When two POSIX threads each allocate global memory on the same device, both get back the same global device memory address.)
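
For reference, the kind of experiment described above could look roughly like the sketch below. It is only an illustration: the worker function, the buffer size, and the use of device 0 are assumptions, not the exact test that was run. Both allocations are kept alive until both threads have allocated, so a repeated address cannot be explained by one thread simply reusing memory the other had already freed. Whether the two addresses coincide depends on the CUDA version and on whether the host threads end up sharing a device context.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Barrier so that both allocations exist at the same time before either is freed.
static pthread_barrier_t both_allocated;

// Hypothetical worker: allocates a device buffer and records its address.
static void *alloc_worker(void *arg)
{
    void **out = static_cast<void **>(arg);
    void *d_buf = nullptr;

    cudaSetDevice(0);  // assumption: both threads use device 0
    cudaError_t err = cudaMalloc(&d_buf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        d_buf = nullptr;
    }
    *out = d_buf;

    // Wait until the other thread has allocated as well, then release.
    pthread_barrier_wait(&both_allocated);
    if (d_buf != nullptr)
        cudaFree(d_buf);
    return nullptr;
}

int main()
{
    void *addr[2] = {nullptr, nullptr};
    pthread_t tid[2];

    pthread_barrier_init(&both_allocated, nullptr, 2);
    for (int i = 0; i < 2; ++i)
        pthread_create(&tid[i], nullptr, alloc_worker, &addr[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(tid[i], nullptr);
    pthread_barrier_destroy(&both_allocated);

    // Identical addresses suggest separate address spaces per thread;
    // different addresses suggest one shared address space.
    printf("thread 0 got %p\nthread 1 got %p\n", addr[0], addr[1]);
    return 0;
}
```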

A related question is whether it is possible to launch the same kernel from two POSIX threads (on the same CUDA device) without having to transfer the kernel to the device twice. Can the resulting space and time overhead be reduced to a single transfer, with the kernel then reused on the device?

I’m working on a rather large kernel that is run frequently with different parameters on the same large read-only chunk of data. I’d like to use multiple POSIX threads in order to exploit device asynchronicity and the multiple cores of the host. Duplicating code and data on the device would limit the problem size.
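
To make the intended setup concrete, here is a minimal sketch of what it might look like. It rests on one important assumption: that all host threads of the process share a single context on the device (the runtime API behaves this way since CUDA 4.0), so a device pointer allocated in one thread is valid in the others. On a runtime where each host thread gets its own context, this would not work as written. `scale_kernel`, `Job`, and `worker` are hypothetical stand-ins for the real kernel and its parameters.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the large kernel: reads the shared input,
// writes to an output buffer owned by the calling host thread.
__global__ void scale_kernel(const float *in, float *out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

// Per-thread job description (hypothetical).
struct Job {
    const float *d_in;  // shared read-only device data, allocated once
    int n;
    float factor;       // the parameter that differs between threads
    float result;       // first output element, copied back for inspection
};

static void *worker(void *arg)
{
    Job *job = static_cast<Job *>(arg);
    cudaSetDevice(0);  // assumption: everything runs on device 0

    // A separate stream per host thread allows the launches to overlap.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_out = nullptr;
    cudaMalloc(&d_out, job->n * sizeof(float));

    int threads = 256;
    int blocks = (job->n + threads - 1) / threads;
    scale_kernel<<<blocks, threads, 0, stream>>>(job->d_in, d_out, job->n, job->factor);

    cudaMemcpyAsync(&job->result, d_out, sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return nullptr;
}

int main()
{
    const int n = 1 << 20;
    float *h_in = new float[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = 1.0f;

    // The large read-only data is transferred to the device exactly once.
    float *d_in = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    Job jobs[2] = {{d_in, n, 2.0f, 0.0f}, {d_in, n, 3.0f, 0.0f}};
    pthread_t tid[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&tid[i], nullptr, worker, &jobs[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(tid[i], nullptr);

    for (int i = 0; i < 2; ++i)
        printf("thread %d: factor %.1f, first output element %.1f\n",
               i, jobs[i].factor, jobs[i].result);

    cudaFree(d_in);
    delete[] h_in;
    return 0;
}
```

In this sketch only the small output buffers are per-thread; the input buffer exists once on the device, which is the property the question is after.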

I would be glad if I could be pointed to some relevant documentation.

Thank you for your attention.