I allocate device memory for input and output from the main thread … then I start a new thread that computes the result into that memory. But the input values appear to be incorrect inside the running thread, and the result memcpy'd back from the device at the end is empty.
Is there some trick to making a thread use already-allocated device memory? The sample in the SDK has the thread itself allocate the device memory … what I mean is that the new thread can't see it as global memory.
The problem you are encountering is related to the notion of a “context”: every CUDA call is performed within a context, which is created either explicitly (in the driver API) or implicitly on the first call (in the runtime API). Only a single context can be loaded on a pthread at a time, and only one pthread may use a given context at a time. Just as a pointer in C/Unix is only meaningful within the address space of its process, a CUDA device pointer only makes sense within the context where it was allocated.
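To make this concrete, here is a minimal sketch (illustrative, not your code): on the runtime versions this thread is about, each host thread gets its own context, so the copy in worker() fails (the exact error code varies by version); on later runtimes (4.0 and up), one context is shared per device per process and the same copy succeeds.

```c
/* Minimal sketch of the per-thread context issue (illustrative names).
 * d_buf is allocated in the main thread's context; on pre-4.0 runtimes
 * the worker thread gets its own context, so the copy there fails. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

static float *d_buf;                  /* device pointer from main() */

static void *worker(void *arg)
{
    float h_val = 42.0f;
    /* This targets a pointer that belongs to another thread's context. */
    cudaError_t err = cudaMemcpy(d_buf, &h_val, sizeof h_val,
                                 cudaMemcpyHostToDevice);
    printf("worker memcpy: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void)
{
    pthread_t t;
    cudaMalloc((void **)&d_buf, sizeof(float)); /* main thread's context */
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    cudaFree(d_buf);
    return 0;
}
```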
If you want to use multiple threads, you can for instance dedicate one thread per device and have the other threads delegate memcpy operations to that thread (for instance with a per-context queue of requests and a per-context condition variable); a sketch of this pattern follows below. If you don't want to do that, and if the interactions between the threads are not too frequent, you can also use the context migration API, but note that this is very slow (and pretty buggy).
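Here is a minimal sketch of that delegation pattern, assuming pthreads and a single-slot mailbox in place of a real request queue (all names are illustrative; a production version would queue multiple requests with per-request completion flags, and would route cudaMalloc/cudaFree through the owner thread as well):

```c
#include <pthread.h>
#include <string.h>
#include <stddef.h>
#include <cuda_runtime.h>

typedef struct {
    void *dst, *src;
    size_t bytes;
    enum cudaMemcpyKind kind;
    int pending, done, shutdown;
    pthread_mutex_t mu;
    pthread_cond_t cv;
} MemcpyMailbox;

/* The dedicated thread: owns the device's context, services requests. */
static void *context_owner(void *arg)
{
    MemcpyMailbox *mb = (MemcpyMailbox *)arg;
    for (;;) {
        pthread_mutex_lock(&mb->mu);
        while (!mb->pending && !mb->shutdown)
            pthread_cond_wait(&mb->cv, &mb->mu);
        if (mb->shutdown) {
            pthread_mutex_unlock(&mb->mu);
            break;
        }
        /* Runs in this thread, hence inside this thread's context. */
        cudaMemcpy(mb->dst, mb->src, mb->bytes, mb->kind);
        mb->pending = 0;
        mb->done = 1;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
    }
    return NULL;
}

/* Called from any other thread; blocks until the owner did the copy. */
static void submit_memcpy(MemcpyMailbox *mb, void *dst, const void *src,
                          size_t bytes, enum cudaMemcpyKind kind)
{
    pthread_mutex_lock(&mb->mu);
    while (mb->pending)               /* wait for the slot to be free */
        pthread_cond_wait(&mb->cv, &mb->mu);
    mb->dst = dst;
    mb->src = (void *)src;
    mb->bytes = bytes;
    mb->kind = kind;
    mb->pending = 1;
    mb->done = 0;
    pthread_cond_broadcast(&mb->cv);
    while (!mb->done)
        pthread_cond_wait(&mb->cv, &mb->mu);
    pthread_mutex_unlock(&mb->mu);
}

int main(void)
{
    MemcpyMailbox mb;
    pthread_t owner;

    memset(&mb, 0, sizeof mb);
    pthread_mutex_init(&mb.mu, NULL);
    pthread_cond_init(&mb.cv, NULL);
    pthread_create(&owner, NULL, context_owner, &mb);

    /* ... worker threads call submit_memcpy(&mb, ...) here; device
     * allocations must also happen in the owner thread ... */

    pthread_mutex_lock(&mb.mu);
    mb.shutdown = 1;
    pthread_cond_broadcast(&mb.cv);
    pthread_mutex_unlock(&mb.mu);
    pthread_join(owner, NULL);
    return 0;
}
```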
Thanks for the detailed reply. I have implemented a simple solution.
I have decomposed the kernel into two portions that execute on separate devices, each with its own thread. Each thread is responsible for its portion on its assigned device. Thread 1 waits for thread 0 to complete and uses its partial results, which involves device->host->device memcpy()s. Afterwards, I am looking to build a pipeline by extending this simple solution.
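Roughly, the handoff looks like this (a minimal sketch: the buffer size, the ready-flag protocol, and the elided kernel launches are illustrative, not my actual code):

```c
#include <pthread.h>
#include <cuda_runtime.h>

#define N (1 << 20)

static float h_partial[N];           /* staging buffer on the host  */
static int ready;                    /* thread 0 -> thread 1 signal */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *stage0(void *arg)
{
    float *d_out;
    cudaSetDevice(0);                         /* context on device 0 */
    cudaMalloc((void **)&d_out, sizeof h_partial);
    /* ... launch the first-half kernel writing d_out ... */
    cudaDeviceSynchronize();
    cudaMemcpy(h_partial, d_out, sizeof h_partial, cudaMemcpyDeviceToHost);
    pthread_mutex_lock(&mu);
    ready = 1;                                /* publish partial result */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
    cudaFree(d_out);
    return NULL;
}

static void *stage1(void *arg)
{
    float *d_in;
    cudaSetDevice(1);                         /* context on device 1 */
    cudaMalloc((void **)&d_in, sizeof h_partial);
    pthread_mutex_lock(&mu);
    while (!ready)
        pthread_cond_wait(&cv, &mu);          /* wait for thread 0 */
    pthread_mutex_unlock(&mu);
    cudaMemcpy(d_in, h_partial, sizeof h_partial, cudaMemcpyHostToDevice);
    /* ... launch the second-half kernel reading d_in ... */
    cudaDeviceSynchronize();
    cudaFree(d_in);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, stage0, NULL);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```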
Maybe this is too obvious, but why not let the kernel associated with thread 0 do all the work? Your idea of device->host->device memcpys will be very costly, unless the amount of data to be transferred is very small and the kernel computation times are very long. Find something else for thread 1 to do!
I have 3 GPUs on the system.
I load the input image to the devices. GPU 1 does the first job in ~60ms, but the second job takes ~120ms, so to use the two remaining GPUs I do a functional decomposition of that task into two portions of ~60ms each, involving device-to-device memcpy()s. The objective is to reduce the total time from 120ms to 60ms.
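To spell out the arithmetic, assuming a stream of images and that GPU 1 keeps producing one frame per ~60ms:

```
t (ms)     GPU 1 (job 1)   GPU 2 (job 2, half A)   GPU 3 (job 2, half B)
  0-60     image 0         idle                    idle
 60-120    image 1         image 0                 idle
120-180    image 2         image 1                 image 0
180-240    image 3         image 2                 image 1
```

Once the pipeline is full, one image completes every ~60ms, even though each individual image still takes ~180ms end to end.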