I allocate device memory for input and output from the main thread … then I start a new thread that computes the result into that memory. But the input values appear to be incorrect inside the running thread, and the result memcpy'd back from the device at the end is empty.
Is there some trick to making a thread use already-allocated device memory? The sample in the SDK has the thread itself allocate the device memory … what I mean is that the new thread can't see it as global memory.
The problem you are encountering is related to the notion of a “context”: every CUDA call is performed within a context, which is created either explicitly (in the driver API) or implicitly on the first call (in the runtime API). Only a single context can be loaded on a pthread at a time, and only one pthread may use a given context at a time. Just as a pointer in C/Unix is only meaningful within the address space of its process, a CUDA device pointer only makes sense within the context where it was allocated.
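To make this concrete, here is a minimal sketch (illustrative, not your code): on the runtime versions this thread is about, each host thread gets its own context, so the copy in worker() fails (the exact error code varies by version); on later runtimes (4.0 and up), one context is shared per device per process and the same copy succeeds.

```c
/* Minimal sketch of the per-thread context issue (illustrative names).
 * d_buf is allocated in the main thread's context; on pre-4.0 runtimes
 * the worker thread gets its own context, so the copy there fails. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

static float *d_buf;                  /* device pointer from main() */

static void *worker(void *arg)
{
    float h_val = 42.0f;
    /* This targets a pointer that belongs to another thread's context. */
    cudaError_t err = cudaMemcpy(d_buf, &h_val, sizeof h_val,
                                 cudaMemcpyHostToDevice);
    printf("worker memcpy: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void)
{
    pthread_t t;
    cudaMalloc((void **)&d_buf, sizeof(float)); /* main thread's context */
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    cudaFree(d_buf);
    return 0;
}
```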
If you want to use multiple threads, you can for instance dedicate one thread per device and have the other threads delegate memcpy operations to that thread (for instance with a per-context queue of requests and a per-context condition variable); a sketch of this pattern follows below. If you don't want to do that, and if the interactions between the threads are not too frequent, you can also use the context migration API, but note that this is very slow (and pretty buggy).
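Here is a minimal sketch of that delegation pattern, assuming pthreads and a single-slot mailbox in place of a real request queue (all names are illustrative; a production version would queue multiple requests with per-request completion flags, and would route cudaMalloc/cudaFree through the owner thread as well):

```c
#include <pthread.h>
#include <string.h>
#include <stddef.h>
#include <cuda_runtime.h>

typedef struct {
    void *dst, *src;
    size_t bytes;
    enum cudaMemcpyKind kind;
    int pending, done, shutdown;
    pthread_mutex_t mu;
    pthread_cond_t cv;
} MemcpyMailbox;

/* The dedicated thread: owns the device's context, services requests. */
static void *context_owner(void *arg)
{
    MemcpyMailbox *mb = (MemcpyMailbox *)arg;
    for (;;) {
        pthread_mutex_lock(&mb->mu);
        while (!mb->pending && !mb->shutdown)
            pthread_cond_wait(&mb->cv, &mb->mu);
        if (mb->shutdown) {
            pthread_mutex_unlock(&mb->mu);
            break;
        }
        /* Runs in this thread, hence inside this thread's context. */
        cudaMemcpy(mb->dst, mb->src, mb->bytes, mb->kind);
        mb->pending = 0;
        mb->done = 1;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
    }
    return NULL;
}

/* Called from any other thread; blocks until the owner did the copy. */
static void submit_memcpy(MemcpyMailbox *mb, void *dst, const void *src,
                          size_t bytes, enum cudaMemcpyKind kind)
{
    pthread_mutex_lock(&mb->mu);
    while (mb->pending)               /* wait for the slot to be free */
        pthread_cond_wait(&mb->cv, &mb->mu);
    mb->dst = dst;
    mb->src = (void *)src;
    mb->bytes = bytes;
    mb->kind = kind;
    mb->pending = 1;
    mb->done = 0;
    pthread_cond_broadcast(&mb->cv);
    while (!mb->done)
        pthread_cond_wait(&mb->cv, &mb->mu);
    pthread_mutex_unlock(&mb->mu);
}

int main(void)
{
    MemcpyMailbox mb;
    pthread_t owner;

    memset(&mb, 0, sizeof mb);
    pthread_mutex_init(&mb.mu, NULL);
    pthread_cond_init(&mb.cv, NULL);
    pthread_create(&owner, NULL, context_owner, &mb);

    /* ... worker threads call submit_memcpy(&mb, ...) here; device
     * allocations must also happen in the owner thread ... */

    pthread_mutex_lock(&mb.mu);
    mb.shutdown = 1;
    pthread_cond_broadcast(&mb.cv);
    pthread_mutex_unlock(&mb.mu);
    pthread_join(owner, NULL);
    return 0;
}
```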
Thanks for the detailed reply. I have implemented a simple solution.
I have decomposed the kernel into two portions that execute on separate devices, each with its own thread. Each thread is responsible for its portion on its assigned device. Thread 1 waits for thread 0 to complete and uses its partial results, which involves device->host->device memcpy()s. Afterwards, I am looking to build a pipeline by extending this simple solution.
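Roughly, the handoff looks like this (a minimal sketch: the buffer size, the ready-flag protocol, and the elided kernel launches are illustrative, not my actual code):

```c
#include <pthread.h>
#include <cuda_runtime.h>

#define N (1 << 20)

static float h_partial[N];           /* staging buffer on the host  */
static int ready;                    /* thread 0 -> thread 1 signal */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *stage0(void *arg)
{
    float *d_out;
    cudaSetDevice(0);                         /* context on device 0 */
    cudaMalloc((void **)&d_out, sizeof h_partial);
    /* ... launch the first-half kernel writing d_out ... */
    cudaDeviceSynchronize();
    cudaMemcpy(h_partial, d_out, sizeof h_partial, cudaMemcpyDeviceToHost);
    pthread_mutex_lock(&mu);
    ready = 1;                                /* publish partial result */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
    cudaFree(d_out);
    return NULL;
}

static void *stage1(void *arg)
{
    float *d_in;
    cudaSetDevice(1);                         /* context on device 1 */
    cudaMalloc((void **)&d_in, sizeof h_partial);
    pthread_mutex_lock(&mu);
    while (!ready)
        pthread_cond_wait(&cv, &mu);          /* wait for thread 0 */
    pthread_mutex_unlock(&mu);
    cudaMemcpy(d_in, h_partial, sizeof h_partial, cudaMemcpyHostToDevice);
    /* ... launch the second-half kernel reading d_in ... */
    cudaDeviceSynchronize();
    cudaFree(d_in);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, stage0, NULL);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```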
Maybe this is too obvious, but why not let the kernel associated with thread 0 do all the work? Your idea of device->host->device memcpys will be very costly, unless the amount of data to be transferred is very small and the kernel computation times are very long. Find something else for thread 1 to do!
I have 3 GPUs on the system.
I load the input image to the devices. GPU 1 does the first job in ~60ms, but the second job takes ~120ms, so to use the two remaining GPUs I do a functional decomposition of that task into two portions of ~60ms each, involving device-to-device memcpy()s. The objective is to reduce the total time from 120ms to 60ms.
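To spell out the arithmetic, assuming a stream of images and that GPU 1 keeps producing one frame per ~60ms:

```
t (ms)     GPU 1 (job 1)   GPU 2 (job 2, half A)   GPU 3 (job 2, half B)
  0-60     image 0         idle                    idle
 60-120    image 1         image 0                 idle
120-180    image 2         image 1                 image 0
180-240    image 3         image 2                 image 1
```

Once the pipeline is full, one image completes every ~60ms, even though each individual image still takes ~180ms end to end.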