Support for multi-threaded apps and multiple applications on CUDA

Hi,

I have a few questions about the CUDA runtime's ability to support multiple contexts.

Problem 1:
Let’s say I have a multi-threaded application, where each thread is a pthread. For simplicity, let’s assume that there is no dependency between the threads; in other words, each thread is independent of the others. Inside each thread’s function, I perform some CUDA operations. Note that I am trying to run all the threads on the same GPU.
In this scenario I observe the following:

  1. When these threads are started at the same time (without any delay between creation of threads), they run to completion successfully.
  2. When I insert a sleep time between the creation of the threads, some threads finish successfully, while others report errors during kernel launch. This is consistent when I have 4 threads.

So, the question is, does the CUDA runtime guarantee correct support for multi-threaded programs? In other words, does the CUDA runtime handle per-thread contexts within a program correctly?

Problem 2:
Let’s say I am trying to run multiple applications on a single GPU at the same time. However, again, there might be some start delays between the applications. I have tested this with a set of applications; sometimes the mix contains multiple instances of the same application, and sometimes different applications.
Following are the observations:

  1. The applications mostly run to completion when they are all started at about the same time.
  2. When there is a sleep time between the starts of the various applications, I notice some “unspecified launch failure” errors.

So, the question is, does the CUDA runtime guarantee support for multiple applications running on a single GPU at the same time?

I am using CUDA version 3.0.

If you have some information on this or have experienced this before, please share your thoughts with me. This will be of great help.

Thanks in advance!

Yes, CUDA does support multi-threaded execution, but you have to be careful about how it’s used.

Only one CUDA context can be attached to each host thread, so calling CUDA runtime functions from multiple threads will implicitly create multiple CUDA contexts, as suggested by section 3.4, Interoperability between Runtime and Driver APIs, in the programming guide.

For me, the main issue this causes is that memory allocated in one context can’t be used in another (illustrated by the sketch below), but there’s a moderately complex way to overcome this.
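To make that concrete, here is a minimal sketch (not from the original post; the buffer size and names are arbitrary, and the behavior described is the CUDA 3.x runtime's implicit-context model): the main thread allocates device memory in its own implicit context, and a worker pthread, which gets a separate implicit context on its first runtime call, fails when it tries to use that pointer.

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

static float *d_buf;    /* allocated in the main thread's implicit context */

/* Worker thread: its first runtime call creates a second implicit context,
   in which d_buf is not a valid device pointer. */
static void *worker(void *arg)
{
    float host[16];
    cudaError_t rc = cudaMemcpy(host, d_buf, sizeof(host),
                                cudaMemcpyDeviceToHost);
    printf("worker copy: %s\n", cudaGetErrorString(rc));  /* expected to fail */
    return NULL;
}

int main(void)
{
    /* First runtime call in the main thread creates implicit context #1. */
    cudaMalloc((void **)&d_buf, 1 << 20);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    cudaFree(d_buf);
    return 0;
}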

Hi, thank you for your answer.

I am aware that the CUDA runtime creates a context for each host thread. In fact, my application consists of multiple threads and therefore spawns multiple CUDA contexts. Moreover, the threads are completely independent and do not share any variables.
The problems I experience are not in accessing memory: when the threads are not started simultaneously, I mostly get CUDA launch errors.

It seems to me that the behavior of the CUDA runtime is not deterministic: the execution sometimes runs to completion successfully, and sometimes it does not.

What’s the error? cudaGetErrorString(cudaGetLastError())

unspecified launch failure

For completeness, this is the function that I run in each pthread. Some of them run to completion, and some report the unspecified launch failure. As I mentioned, the execution outcome is not deterministic.

void gpu_func(int ndim, int num_rows, int num_cols, float* in_mat, float* out_mat, int num_gpu_blocks, int num_gpu_threads, int tid, FILE *log) {

    cudaError_t rc;
    float *d_in_mat, *d_out_mat;

    /* Select the GPU (all threads use device 0). */
    clk_init_start = rtclock();
    rc = cudaSetDevice(0);
    if (rc != cudaSuccess) {
        printf("%d, cudaSetDevice() :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_init_end = rtclock();
    clk_init_time = clk_init_end - clk_init_start;
    printf("Time for device initialization:%lf\n", clk_init_end - clk_init_start);

    /* Allocate the device buffers. */
    rc = cudaMalloc((void**) &d_in_mat, sizeof(float) * num_rows * num_cols);
    if (rc != cudaSuccess) {
        printf("%d, cudaMalloc(d_in_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }

    rc = cudaMalloc((void**) &d_out_mat, sizeof(float) * num_rows * num_cols);
    if (rc != cudaSuccess) {
        printf("%d, cudaMalloc(d_out_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }

    printf("%d:: d_in_mat=%p\n", tid, d_in_mat);
    printf("%d:: d_out_mat=%p\n", tid, d_out_mat);

    /* Copy the input matrices to the device. */
    clk_copy_start = rtclock();
    rc = cudaMemcpy(d_in_mat, in_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyHostToDevice);
    if (rc != cudaSuccess) {
        printf("%d, xfer1 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    fprintf(log, "Data Xfer1:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    fflush(log);

    clk_copy_start = rtclock();
    rc = cudaMemcpy(d_out_mat, out_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyHostToDevice);
    if (rc != cudaSuccess) {
        printf("%d, xfer2 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    fprintf(log, "Data Xfer2:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    fflush(log);

    printf("No. of blocks:%d, No. of threads:%d\n", num_gpu_blocks, num_gpu_threads);

    dim3 grid(num_gpu_blocks, 1, 1);
    dim3 thread(num_gpu_threads, 1, 1);

    /* Launch the Jacobi kernel and check for launch errors. */
    clk_comp_start = rtclock();
    Jacobi_device<<<grid, thread, 1024>>>(ndim, num_rows, num_cols, d_in_mat, d_out_mat, num_gpu_blocks, num_gpu_threads);
    cudaThreadSynchronize();
    rc = cudaGetLastError();
    if (rc != cudaSuccess) {
        printf("%d, kernel :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_comp_end = rtclock();
    clk_comp_time = clk_comp_end - clk_comp_start;

    //printf("Time for computation:%lf\n", clk_comp_time);
    fprintf(log, "Computation:%d\t%lf\t%lf\t%lf\n", tid, clk_comp_start, clk_comp_end, clk_comp_time);
    fflush(log);

    /* Copy the result back to the host. */
    clk_copy_start = rtclock();
    rc = cudaMemcpy(out_mat, d_out_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyDeviceToHost);
    if (rc != cudaSuccess) {
        printf("%d, xfer3 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    //printf("Data Xfer3:%lf\n", clk_copy_time);
    fprintf(log, "Data Xfer3:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    //printf("Time for data copy:%lf\n", total_copy_time);
    fflush(log);

    /* Release the device buffers. */
    rc = cudaFree(d_in_mat);
    if (rc != cudaSuccess) {
        printf("%d, cudaFree(d_in_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    rc = cudaFree(d_out_mat);
    if (rc != cudaSuccess) {
        printf("%d, cudaFree(d_out_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
}

Based on my experience, it sounds like you might have an out-of-bounds memory access somewhere that only shows up later (as the unspecified launch failure).

Have you tried commenting out pieces of your code in a divide-and-conquer way?
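One quick way to check for that (assuming your toolkit ships the standalone cuda-memcheck tool, which should be the case from CUDA 3.0 on) is to run the application under it and see whether it reports out-of-bounds global memory accesses before they surface as an unspecified launch failure:

cuda-memcheck ./your_app    # "your_app" is a placeholder for your binary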

Hi,

I am trying to share a device buffer, allocated by only one thread, between two threads. Because of the two different contexts, the second thread cannot access the device memory. What is the “moderately complex” trick you use to overcome this?

Thanks in advance,

Sam

Fundamentally, device memory allocations are context-local and contexts are thread-local, so there is no direct way of sharing device pointers between different threads.

You can use portable pinned memory to make a common buffer for all threads to share, and use that as a staging point for device-to-device transfers between contexts, as sketched below.
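As a rough sketch of that staging idea (the function names and sizes are just illustrative, not from the thread): the pinned buffer is allocated with the cudaHostAllocPortable flag, which makes it valid in every CUDA context in the process, and each thread copies through it using its own context.

#include <cuda_runtime.h>

static float *staging;   /* portable pinned host buffer, visible to all contexts */

/* Call once, before the worker threads start. */
void alloc_staging(size_t nbytes)
{
    /* cudaHostAllocPortable makes the pinned allocation usable from every
       CUDA context in this process, not just the allocating thread's. */
    cudaHostAlloc((void **)&staging, nbytes, cudaHostAllocPortable);
}

/* Thread A (context A): publish its device data into the staging buffer. */
void publish(const float *d_src, size_t nbytes)
{
    cudaMemcpy(staging, d_src, nbytes, cudaMemcpyDeviceToHost);
}

/* Thread B (context B): pull the data into its own device buffer. */
void consume(float *d_dst, size_t nbytes)
{
    cudaMemcpy(d_dst, staging, nbytes, cudaMemcpyHostToDevice);
}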

@TRT_Sam: A very useful class seems to be ‘GPUWorker’, where one CPU worker-thread object manages one CUDA context associated with one GPU.
See The Official NVIDIA Forums | NVIDIA
or http://lissom.googlecode.com/svn/trunk/MPICUDALissomV1/src/GPUWorker.cc

GPUWorker can be handy, but also adds a lot of complications to your code and build process.

Tim has hinted that this type of management may be getting easier soon… Presumably this means that a future version of CUDA will no longer tie threads to contexts so tightly.

Sam, the current method for using the same CUDA context from multiple host threads is to use cuCtxPushCurrent() and cuCtxPopCurrent() to bind the context to a host thread each time that thread wants to use it. In my experience this doesn’t really cost any performance; it only adds code complexity and the danger of forgetting to acquire or release the context.

You would use it like this:

#include <cuda.h>
#include <cuda_runtime.h>

CUcontext global_context;

int main(void)
{
    CUdevice dev;

    /* Create the context before any CUDA runtime calls, so the runtime
       layer uses this context instead of creating an implicit one. */
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&global_context, 0, dev);

    /* Detach the context from the creating thread so other threads can use it. */
    cuCtxPopCurrent(&global_context);

    /* In whichever host thread wants to do CUDA work: */
    cuCtxPushCurrent(global_context);
    cudaCode();                          /* placeholder for your runtime API calls */
    cuCtxPopCurrent(&global_context);

    return 0;
}
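One caveat, based on my understanding of the 3.x driver API rather than on anything stated above: a context can only be current in one host thread at a time, so if several threads share global_context you have to serialize the push/use/pop sequence yourself, for example with a pthread mutex:

#include <pthread.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;

void run_cuda_work(void)
{
    pthread_mutex_lock(&ctx_lock);
    cuCtxPushCurrent(global_context);    /* acquire the shared context */
    cudaCode();                          /* placeholder for your runtime calls */
    cuCtxPopCurrent(&global_context);    /* release it for other threads */
    pthread_mutex_unlock(&ctx_lock);
}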

The next release of CUDA offers a new hope for multithreaded and multi-GPU developers. :)

CUDA Episode IV: A New Hope