Random crashes when using multiple threads for multiple GPUs

Previously, I wrote a fairly complex C++ program that allocates device memory and launches kernels.

Now I want to make it work on multiple GPUs. Below is what I did:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdexcept>
#include <thread>

void do_work(int device){
  // Create a context for the specified device and make it current on
  // this thread (cuInit(0) is assumed to have been called elsewhere).
  CUcontext ctx;
  if (cuCtxCreate(&ctx, 0, device) != CUDA_SUCCESS ||
      cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
    throw std::runtime_error("Failed to create CUDA context.");
  }
  // original code
  // cudaMalloc(...)
  // f<<<...>>>();
}

int main(){
  auto t1 = std::thread(do_work, 0);
  auto t2 = std::thread(do_work, 1);
  t1.join();
  t2.join();
}

I have tested that calling do_work sequentially (one device after the other, on a single thread) works fine. But when I use two threads, it crashes at random places. What is the problem?

Please show a complete minimal example that reproduces a crash.

It is a large project, so it is hard to locate the problem and construct such a minimal example. From the crashes, it looks like data is being corrupted: I am getting “illegal warp address” errors and array overflows (indices greater than the array size).

Could you suggest some “general” pitfalls when using CUDA contexts? Something like “a CUDA context is not thread-local, so you cannot do this” (which is not true as far as I know). Or could you suggest how to locate the problem, or what further information I should provide?

Well, you most likely have a bug somewhere in either host code or device code.

Check the return code of every API call, both driver API and runtime API. Run your code under compute-sanitizer to see where the device-side memory errors come from, and run it under valgrind to find host-side memory errors.
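
For illustration, here is a minimal sketch of the kind of error-checking macros people typically wrap CUDA calls in; CHECK_CUDA and CHECK_CU are hypothetical names, not part of the toolkit:

#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macros: abort with file/line info on the first failure.
#define CHECK_CUDA(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA runtime error '%s' at %s:%d\n",     \
                   cudaGetErrorString(err_), __FILE__, __LINE__);    \
      std::abort();                                                  \
    }                                                                \
  } while (0)

#define CHECK_CU(call)                                               \
  do {                                                               \
    CUresult res_ = (call);                                          \
    if (res_ != CUDA_SUCCESS) {                                      \
      const char* msg_ = nullptr;                                    \
      cuGetErrorString(res_, &msg_);                                 \
      std::fprintf(stderr, "CUDA driver error '%s' at %s:%d\n",      \
                   msg_ ? msg_ : "unknown", __FILE__, __LINE__);     \
      std::abort();                                                  \
    }                                                                \
  } while (0)

// Usage:
//   CHECK_CU(cuCtxCreate(&ctx, 0, device));
//   CHECK_CUDA(cudaMalloc(&ptr, bytes));

compute-sanitizer needs no code changes; run your binary under it, e.g. compute-sanitizer ./your_app, and it will report the kernel and address of each invalid device memory access.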

Do you need to handle context creation yourself instead of using the implicit context from the runtime API?

“Do you need to handle context creation yourself instead of using the implicit context from the runtime API?”

Yes, because I am trying to convert the original single-GPU code to run on multiple GPUs. According to the documentation, each thread can bind to only one context at a time, and each context belongs to one GPU. So I need multiple threads, each with a context on a different GPU, like the code I posted above. The original code, with all its runtime API calls, should then automatically use the context that is current on each thread. So it is confusing why the original bug-free code becomes buggy after this multi-threading, multi-context change.

Please correct me if my understanding is wrong.

I think your understanding is wrong. Personally, I would avoid dealing with CUDA contexts explicitly whenever possible. Simply call cudaSetDevice(deviceId) to select the active device for the allocations, kernel launches, etc. You can switch GPUs whenever you like, even from the same thread.

#include <cuda_runtime.h>

__global__ void kernel(int* data){
  *data = 42; // trivial kernel body, just for illustration
}

int main(){
  int* d_array0;
  int* d_array1;

  cudaSetDevice(0);
  cudaMalloc(&d_array0, sizeof(int)); // allocate on gpu 0

  cudaSetDevice(1);
  cudaMalloc(&d_array1, sizeof(int)); // allocate on gpu 1

  cudaSetDevice(0);
  kernel<<<1, 1>>>(d_array0); // kernel runs on gpu 0

  cudaSetDevice(1);
  kernel<<<1, 1>>>(d_array1); // kernel runs on gpu 1

  cudaSetDevice(0);
  cudaDeviceSynchronize(); // synchronize gpu 0
  cudaFree(d_array0);

  cudaSetDevice(1);
  cudaDeviceSynchronize(); // synchronize gpu 1
  cudaFree(d_array1);
}
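
Applied to your code, each thread would simply select its device at the start. A sketch (do_work and the commented-out original code are from your post, the rest is an assumption about your project):

#include <cuda_runtime.h>
#include <thread>

void do_work(int device){
  cudaSetDevice(device);   // bind this thread's runtime API calls to `device`
  // original code
  // cudaMalloc(...)
  // f<<<...>>>();
  cudaDeviceSynchronize(); // wait for this device to finish
}

int main(){
  auto t1 = std::thread(do_work, 0);
  auto t2 = std::thread(do_work, 1);
  t1.join();
  t2.join();
}

cudaSetDevice makes the primary context of that device current for the calling thread, so no explicit driver API context management is needed.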

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.