Random crashes when using multiple threads for multiple GPUs

Previously, I wrote a fairly complex C++ program that allocates device memory and launches kernels.

Now I want to make it work on multiple GPUs. Below is what I did:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdexcept>
#include <thread>

void do_work(int device){
  // Create a context for the specified device and make it current on
  // this thread (cuInit(0) is assumed to have been called elsewhere).
  CUcontext ctx;
  if (cuCtxCreate(&ctx, 0, device) != CUDA_SUCCESS ||
      cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
    throw std::runtime_error("Failed to create CUDA context.");
  }
  // original code
  // cudaMalloc(...)
  // f<<<...>>>();
}

int main(){
  auto t1 = std::thread(do_work, 0);
  auto t2 = std::thread(do_work, 1);
  t1.join();
  t2.join();
}

I have tested that calling do_work sequentially (one device after the other, on a single thread) works fine. But when I use two threads, it crashes at random places. What is the problem?

Please show a complete minimal example that reproduces a crash.

It is a large project, so it is hard to locate the problem and construct such a minimal example. From the crashes, it looks like data is being corrupted: I am getting “illegal warp address” errors and array overflows (indices greater than the array size).

Could you suggest some “general” pitfalls when using CUDA contexts? Something like “a CUDA context is not thread-local, so you cannot do this” (which is not true as far as I know). Or could you suggest how to locate the problem, or what further information I should provide?

Well, you most likely have a bug somewhere in either host code or device code.

Check the return code of every API call, both driver API and runtime API. Run your code under compute-sanitizer to see where the device-side memory errors come from, and run it under valgrind to find host-side memory errors.
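
For illustration, here is a minimal sketch of the kind of error-checking macros people typically wrap CUDA calls in; CHECK_CUDA and CHECK_CU are hypothetical names, not part of the toolkit:

#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macros: abort with file/line info on the first failure.
#define CHECK_CUDA(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA runtime error '%s' at %s:%d\n",     \
                   cudaGetErrorString(err_), __FILE__, __LINE__);    \
      std::abort();                                                  \
    }                                                                \
  } while (0)

#define CHECK_CU(call)                                               \
  do {                                                               \
    CUresult res_ = (call);                                          \
    if (res_ != CUDA_SUCCESS) {                                      \
      const char* msg_ = nullptr;                                    \
      cuGetErrorString(res_, &msg_);                                 \
      std::fprintf(stderr, "CUDA driver error '%s' at %s:%d\n",      \
                   msg_ ? msg_ : "unknown", __FILE__, __LINE__);     \
      std::abort();                                                  \
    }                                                                \
  } while (0)

// Usage:
//   CHECK_CU(cuCtxCreate(&ctx, 0, device));
//   CHECK_CUDA(cudaMalloc(&ptr, bytes));

compute-sanitizer needs no code changes; run your binary under it, e.g. compute-sanitizer ./your_app, and it will report the kernel and address of each invalid device memory access.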

Do you need to handle context creation yourself instead of using the implicit context from the runtime API?

“Do you need to handle context creation yourself instead of using the implicit context from the runtime API?”

Yes, because I am trying to convert the original single-GPU code to run on multiple GPUs. According to the documentation, each thread can bind to only one context at a time, and each context belongs to one GPU. So I need multiple threads, each with a context on a different GPU, like the code I posted above. The original code, with all its runtime API calls, should then automatically use the context that is current on each thread. So it is confusing why the original bug-free code becomes buggy after this multi-threading, multi-context change.

Please correct me if my understanding is wrong.

I think your understanding is wrong. Personally, I would avoid dealing with CUDA contexts explicitly whenever possible. Simply call cudaSetDevice(deviceId) to select the active device for the allocations, kernel launches, etc. You can switch GPUs whenever you like, even from the same thread.

#include <cuda_runtime.h>

__global__ void kernel(int* data){
  *data = 42; // trivial kernel body, just for illustration
}

int main(){
  int* d_array0;
  int* d_array1;

  cudaSetDevice(0);
  cudaMalloc(&d_array0, sizeof(int)); // allocate on gpu 0

  cudaSetDevice(1);
  cudaMalloc(&d_array1, sizeof(int)); // allocate on gpu 1

  cudaSetDevice(0);
  kernel<<<1, 1>>>(d_array0); // kernel runs on gpu 0

  cudaSetDevice(1);
  kernel<<<1, 1>>>(d_array1); // kernel runs on gpu 1

  cudaSetDevice(0);
  cudaDeviceSynchronize(); // synchronize gpu 0
  cudaFree(d_array0);

  cudaSetDevice(1);
  cudaDeviceSynchronize(); // synchronize gpu 1
  cudaFree(d_array1);
}
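
Applied to your code, each thread would simply select its device at the start. A sketch (do_work and the commented-out original code are from your post, the rest is an assumption about your project):

#include <cuda_runtime.h>
#include <thread>

void do_work(int device){
  cudaSetDevice(device);   // bind this thread's runtime API calls to `device`
  // original code
  // cudaMalloc(...)
  // f<<<...>>>();
  cudaDeviceSynchronize(); // wait for this device to finish
}

int main(){
  auto t1 = std::thread(do_work, 0);
  auto t2 = std::thread(do_work, 1);
  t1.join();
  t2.join();
}

cudaSetDevice makes the primary context of that device current for the calling thread, so no explicit driver API context management is needed.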

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.