Failure with independent devices in independent processes: try it yourself!

I will likely submit a new bug on the developer site, but before I do I’d love to have other people with multi-GPU systems try to reproduce it.

The basic problem: When a process starts a kernel on one device, a different process using a different device can be affected by that first process.

It may block until the first process is done (even though it’s using a different device) or it may just fail.

Attached is simple source code that basically just launches a 30-second busy-loop on the device(s) specified on the command line.

If you have multiple GPUs, give it a try! It compiles with just “nvcc multi.cu -o multi”. If your non-display GPUs are device 0 and device 2, try running “./multi 0” in one shell and “./multi 2” from a second shell.

Does it fail?

There are more examples in the source code.

I get multiple different failure modes, not always reproducible. I’d like to have a couple more data points for the bug report when I submit it.

#include <iostream>
#include <stdio.h>
#include <ctime>
#include <cassert>
#include <cstdlib>
#include <cctype>
#include <cstring>
#include <vector>
#include <pthread.h>

/* Test tool to show off multi-process problems with multiple CUDA GPUs.

   Sept 1 2010  Steve Worley.
   Public domain.

   Compile with: nvcc multi.cu -o multi

   This is an example program which runs a load on one or more CUDA devices,
   waits for the computes to finish, and exits. Simply pass the device ID(s)
   as arguments; for example, "./multi 0 2 3" would run on three devices,
   #0, #2, and #3.

   This tool helps show multi-GPU failure modes. They are not always
   reproducible, but after a few tries these can happen pretty quickly.

   In these examples, I used toolkit 3.0, driver 256.44, Ubuntu 10.04 64 bit.
   The system has 2 GTX295s, device numbers 0, 2, 3, and 4. The display
   adapter is an embedded nForce 980a, using device 1.

   All problems occur only when two different PROCESSES use CUDA apps on
   DIFFERENT devices simultaneously.

   Typical failures:

   ----
   In one shell run ./multi 0
   In a different shell run ./multi 2

   The second shell will (usually) successfully SetDeviceFlags() but then
   hang at the cudaHostAlloc() call before the devices can even be started.
   When the first process completes (after about 30 seconds) the second
   process will wake up, and the second device will initialize and run.

   ----
   In one shell run ./multi 2
   In a different shell run ./multi 0

   The opposite order of the above. Typically device 2 will run properly,
   but the device 0 process will fail and exit at the HostGetDevicePointer()
   call.

   ----
   In one shell run ./multi 0 2
   In a different shell run ./multi 3 4

   The above pair will often fail as the first case, but sometimes will work.
   In the cases when it does work, interrupt one of the runs with ^C. After a
   pause, it will exit. But then the OTHER process (running on independent
   devices) will also exit.
*/

using std::vector;

vector<int> deviceID;

unsigned int *h_results; // global zero-copy result array, one word per device

__global__ void kernel(int in, int *out)
{
  if (0!=threadIdx.x || 0!=blockIdx.x) return;

  int v=in;
  /* slow useless compute. About 30 seconds on a GTX295. */
  for (int i=0; i<0x10000000; i++) {
    v=123*v+456;
  }
  if (v!=1) v=0; // can never happen but compiler doesn't know that
  *out=100000+in+v;
}

static void* localThreadFunction(void *ptr)
{
  int index= *((int *)ptr);
  int ID= deviceID[index];
  cudaError_t err;
  int *d_result;

  /* Flags must be set BEFORE context creation */
  err=cudaSetDeviceFlags(cudaDeviceMapHost);
  if (cudaSuccess!=err) {
    printf("Thread failed to set device flags\n");
    exit(0);
  }

  err=cudaSetDevice(ID);
  if (cudaSuccess!=err) {
    printf("Thread failed to set device\n");
    exit(0);
  }

  err=cudaHostGetDevicePointer((void **)&d_result, h_results, 0);
  if (cudaSuccess!=err) {
    printf("CUDA failure  in thread HostGetDevicePointer\n");
    exit(0);
  }

  printf("Device %d initialized successfully, starting compute.\n", ID);

  kernel<<<64, 64>>>(ID, d_result+index);
  cudaThreadSynchronize(); // wait for the kernel to finish

  printf("Device %d finished compute.\n", ID);
  return NULL;
}

static void usage()
{
  std::cout << "multi: first_device_ID second_device_ID third...." << std::endl;
  exit(0);
}

int main(int argc, char **argv)
{
  cudaError_t err;

  if (argc<2) usage();

  for (int i=1; i<argc; i++) {
    int dev=atoi(argv[i]);
    if (dev>0) deviceID.push_back(dev);
    else if (dev<0) usage();
    else if (!isdigit(argv[i][0])) usage();
    else deviceID.push_back(0);
  }

  /* prepare zero-copy memory for a returned result, one word per device */
  printf("About to SetDeviceFlags\n");
  err=cudaSetDeviceFlags(cudaDeviceMapHost);
  if (cudaSuccess!=err) {
    printf("CUDA failure calling SetDeviceFlags\n");
    exit(0);
  }
  printf("SetDeviceFlags finished.\n");

  err=cudaHostAlloc((void **)&h_results, deviceID.size()*sizeof(int),
                    cudaHostAllocMapped|cudaHostAllocPortable);
  if (cudaSuccess!=err) {
    printf("CUDA failure in HostAlloc\n");
    exit(0);
  }
  printf("cudaHostAlloc finished.\n");

  memset(h_results, 0, deviceID.size()*sizeof(int));

  /* launch one thread per device to do a compute */
  pthread_t thread[16];
  int arg[16];

  for (int i=0; i<(int)deviceID.size(); i++) {
    arg[i]=i;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceID[i]);
    printf("Starting thread for compute on device %d : %s\n",
           deviceID[i], deviceProp.name);
    (void)pthread_create(&thread[i], NULL, localThreadFunction,
                         (void *)(&arg[i]));
  }

  printf("All threads launched\n");

  for (int i=0; i<(int)deviceID.size(); i++)
    pthread_join(thread[i], NULL);

  printf("All threads completed.\n");

  for (int i=0; i<(int)deviceID.size(); i++) {
    printf("Device %d returned %d\n", deviceID[i], h_results[i]);
  }
  return 0;
}

I would love to try it out on my multi-GPU rig, but with CUDA 3.1 on Ubuntu 9.04 64-bit and the 256.53 driver, I can’t even get a single instance to run:

avidday@cuda:~$ module load cuda/3.1
avidday@cuda:~$ nvcc multi.cu -o multi
avidday@cuda:~$ ./multi 0
About to SetDeviceFlags
SetDeviceFlags finished.
cudaHostAlloc finished.
Starting thread for compute on device 0 : GeForce GTX 470
All threads launched
CUDA failure  in thread HostGetDevicePointer

Interesting… I have kept to the 3.0 toolkit for other reasons… I wonder if your failure is yet another issue?

I did get confirmation of the multi-gpu multi-process issue and have reported it.

Hello,

Tested under CUDA 3.1. I have only 2 cards (0 and 1).

No bug with “multi 0 1”, but it does indeed hang (“serialize”) if I launch “multi 0” and “multi 1” from two shells.

Is this even supposed to work?

The recommended way to run multiple devices is in separate threads within the same process. I have code working that does that, and that is what your “multi 0 1” will do. I think in this case there is a single instance of the CUDA runtime in the main thread, which starts up on the first CUDA call; subsequent pthreads use the same runtime, but with different device IDs.

But with “multi 0” in one shell and “multi 1” in another, each process will try to start the CUDA runtime. I am guessing the runtime does not allow that, and the second process is not allowed to proceed until the first has finished. That would explain why it hangs at cudaHostAlloc().

Unless you have seen some documentation that suggests the runtime supports multiple simultaneous host processes (as well as threads), I would think this is not a bug.

It may be possible to do this using the lower-level CUDA driver API, but I don’t know. Perhaps a CUDA runtime expert could help us?
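For what it is worth, here is a minimal sketch of the per-process driver API setup being suggested (an illustration assuming the 3.x driver API, not code from this thread): each process calls cuInit and creates its own context on the device it was given, so no runtime-level state is shared between processes. Module loading and kernel launch are omitted.

#include <cuda.h>
#include <cstdio>
#include <cstdlib>

static void check(CUresult r, const char *what)
{
  if (r != CUDA_SUCCESS) {
    fprintf(stderr, "driver API call %s failed (%d)\n", what, (int)r);
    exit(1);
  }
}

int main(int argc, char **argv)
{
  int ordinal = (argc > 1) ? atoi(argv[1]) : 0;

  CUdevice dev;
  CUcontext ctx;
  check(cuInit(0), "cuInit");
  check(cuDeviceGet(&dev, ordinal), "cuDeviceGet");
  check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");  /* context private to this process */

  CUdeviceptr dptr;
  check(cuMemAlloc(&dptr, sizeof(int)), "cuMemAlloc");

  /* ...load a module with cuModuleLoad and launch work here... */

  cuMemFree(dptr);
  cuCtxDestroy(ctx);
  return 0;
}

Whether this avoids the inter-process blocking seen above is exactly the open question, since the runtime API is itself layered on driver contexts.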

You bring up a good point: CUDA may not have documented an explicit promise that more than one GPU process can run on a machine at once. But even in that case, this would still be a bug, since CUDA does not handle such multi-process cases with a graceful failure.

The bug has already been confirmed by NVIDIA.

Hi,

I ran it on a Linux system with two GTX295s. I ran it twice:

Run1: multi 1 (session 1) and multi 0 (session 2)

Run2: multi 2 (session 1) and multi 1 (session 2)

Both runs ran fine and gave the expected results (I guess :) ):

Device 0 returned 100000

Device 1 returned 100001

Device 2 returned 100002

As for the previous remark, our code has always had the option to run either as 8 separate processes working against 8 GPUs (2 S1070s), or as 8 threads in one process working against the same 8 GPUs. I’ve never seen any problem with either, at least pre-Fermi.

I did have a lot of trouble connecting 3 or 4 S1070s and finally gave up; we only use 2 S1070s per machine.

eyal

I can confirm that I see this problem with an S1070 connected to an HP DL160 G5.

[username@nv1 ~]$ nvidia-smi -s
COMPUTE mode rules for GPU 0: 1
COMPUTE mode rules for GPU 1: 1
COMPUTE mode rules for GPU 2: 1
COMPUTE mode rules for GPU 3: 1

[username@nv1 cuda]$ ./multi 0
About to SetDeviceFlags
SetDeviceFlags finished.
cudaHostAlloc finished.
Starting thread for compute on device 0 : Tesla T10 Processor
All threads launched
CUDA failure in thread HostGetDevicePointer

I see this failure in at least one of the threads every time I try the three tests described above.

This is with CUDA 3.1 and the latest driver (as of last week: 256.53).

When I change the mode back to 0 (normal rather than compute exclusive) I don’t see any errors. However, if I start a second process approximately 5 seconds after the first, the second one gets blocked here:

[username@nv1 cuda]$ ./multi 2
About to SetDeviceFlags
SetDeviceFlags finished.

[blocks here waiting for the first process to finish…]

And then at the exact moment that the first process ends, the second one continues, takes a while to compute, and finally finishes.

This is really bad now that we’re looking at finally running some production jobs.
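Since the behaviour above depends on the compute mode, here is a small sketch (an illustration, not from the original post) of how a job could query a device’s compute mode through the runtime API before trying to use it; the computeMode field of cudaDeviceProp distinguishes normal, compute-exclusive, and prohibited devices:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
  int dev = (argc > 1) ? atoi(argv[1]) : 0;

  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
    printf("Could not query device %d\n", dev);
    return 1;
  }

  const char *mode =
    (prop.computeMode == cudaComputeModeDefault)    ? "default (shared)" :
    (prop.computeMode == cudaComputeModeExclusive)  ? "compute exclusive" :
    (prop.computeMode == cudaComputeModeProhibited) ? "prohibited" : "unknown";

  printf("Device %d (%s): compute mode = %s\n", dev, prop.name, mode);
  return 0;
}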

Try with the CUDA 3.2 driver; that should solve any problems with exclusive mode.

SPWorley,

I created a bug report based on your code which would ULF even on a single-GPU system. I removed the pthreads, zero-copy, etc., to produce the most basic code that will cause ULFs.

NVIDIA has confirmed it and said that the fix will be available in the next CUDA release. I will notify this thread when I get notification of the fix in an intermediate driver release.

Best Regards,
Sarnath
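Sarnath’s reduced test is not posted in this thread, but a stripped-down reproducer of the kind he describes (no pthreads, no zero-copy, just one long-running kernel on the device named on the command line) would presumably look something like the sketch below. The details are a guess, not his actual code.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Long busy-loop kernel, same idea as multi.cu's kernel. */
__global__ void spin(int in, int *out)
{
  if (threadIdx.x != 0 || blockIdx.x != 0) return;
  int v = in;
  for (int i = 0; i < 0x10000000; i++) v = 123 * v + 456;
  *out = (v != 1) ? in : v;  // data dependence keeps the loop from being optimized away
}

int main(int argc, char **argv)
{
  int dev = (argc > 1) ? atoi(argv[1]) : 0;

  if (cudaSetDevice(dev) != cudaSuccess) { printf("cudaSetDevice failed\n"); return 1; }

  int *d_out, h_out = 0;
  if (cudaMalloc((void **)&d_out, sizeof(int)) != cudaSuccess) { printf("cudaMalloc failed\n"); return 1; }

  spin<<<64, 64>>>(dev, d_out);
  cudaError_t err = cudaThreadSynchronize();  // 3.x-era sync; reports launch failures such as ULFs
  if (err != cudaSuccess) { printf("launch failed: %s\n", cudaGetErrorString(err)); return 1; }

  cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  printf("Device %d returned %d\n", dev, h_out);
  cudaFree(d_out);
  return 0;
}

Running two instances of something this small from two shells, on two different devices, is the same experiment as the original multi.cu without any of the host-side threading.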

This bug must have been fixed with drivers 270.26 and later.

SPWorley, if you have time, you can check this out. However, the Ctrl-C behaviour will not be fixed with this driver.

I am waiting for the CUDA 4.0 driver… I will test for both of these behaviours with the new driver.

I tried with 270.27 and the bug still persists. I have updated the bug report. Awaiting the fix…