I will likely submit a new bug on the developer site, but before I do I’d love to have other people with multi-GPU systems try to reproduce it.
The basic problem: When a process starts a kernel on one device, a different process using a different device can be affected by that first process.
It may block until the first process is done (even though it’s using a different device) or it may just fail.
Attached is simple source code that basically just launches a 30-second busy-loop on the device(s) specified on the command line.
If you have multiple GPUs, give it a try! It compiles with just “nvcc multi.cu -o multi”. If your non-display GPUs are device 0 and device 2, try running “./multi 0” in one shell and “./multi 2” from a second shell.
Does it fail?
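For reference, here’s roughly what a successful “./multi 0” run prints (reconstructed from the printf calls in the source below; your device name will differ, and the thread messages may interleave a little differently):

About to SetDeviceFlags
SetDeviceFlags finished.
cudaHostAlloc finished.
Starting thread for compute on device 0 : GeForce GTX 295
All threads launched
Device 0 initialized successfully, starting compute.
Device 0 finished compute.
All threads completed.
Device 0 returned 100000

A failure is anything else: a hang partway through that list, an error message, or an early exit.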
There are more examples in the source code.
I get multiple different failure modes, not always reproducible. I’d like to have a couple more data points for the bug report when I submit it.
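If you post results, please include your toolkit and driver versions. If you don’t have them handy, this little standalone program will print what the runtime sees (this is just a quick sketch of mine, not part of the attached test; cudaDriverGetVersion() and cudaRuntimeGetVersion() are standard runtime API calls). Compile it the same way: “nvcc versions.cu -o versions”.

#include <stdio.h>

int main()
{
    int driverVersion=0, runtimeVersion=0;
    cudaDriverGetVersion(&driverVersion);   // e.g. 3000 means CUDA 3.0
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver version %d, runtime version %d\n", driverVersion, runtimeVersion);
    return 0;
}

Anyway, here’s the test program itself: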
#include <iostream>
#include <stdio.h>
#include <stdlib.h>   // atoi(), exit()
#include <string.h>   // memset()
#include <ctype.h>    // isdigit()
#include <ctime>
#include <cassert>
#include <vector>
#include <pthread.h>  // one host thread per device
/* Test tool to show off multi-process problems with multiple CUDA GPUs.
Sept 1 2010 Steve Worley.
public domain.
compile with nvcc multi.cu -o multi
This example program runs a load on one or more CUDA devices, waits for the computes to finish, and exits. Simply pass the device ID(s) as arguments; for example, "./multi 0 2 3" runs on three devices: #0, #2, and #3.
This tool helps show multi-GPU failure modes. They are not always
reproducible, but after a few tries they can happen pretty quickly.
In these examples, I used toolkit 3.0, driver 256.44, Ubuntu 10.04, 64-bit.
The system has 2 GTX295s, device numbers 0, 2, 3, and 4. The display adapter
is an embedded nForce 980a, using device 1.
All problems occur only when two different PROCESSES run CUDA apps on DIFFERENT devices simultaneously.
Typical failures:
----
In one shell run ./multi 0
In a different shell run ./multi 2
the second shell will (usually) successfully call SetDeviceFlags() but then hang at the cudaHostAlloc() call before the devices can even be started. When the first process completes (after about 30 seconds) the second process will wake up, and the second device will initialize and run.
----
In one shell run ./multi 2
In a different shell run ./multi 0
The opposite order of the above.
Typically device 2 will run properly, but the device 0 process will fail
and exit at the HostGetDevicePointer() call.
----
In one shell run ./multi 0 2
In a different shell run ./multi 3 4
The above pair will often fail like the first case, but sometimes both will work.
In the cases where it does work, interrupt one of the runs with ^C. After a pause, it will exit. But then the OTHER process (running on independent devices) will also exit.
*/
using std::vector;
vector<int> deviceID;
int *h_results; // global zero-copy results buffer, one word per device
__global__ void kernel(int in, int *out)
{
    /* only thread 0 of block 0 does any work; the other threads exit */
    if (0!=threadIdx.x || 0!=blockIdx.x) return;
    int v=in;
    /* slow useless compute. About 30 seconds on a GTX295. */
    for (int i=0; i<0x10000000; i++) {
        v=123*v+456;
    }
    if (v!=1) v=0; // can never happen but compiler doesn't know that
    *out=100000+in+v;
}
static void* localThreadFunction(void *ptr)
{
    int index= *((int *)ptr);
    int ID= deviceID[index];
    cudaError_t err;
    int *d_result;

    /* Flags must be set BEFORE context creation */
    err=cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaSuccess!=err) {
        printf("Thread failed to set device flags: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    err=cudaSetDevice(ID);
    if (cudaSuccess!=err) {
        printf("Thread failed to set device: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    /* get this device's view of the portable, mapped results buffer */
    err=cudaHostGetDevicePointer((void **)&d_result, h_results, 0);
    if (cudaSuccess!=err) {
        printf("CUDA failure in thread HostGetDevicePointer: %s\n",
               cudaGetErrorString(err));
        exit(1);
    }
    printf("Device %d initialized successfully, starting compute.\n", ID);
    kernel<<<64, 64>>>(ID, d_result+index);
    cudaThreadSynchronize(); // block until the kernel actually finishes
    printf("Device %d finished compute.\n", ID);
    return NULL;
}
static void usage()
{
    std::cout << "multi: first_device_ID second_device_ID third...." << std::endl;
    exit(0);
}
int main(int argc, char **argv)
{
    cudaError_t err;

    if (argc<2) usage();
    for (int i=1; i<argc; i++) {
        /* accept only non-negative numeric device IDs */
        if (!isdigit(argv[i][0])) usage();
        deviceID.push_back(atoi(argv[i]));
    }
    /* prepare zero-copy memory for a returned result, one word per device */
    printf("About to SetDeviceFlags\n");
    err=cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaSuccess!=err) {
        printf("CUDA failure calling SetDeviceFlags: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    printf("SetDeviceFlags finished.\n");

    err=cudaHostAlloc((void **)&h_results, deviceID.size()*sizeof(int),
                      cudaHostAllocMapped|cudaHostAllocPortable);
    if (cudaSuccess!=err) {
        printf("CUDA failure in HostAlloc: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    printf("cudaHostAlloc finished.\n");
    memset(h_results, 0, deviceID.size()*sizeof(int));
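    /* h_results starts all-zero; each device should overwrite its word with
       a nonzero value (100000+ID), so a device that never ran is obvious in
       the final printout. */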
    /* launch one thread per device to do a compute */
    pthread_t thread[16];
    int arg[16];
    assert(deviceID.size()<=16); // fixed-size thread/arg arrays above

    for (size_t i=0; i<deviceID.size(); i++) {
        arg[i]=(int)i;
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, deviceID[i]);
        printf("Starting thread for compute on device %d : %s\n",
               deviceID[i], deviceProp.name);
        (void)pthread_create(&thread[i], NULL, localThreadFunction,
                             (void *)(&arg[i]));
    }
printf("All threads launched\n");
for (int i=0; i< deviceID.size(); i++)
pthread_join(thread[i], NULL);
printf("All threads completed.\n");
for (int i=0; i<deviceID.size(); i++) {
printf("Device %d returned %d\n", deviceID[i], h_results[i]);
}
}
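One more variant that might help narrow this down (my suggestion, not part of the original test above): the same 30-second kernel with no zero-copy memory at all, just cudaMalloc() and cudaMemcpy(). If two processes running this on different devices still block each other, then the problem isn't specific to the cudaSetDeviceFlags()/cudaHostAlloc() mapped-memory path.

/* minimal single-device variant, no zero-copy. compile: nvcc simple.cu -o simple
   run "./simple 0" and "./simple 2" in two shells, as above. */
#include <stdio.h>
#include <stdlib.h>

__global__ void kernel(int in, int *out)
{
    if (0!=threadIdx.x || 0!=blockIdx.x) return;
    int v=in;
    for (int i=0; i<0x10000000; i++) v=123*v+456; // same ~30 second busy loop
    if (v!=1) v=0;
    *out=100000+in+v;
}

int main(int argc, char **argv)
{
    if (argc!=2) { printf("usage: simple device_ID\n"); return 1; }
    int ID=atoi(argv[1]);
    if (cudaSuccess!=cudaSetDevice(ID)) { printf("cudaSetDevice failed\n"); return 1; }

    int *d_result;
    if (cudaSuccess!=cudaMalloc((void **)&d_result, sizeof(int))) {
        printf("cudaMalloc failed\n"); return 1;
    }
    printf("Device %d starting compute.\n", ID);
    kernel<<<64, 64>>>(ID, d_result);

    int result=0;
    /* cudaMemcpy implicitly waits for the kernel to finish */
    if (cudaSuccess!=cudaMemcpy(&result, d_result, sizeof(int),
                                cudaMemcpyDeviceToHost)) {
        printf("cudaMemcpy failed\n"); return 1;
    }
    printf("Device %d returned %d\n", ID, result);
    return 0;
}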