I will likely submit a new bug on the developer site, but before I do I’d love to have other people with multi-GPU systems try to reproduce it.
The basic problem: When a process starts a kernel on one device, a different process using a different device can be affected by that first process.
It may block until the first process is done (even though it’s using a different device) or it may just fail.
Attached is simple source code that basically just launches a 30-second busy-loop on the device(s) specified on the command line.
If you have multiple GPUs, give it a try! It compiles with just “nvcc multi.cu -o multi”. If your non-display GPUs are device 0 and device 2, try running “./multi 0” in one shell and “./multi 2” from a second shell.
Does it fail?
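For reference, here’s roughly what a successful “./multi 0” run prints (reconstructed from the printf calls in the source below; your device name will differ, and the thread messages may interleave a little differently):

About to SetDeviceFlags
SetDeviceFlags finished.
cudaHostAlloc finished.
Starting thread for compute on device 0 : GeForce GTX 295
All threads launched
Device 0 initialized successfully, starting compute.
Device 0 finished compute.
All threads completed.
Device 0 returned 100000

A failure is anything else: a hang partway through that list, an error message, or an early exit.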
There are more examples in the source code.
I get multiple different failure modes, not always reproducible. I’d like to have a couple more data points for the bug report when I submit it.
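If you post results, please include your toolkit and driver versions. If you don’t have them handy, this little standalone program will print what the runtime sees (this is just a quick sketch of mine, not part of the attached test; cudaDriverGetVersion() and cudaRuntimeGetVersion() are standard runtime API calls). Compile it the same way: “nvcc versions.cu -o versions”.

#include <stdio.h>

int main()
{
    int driverVersion=0, runtimeVersion=0;
    cudaDriverGetVersion(&driverVersion);   // e.g. 3000 means CUDA 3.0
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver version %d, runtime version %d\n", driverVersion, runtimeVersion);
    return 0;
}

Anyway, here’s the test program itself: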
#include <iostream>
#include <stdio.h>
#include <stdlib.h>   // atoi(), exit()
#include <string.h>   // memset()
#include <ctype.h>    // isdigit()
#include <ctime>
#include <cassert>
#include <vector>
#include <pthread.h>  // one host thread per device
/* Test tool to show off multi-process problems with multiple CUDA GPUs.
Sept 1 2010 Steve Worley.
public domain.
compile with nvcc multi.cu -o multi
This example program runs a load on one or more CUDA devices, waits for the computes to finish, and exits. Simply pass the device ID(s) as arguments; for example, "./multi 0 2 3" runs on three devices: #0, #2, and #3.
This tool helps show multi-GPU failure modes. They are not always
reproducible, but after a few tries they can happen pretty quickly.
In these examples, I used toolkit 3.0, driver 256.44, Ubuntu 10.04, 64-bit.
The system has 2 GTX295s, device numbers 0, 2, 3, and 4. The display adapter
is an embedded nForce 980a, using device 1.
All problems occur only when two different PROCESSES run CUDA apps on DIFFERENT devices simultaneously.
Typical failures:
----
In one shell run ./multi 0
In a different shell run ./multi 2
the second shell will (usually) successfully call SetDeviceFlags() but then hang at the cudaHostAlloc() call before the devices can even be started. When the first process completes (after about 30 seconds) the second process will wake up, and the second device will initialize and run.
----
In one shell run ./multi 2
In a different shell run ./multi 0
The opposite order of the above.
Typically device 2 will run properly, but the device 0 process will fail
and exit at the HostGetDevicePointer() call.
----
In one shell run ./multi 0 2
In a different shell run ./multi 3 4
The above pair will often fail like the first case, but sometimes both will work.
In the cases where it does work, interrupt one of the runs with ^C. After a pause, it will exit. But then the OTHER process (running on independent devices) will also exit.
*/
using std::vector;
vector<int> deviceID;
int *h_results; // global zero-copy results buffer, one word per device
__global__ void kernel(int in, int *out)
{
    /* only thread 0 of block 0 does any work; the other threads exit */
    if (0!=threadIdx.x || 0!=blockIdx.x) return;
    int v=in;
    /* slow useless compute. About 30 seconds on a GTX295. */
    for (int i=0; i<0x10000000; i++) {
        v=123*v+456;
    }
    if (v!=1) v=0; // can never happen but compiler doesn't know that
    *out=100000+in+v;
}
static void* localThreadFunction(void *ptr)
{
    int index= *((int *)ptr);
    int ID= deviceID[index];
    cudaError_t err;
    int *d_result;

    /* Flags must be set BEFORE context creation */
    err=cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaSuccess!=err) {
        printf("Thread failed to set device flags: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    err=cudaSetDevice(ID);
    if (cudaSuccess!=err) {
        printf("Thread failed to set device: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    /* get this device's view of the portable, mapped results buffer */
    err=cudaHostGetDevicePointer((void **)&d_result, h_results, 0);
    if (cudaSuccess!=err) {
        printf("CUDA failure in thread HostGetDevicePointer: %s\n",
               cudaGetErrorString(err));
        exit(1);
    }
    printf("Device %d initialized successfully, starting compute.\n", ID);
    kernel<<<64, 64>>>(ID, d_result+index);
    cudaThreadSynchronize(); // block until the kernel actually finishes
    printf("Device %d finished compute.\n", ID);
    return NULL;
}
static void usage()
{
    std::cout << "multi: first_device_ID second_device_ID third...." << std::endl;
    exit(0);
}
int main(int argc, char **argv)
{
    cudaError_t err;

    if (argc<2) usage();
    for (int i=1; i<argc; i++) {
        /* accept only non-negative numeric device IDs */
        if (!isdigit(argv[i][0])) usage();
        deviceID.push_back(atoi(argv[i]));
    }
    /* prepare zero-copy memory for a returned result, one word per device */
    printf("About to SetDeviceFlags\n");
    err=cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaSuccess!=err) {
        printf("CUDA failure calling SetDeviceFlags: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    printf("SetDeviceFlags finished.\n");

    err=cudaHostAlloc((void **)&h_results, deviceID.size()*sizeof(int),
                      cudaHostAllocMapped|cudaHostAllocPortable);
    if (cudaSuccess!=err) {
        printf("CUDA failure in HostAlloc: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    printf("cudaHostAlloc finished.\n");
    memset(h_results, 0, deviceID.size()*sizeof(int));
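    /* h_results starts all-zero; each device should overwrite its word with
       a nonzero value (100000+ID), so a device that never ran is obvious in
       the final printout. */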
    /* launch one thread per device to do a compute */
    pthread_t thread[16];
    int arg[16];
    assert(deviceID.size()<=16); // fixed-size thread/arg arrays above

    for (size_t i=0; i<deviceID.size(); i++) {
        arg[i]=(int)i;
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, deviceID[i]);
        printf("Starting thread for compute on device %d : %s\n",
               deviceID[i], deviceProp.name);
        (void)pthread_create(&thread[i], NULL, localThreadFunction,
                             (void *)(&arg[i]));
    }
printf("All threads launched\n");
for (int i=0; i< deviceID.size(); i++)
pthread_join(thread[i], NULL);
printf("All threads completed.\n");
for (int i=0; i<deviceID.size(); i++) {
printf("Device %d returned %d\n", deviceID[i], h_results[i]);
}
}
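One more variant that might help narrow this down (my suggestion, not part of the original test above): the same 30-second kernel with no zero-copy memory at all, just cudaMalloc() and cudaMemcpy(). If two processes running this on different devices still block each other, then the problem isn't specific to the cudaSetDeviceFlags()/cudaHostAlloc() mapped-memory path.

/* minimal single-device variant, no zero-copy. compile: nvcc simple.cu -o simple
   run "./simple 0" and "./simple 2" in two shells, as above. */
#include <stdio.h>
#include <stdlib.h>

__global__ void kernel(int in, int *out)
{
    if (0!=threadIdx.x || 0!=blockIdx.x) return;
    int v=in;
    for (int i=0; i<0x10000000; i++) v=123*v+456; // same ~30 second busy loop
    if (v!=1) v=0;
    *out=100000+in+v;
}

int main(int argc, char **argv)
{
    if (argc!=2) { printf("usage: simple device_ID\n"); return 1; }
    int ID=atoi(argv[1]);
    if (cudaSuccess!=cudaSetDevice(ID)) { printf("cudaSetDevice failed\n"); return 1; }

    int *d_result;
    if (cudaSuccess!=cudaMalloc((void **)&d_result, sizeof(int))) {
        printf("cudaMalloc failed\n"); return 1;
    }
    printf("Device %d starting compute.\n", ID);
    kernel<<<64, 64>>>(ID, d_result);

    int result=0;
    /* cudaMemcpy implicitly waits for the kernel to finish */
    if (cudaSuccess!=cudaMemcpy(&result, d_result, sizeof(int),
                                cudaMemcpyDeviceToHost)) {
        printf("cudaMemcpy failed\n"); return 1;
    }
    printf("Device %d returned %d\n", ID, result);
    return 0;
}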