[SOLVED] cudaHostRegister returns "unknown error" on multi-GPU system

I am running a relatively large project on a multi-GPU system containing one GTX 980 Ti and one GTX 1070. The project only uses one of the GPUs when executed.

I compiled with the arch flags for both compute capabilities:

--generate-code arch=compute_52,code=sm_52 \
--generate-code arch=compute_61,code=sm_61

In the code, each of the following API calls has its own error handling. To be sure the reported error code really originated from that particular API call, I also ran it once with a call to cudaGetLastError() preceding each actual API call (which clears any stale error state).
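The checking pattern is essentially the following (a simplified sketch; the actual macro name and logging in the project differ):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Clear any stale error state first, then run the call and check its own
  // return value, so the reported error really belongs to this call.
  #define CHECK_CUDA(call)                                               \
    do {                                                                 \
      cudaGetLastError();                                                \
      cudaError_t err_ = (call);                                         \
      if (err_ != cudaSuccess) {                                         \
        std::fprintf(stderr, "%s failed at %s:%d: (%d) %s\n", #call,     \
                     __FILE__, __LINE__, int(err_),                      \
                     cudaGetErrorString(err_));                          \
      }                                                                  \
    } while (0)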

In the code, the first cuda related commands executed are:

...
  int device{ 0 };
  int driver_version{ 0 };
  int runtime_version{ 0 };
  cudaDeviceProp device_property;

  cudaSetDevice(device);
  cudaGetDeviceProperties(&device_property, device);
  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);

followed by printing the queried properties and versions; this works correctly and outputs the expected values.

Then the next call is to cudaHostRegister to page-lock a buffer provided by a different module of the project to allow for asynchronous data transfers.

cudaHostRegister(h_Ptr, bytes, cudaHostRegisterPortable);

This fails with error (30) “unknown error”.
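For context, the intended usage pattern is roughly the following (a simplified sketch; d_Ptr, stream and the error handling are placeholders, not the actual project code):

  // h_Ptr is a page-aligned buffer handed over by another module.
  cudaError_t err = cudaHostRegister(h_Ptr, bytes, cudaHostRegisterPortable);
  if (err != cudaSuccess) { /* this is where (30) "unknown error" shows up */ }

  // Once page-locked, the buffer can feed truly asynchronous copies:
  cudaMemcpyAsync(d_Ptr, h_Ptr, bytes, cudaMemcpyHostToDevice, stream);
  ...
  cudaHostUnregister(h_Ptr);  // undo the page-locking when the buffer is released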

Some investigation showed that this error only occurs when both GPUs are visible. As soon as I provide

CUDA_VISIBLE_DEVICES=0,-1,1

or

CUDA_VISIBLE_DEVICES=1,-1,0

to hide one of the GPUs, the project runs without issues on whichever GPU remains visible!

Does anyone have an idea what might cause this behavior?

System:

  • CUDA 8.0.44
  • Nvidia drivers tested: 367.44, 370.28
  • Ubuntu 14.04.5 LTS
  • kernel: 4.4.0-45-lowlatency

nvidia-smi returns the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:05:00.0      On |                  N/A |
| 26%   62C    P8    26W / 250W |    362MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 0000:06:00.0     Off |                  N/A |
| 27%   37C    P8    10W / 100W |      1MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Try

-gencode arch=compute_52,code=compute_52 \
    -gencode arch=compute_61,code=compute_61
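For example, embedding both the SASS and the PTX for each architecture would look roughly like this (file names are placeholders):

  nvcc -gencode arch=compute_52,code=sm_52 \
       -gencode arch=compute_52,code=compute_52 \
       -gencode arch=compute_61,code=sm_61 \
       -gencode arch=compute_61,code=compute_61 \
       -o app main.cu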

Thank you for the suggestion, but I should probably have mentioned that I also tried it with JIT compilation (embedding PTX) and it did not change the behavior.

Also, the correct device code for each compute capability must be present, since the project runs successfully on either GPU without recompilation as long as I make only one of them visible to CUDA via the CUDA_VISIBLE_DEVICES environment variable.

UPDATE:

The error seems to only originate from cudaHostRegister().

If I comment out the host registration and just use non-page-locked memory, the whole project, including all other API calls and kernels, runs successfully even when both cards are visible. (Of course it runs slower: without page-locked memory all the asynchronous memory copies fall back to synchronous copies.)

What could potentially go wrong with cudaHostRegister if there are two GPUs of different compute capability in the system?

To my understanding, cudaHostRegister should work independently of the GPU. And why would it work on either GPU as long as the other one is made invisible?

If I find the time, I will try to build a small test case that reproduces this behavior.
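A minimal test case would presumably only need something like this (untested sketch; 64 MiB and 4 KiB page alignment are arbitrary choices):

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  int main() {
    const size_t bytes = size_t(64) << 20;  // 64 MiB of generic system memory
    void *h_ptr = nullptr;
    if (posix_memalign(&h_ptr, 4096, bytes) != 0) return 1;  // page-aligned

    cudaError_t err = cudaHostRegister(h_ptr, bytes, cudaHostRegisterPortable);
    std::printf("cudaHostRegister: (%d) %s\n", int(err), cudaGetErrorString(err));

    if (err == cudaSuccess) cudaHostUnregister(h_ptr);
    free(h_ptr);
    return err == cudaSuccess ? 0 : 1;
  }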

I just did a quick search through the CUDA 8.0 samples and tried the simpleStreams program, which uses cudaHostRegister, and voilà:

duadsf00020:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleStreams$ ./simpleStreams 
[ simpleStreams ]

Device synchronization method set to = 0 (Automatic Blocking)
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "GeForce GTX 1070" with compute capability 6.1

Device: <GeForce GTX 1070> canMapHostMemory: Yes
> CUDA Capable: SM 6.1 hardware
> 14 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 1792 (Cores)
> scale_factor = 1.0000
> array_size   = 16777216

> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleAuto)
> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)
> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory
CUDA error at simpleStreams.cu:116 code=30(cudaErrorUnknown) "cudaHostRegister(*ppAligned_a, nbytes, cudaHostRegisterMapped)"

If I make one of the GPUs invisible, it works on whichever one remains visible.
GTX 980 Ti:

duadsf00020:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleStreams$ export CUDA_VISIBLE_DEVICES=1,-1,0
duadsf00020:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleStreams$ ./simpleStreams 
[ simpleStreams ]

Device synchronization method set to = 0 (Automatic Blocking)
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "GeForce GTX 980 Ti" with compute capability 5.2

Device: <GeForce GTX 980 Ti> canMapHostMemory: Yes
> CUDA Capable: SM 5.2 hardware
> 22 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 2816 (Cores)
> scale_factor = 1.0000
> array_size   = 16777216

> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleAuto)
> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)
> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory

Starting Test
memcopy:        5.12
kernel:         3.02
non-streamed:   7.95
4 streams:      5.30
-------------------------------

Or the other one, GTX 1070:

duadsf00020:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleStreams$ export CUDA_VISIBLE_DEVICES=0,-1,1
duadsf00020:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleStreams$ ./simpleStreams 
[ simpleStreams ]

Device synchronization method set to = 0 (Automatic Blocking)
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "GeForce GTX 1070" with compute capability 6.1

Device: <GeForce GTX 1070> canMapHostMemory: Yes
> CUDA Capable: SM 6.1 hardware
> 14 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 1792 (Cores)
> scale_factor = 1.0000
> array_size   = 16777216

> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleAuto)
> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)
> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory

Starting Test
memcopy:        5.13
kernel:         2.62
non-streamed:   7.58
4 streams:      5.27
-------------------------------

Hence I can rule out any compilation or implementation errors within our project.

Any ideas?

Could you please try a 375 driver? There was a bug in the previous drivers affecting cudaHostRegister on multi-GPU configurations.

http://www.nvidia.com/download/driverResults.aspx/111596/en-us

Thank you, that solved it.

I should have tried this driver first; I was not yet aware of any cudaHostRegister bug in the previous drivers.