IBM Power8: CUDA driver version is insufficient for CUDA runtime version

shivipr · November 29, 2016, 7:31pm

Hi,

Here is the issue that I have been facing. I have an S822LC system with a 2-socket (10-core) Power8 processor and 4x Tesla P100 cards. I have installed Ubuntu 16.04 on the system. I am now trying to install CUDA 8.0 and NVIDIA driver 361.93.03 for the Tesla P100 cards, and I get the following error when running bandwidthTest from the CUDA samples:

cudaGetDeviceProperties returned 35
-> CUDA driver version is insufficient for CUDA runtime version
CUDA error at bandwidthTest.cu:242 code=35(cudaErrorInsufficientDriver) "cudaSetDevice(currentDevice)"

Here is the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi detects the 4x Tesla P100 cards, but I still get the "CUDA driver version is insufficient for CUDA runtime version" error when running the CUDA samples. Would you happen to know the right versions of CUDA and the NVIDIA driver for this case?

Thanks
Shivi

Robert_Crovella · November 29, 2016, 11:12pm

Did you get the 361.93.03 driver from here:

http://www.nvidia.com/download/driverResults.aspx/110987/en-us

Did you get the CUDA 8 toolkit from here:

https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb

(i.e. from here: https://developer.nvidia.com/cuda-downloads)

If so, my suspicion is a runfile installer/deb installer clash. The potential for this is outlined in the Linux getting started guide:

https://developer.nvidia.com/compute/cuda/8.0/prod/docs/sidebar/CUDA_Installation_Guide_Linux-pdf

i.e. section 2.7 here:

Installation Guide Linux :: CUDA Toolkit Documentation

To get things working for you, you could start by just using the CUDA 8 deb method to install CUDA 8 (don’t install a driver separately). It should pull in 361.93.02, and that driver is also compatible with Tesla P100.
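One way to check which driver version actually ended up loaded after such an install is sketched below. It is only a sketch: `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and using 361.93.02 as the floor is an assumption taken from this thread (it is the driver the CUDA 8.0.44 deb pulls in), not an official minimum.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when dotted version A is >= B.
# Hypothetical helper; depends on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: 361.93.02 is the floor only because it is the driver
# version that the CUDA 8.0.44 deb installed in this thread.
required="361.93.02"

if command -v nvidia-smi >/dev/null 2>&1; then
    # Ask the driver for its own version; take the first GPU's answer.
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if [ -z "$installed" ]; then
        echo "nvidia-smi returned no driver version"
    elif version_ge "$installed" "$required"; then
        echo "driver $installed is at least $required"
    else
        echo "driver $installed is older than $required"
    fi
else
    echo "nvidia-smi not found; the driver may not be installed"
fi
```

If the reported version is older than the one the toolkit expects, the cudaErrorInsufficientDriver error above is the likely result.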

To get things working with 361.93.03, I would try the following process, but I have not personally walked through it, so I can’t vouch for it yet:

Using the CUDA 8 toolkit Power deb indicated above, and assuming a clean Ubuntu 16.04 install:

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb
sudo apt-get update
sudo apt-get install cuda-toolkit

Then use the runfile install method using the 361.93.03 runfile installer I indicated above.

If you installed the CUDA 8 toolkit or the 361.93.03 driver by methods other than what I surmised above, disregard what I said here. Instead, if you want help, identify exactly where you acquired 361.93.03 and the CUDA 8 toolkit for Power, and the exact method you used to install them.
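To help pin down how the toolkit and driver were installed, something like the following could be run and its output posted. It is a minimal sketch assuming a Debian-style system; `filter_gpu_packages` is a hypothetical helper name, and the package-name prefixes it matches are a guess at what the CUDA and driver debs install.

```shell
#!/usr/bin/env bash
# Keep only lines whose package name starts with nvidia, cuda or libcuda.
# Hypothetical helper; the prefixes are an assumption, adjust as needed.
filter_gpu_packages() {
    grep -E '^(nvidia|cuda|libcuda)' || true
}

# List installed packages as "name version" and keep the GPU-related ones.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package} ${Version}\n' | filter_gpu_packages
fi

# When the NVIDIA kernel module is loaded, it reports its version here.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
fi
```

A driver version in /proc that does not match any installed driver package would be one hint that two install methods were mixed.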

shivipr · November 30, 2016, 12:04am

Thanks so much for getting back to me.

I already tried the first approach where I didn’t install any drivers separately and only installed cuda. In that case even the nvidia-smi command didn’t work.

Now I want to try the second approach, but I couldn’t find the runfile for the 361.93.03 driver.

JSYK, I got the 361.93.03 driver (deb package) from here:
http://www.nvidia.com/download/driverResults.aspx/109509/en-us

I got the one for Ubuntu; the one that you mentioned is for RHEL.

Any other suggestions?

Robert_Crovella · November 30, 2016, 12:18am

Yes, my mistake: there is no published runfile, only a .deb package for Ubuntu and an .rpm for RHEL.

What did you use to install CUDA 8? Was it from the deb link I mentioned?

shivipr · November 30, 2016, 12:23am

Yes, it is from the same deb link.

Thanks

Robert_Crovella · November 30, 2016, 3:23am

This worked for me. I just tested it now. A colleague of mine set up an IBM S822LC for HPC system with 4 P100 GPUs in it, with a fresh load of Ubuntu 16.04.1 LTS:

# uname -a
Linux nchpc-g0 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:38:24 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

I then grabbed the aforementioned .deb file:

https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb

And did all the following as root:

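# rename the downloaded file from -deb to .deb
mv cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb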
dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb
apt-get update
apt-get install cuda

This process visibly includes installing the nvidia-361 package, which is the NVIDIA driver. You can see clear evidence of the 361.93.02 driver kernel module being built and installed.

After that, again as root, without even a reboot, I did:

# nvidia-smi
Tue Nov 29 22:15:25 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0002:01:00.0     Off |                    0 |
| N/A   34C    P0    35W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0003:01:00.0     Off |                    0 |
| N/A   31C    P0    36W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 000A:01:00.0     Off |                    0 |
| N/A   33C    P0    35W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 000B:01:00.0     Off |                    0 |
| N/A   31C    P0    36W / 300W |      0MiB / 16280MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If you’re not able to duplicate that sequence, I’m not sure what the difference may be.

shivipr · November 30, 2016, 3:56am

Thanks so much for getting back to me. I am trying a fresh install of Ubuntu here. Can you please confirm if you can run bandwidthTest or deviceQuery?

Thanks much.

Best
Shivani

Robert_Crovella · November 30, 2016, 4:25am

So far I have done everything as root. Yes, I am able to run the CUDA sample codes successfully:

# /usr/local/cuda/samples/bin/ppc64le/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     29517.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     21339.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     449507.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
# /usr/local/cuda/samples/bin/ppc64le/linux/release/deviceQuery
/usr/local/cuda/samples/bin/ppc64le/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   3 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   10 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   11 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
#

njuffa · November 30, 2016, 6:24am

Interesting to see the host<->device bandwidth for this system, at about twice the PCIe gen3 x16 rate (typically 12 GB/sec).

Robert_Crovella · December 1, 2016, 5:55pm

In fact those numbers (~21GB/s and ~29GB/s) are a bit low. The link peak theoretical throughput is 40GB/s for each direction, and we typically see ~32GB/s there in either direction as a measurement. There is something not quite right with the box I was testing on – for example it may have incorporated some pre-production hardware, as all of this is pretty new.

My intent in posting that was to demonstrate that from a software install perspective, CUDA was functional. I did not mean it to be representative of current hardware behavior from a performance perspective.
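For reference, the measured bandwidthTest figures above can be put next to the 40GB/s theoretical peak with a little arithmetic. This is a rough sketch: `pct_of_peak` is a hypothetical helper, and it treats 1 GB/s as 1000 MB/s, which is the same rounding used in the ~21GB/s and ~29GB/s figures quoted here.

```shell
#!/usr/bin/env bash
# pct_of_peak MEASURED_MB_S PEAK_GB_S
# Print the measured rate as a percentage of the peak, to one decimal.
# Assumption: 1 GB/s is taken as 1000 MB/s.
pct_of_peak() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", (m / 1000) / p * 100 }'
}

pct_of_peak 29517.4 40   # host to device, pinned
pct_of_peak 21339.9 40   # device to host, pinned
```

Passing 32000 as the first argument gives the percentage for the ~32GB/s that is typically measured on such a link.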
