CUDA 9.1 on Ubuntu 16.04 ppc64le - all CUDA examples crash: "remap_4k_pfn called with wrong pfn value"

Configuration:

uname -a

Linux kilenc 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:41:53 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Using CUDA package:

dpkg -i cuda-repo-ubuntu1604_9.1.85-1_ppc64el.deb

Device: Tesla K20c shows up in lspci output

lspci



0033:01:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)

nvidia-smi

Mon May 7 21:05:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.31                 Driver Version: 390.31                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 00000033:01:00.0 Off |                    0 |
| 30%   30C    P0    54W / 225W |      0MiB /  4743MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Problem: every CUDA sample fails with an out-of-memory error. Here I use vectorAdd as an example:

root@hostname:/usr/local/cuda/samples/0_Simple/vectorAdd# make
/usr/local/cuda-9.1/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd.o -c vectorAdd.cu
/usr/local/cuda-9.1/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd vectorAdd.o
mkdir -p ../../bin/ppc64le/linux/release
cp vectorAdd ../../bin/ppc64le/linux/release

./vectorAdd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!
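
One thing the build line above lets us rule out is an architecture mismatch: the K20c (GK110) is a compute capability 3.5 device, and the nvcc command does include sm_35. Below is a minimal Python sketch of that check. It assumes the nvcc command line is available as a string, and the helper name gencode_targets is made up for illustration.

```python
import re

def gencode_targets(nvcc_cmd):
    """Return the set of code= targets (e.g. 'sm_35') named in -gencode flags."""
    return set(re.findall(r"-gencode\s+arch=compute_\d+,code=(\w+)", nvcc_cmd))

# Shortened copy of the flags from the vectorAdd build line (assumption:
# the remaining -gencode pairs follow the same pattern).
cmd = ("nvcc -ccbin g++ -m64 "
       "-gencode arch=compute_30,code=sm_30 "
       "-gencode arch=compute_35,code=sm_35 "
       "-gencode arch=compute_70,code=compute_70 "
       "-o vectorAdd.o -c vectorAdd.cu")

targets = gencode_targets(cmd)
print(sorted(targets))
print("sm_35" in targets)  # True means the binary carries code for the K20c
```

Since sm_35 is present, the allocation failure is unlikely to come from the way the sample was compiled.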

The system log shows the following trace:

[ 493.376015] remap_4k_pfn called with wrong pfn value
[ 493.376054] ------------[ cut here ]------------
[ 493.376204] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.376205] Modules linked in: nvidia_uvm(POE) idt_89hpesx at24 ofpart opal_prd cmdlinepart powernv_flash uio_pdrv_genirq ipmi_powernv mtd vmx_crypto ibmpowernv uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport autofs4 ses enclosure scsi_transport_sas btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvidia(POE) crct10dif_vpmsum crc32c_vpmsum ipmi_devintf ipmi_msghandler tg3 aacraid
[ 493.376231] CPU: 52 PID: 7000 Comm: vectorAdd Tainted: P W OE 4.13.0-39-generic #44~16.04.1-Ubuntu
[ 493.376233] task: c000000397e92b00 task.stack: c000000397f70000
[ 493.376234] NIP: c00800000aafa8a4 LR: c00800000aafa8a0 CTR: 0000000000000000
[ 493.376235] REGS: c000000397f737a0 TRAP: 0700 Tainted: P W OE (4.13.0-39-generic)
[ 493.376235] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 493.376240] CR: 22042484 XER: 20040000
[ 493.376240] CFAR: c00000000017d700 SOFTE: 1
GPR00: c00800000aafa8a0 c000000397f73a20 c00800000ba15300 0000000000000028
GPR04: 0000000000000000 0000000000000000 c0002003fc09f908 687469772064656c
GPR08: 00000003fdc10000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000335 c00000000faa3c00 000000005c000002 0000000000000000
GPR16: 0000000000000003 0000000000000000 0000000000000008 c000000001791d18
GPR20: 00007b14d8390000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000620c180000000 0000000000002000 0000000000010000 c00020037e0f7000
GPR28: 0000000000000003 00620c1800000000 000620c180000fff c000000399675910
[ 493.376395] NIP [c00800000aafa8a4] nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.376534] LR [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia]
[ 493.376535] Call Trace:
[ 493.376675] [c000000397f73a20] [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia] (unreliable)
[ 493.376815] [c000000397f73ae0] [c00800000aafa940] nvidia_mmap+0x70/0xd0 [nvidia]
[ 493.376953] [c000000397f73b20] [c00800000aaf00ec] nvidia_frontend_mmap+0x5c/0x80 [nvidia]
[ 493.376960] [c000000397f73b40] [c000000000313230] mmap_region+0x490/0x6f0
[ 493.376961] [c000000397f73c20] [c000000000313890] do_mmap+0x400/0x4f0
[ 493.376964] [c000000397f73cb0] [c0000000002e21e4] vm_mmap_pgoff+0x114/0x160
[ 493.376965] [c000000397f73d90] [c0000000003105c4] SyS_mmap_pgoff+0xf4/0x300
[ 493.376968] [c000000397f73e10] [c000000000014d54] SyS_mmap+0x44/0x80
[ 493.376971] [c000000397f73e30] [c00000000000b184] system_call+0x58/0x6c
[ 493.376972] Instruction dump:
[ 493.376974] e8848710 38600004 48004acd 60000000 3860ffea 4bfffcd0 3860ffde 4bfffcc8
[ 493.376978] 3c620000 e8638728 489d4545 e8410018 <0fe00000> 3860fff5 4bfffb9c 3c620000
[ 493.376982] ---[ end trace 3e91fed7b8057e53 ]---
[ 493.399641] remap_4k_pfn called with wrong pfn value
[ 493.399673] ------------[ cut here ]------------
[ 493.399820] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.399821] Modules linked in: nvidia_uvm(POE) idt_89hpesx at24 ofpart opal_prd cmdlinepart powernv_flash uio_pdrv_genirq ipmi_powernv mtd vmx_crypto ibmpowernv uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport autofs4 ses enclosure scsi_transport_sas btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvidia(POE) crct10dif_vpmsum crc32c_vpmsum ipmi_devintf ipmi_msghandler tg3 aacraid
[ 493.399848] CPU: 52 PID: 7000 Comm: vectorAdd Tainted: P W OE 4.13.0-39-generic #44~16.04.1-Ubuntu
[ 493.399850] task: c000000397e92b00 task.stack: c000000397f70000
[ 493.399851] NIP: c00800000aafa8a4 LR: c00800000aafa8a0 CTR: 0000000000000350
[ 493.399852] REGS: c000000397f737a0 TRAP: 0700 Tainted: P W OE (4.13.0-39-generic)
[ 493.399853] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 493.399857] CR: 28042424 XER: 20040000
[ 493.399858] CFAR: c00000000017d700 SOFTE: 1
GPR00: c00800000aafa8a0 c000000397f73a20 c00800000ba15300 0000000000000028
GPR04: 0000000000000000 0000000000000000 c0002003fc0a05a4 687469772064656c
GPR08: 00000003fdc10000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000350 c00000000faa3c00 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 000000000000000c c000000001791d18
GPR20: 00007b14c0010000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000622000fde1000 0000000000002000 0000000000010000 c00020037e0f7000
GPR28: 0000000000000003 00622000fde10000 000622000fde1fff c000000399670af0
[ 493.400011] NIP [c00800000aafa8a4] nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.400151] LR [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia]
[ 493.400152] Call Trace:
[ 493.400292] [c000000397f73a20] [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia] (unreliable)
[ 493.400432] [c000000397f73ae0] [c00800000aafa940] nvidia_mmap+0x70/0xd0 [nvidia]
[ 493.400570] [c000000397f73b20] [c00800000aaf00ec] nvidia_frontend_mmap+0x5c/0x80 [nvidia]
[ 493.400575] [c000000397f73b40] [c000000000313230] mmap_region+0x490/0x6f0
[ 493.400577] [c000000397f73c20] [c000000000313890] do_mmap+0x400/0x4f0
[ 493.400579] [c000000397f73cb0] [c0000000002e21e4] vm_mmap_pgoff+0x114/0x160
[ 493.400581] [c000000397f73d90] [c0000000003105c4] SyS_mmap_pgoff+0xf4/0x300
[ 493.400584] [c000000397f73e10] [c000000000014d54] SyS_mmap+0x44/0x80
[ 493.400588] [c000000397f73e30] [c00000000000b184] system_call+0x58/0x6c
[ 493.400589] Instruction dump:
[ 493.400591] e8848710 38600004 48004acd 60000000 3860ffea 4bfffcd0 3860ffde 4bfffcc8
[ 493.400595] 3c620000 e8638728 489d4545 e8410018 <0fe00000> 3860fff5 4bfffb9c 3c620000
[ 493.400599] ---[ end trace 3e91fed7b8057e54 ]---
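
To pull the essentials out of a trace like the one above, here is a small Python sketch that scans saved dmesg text for WARNING lines and reports the PID, the source location, and the function and module involved. The regular expression is fitted only to the format seen in this log, and summarize_warnings is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   WARNING: CPU: 52 PID: 7000 at <file>:<line> <func>+0x634/0x660 [module]
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<where>\S+) "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)

def summarize_warnings(log_text):
    """Return one dict per kernel WARNING line found in log_text."""
    return [m.groupdict() for m in WARN_RE.finditer(log_text)]

# Two lines copied from the trace in this post.
sample = (
    "[ 493.376015] remap_4k_pfn called with wrong pfn value\n"
    "[ 493.376204] WARNING: CPU: 52 PID: 7000 at "
    "./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 "
    "nvidia_mmap_helper+0x634/0x660 [nvidia]\n"
)

for w in summarize_warnings(sample):
    print(w["pid"], w["where"], w["func"], w["module"])
```

For this log it reports PID 7000 (vectorAdd), a location in the kernel header hash-64k.h, and nvidia_mmap_helper in the nvidia module, which suggests the warning is raised on the driver's mmap path rather than in the sample itself.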

Running a K20c on PPC is not a supported configuration with CUDA 9/9.1.

Bob, is there a support matrix you could point me to for CUDA, NVIDIA GPUs and ppc64le? Given your feedback, I presume it’s K80 and above that may work on this platform?

The Linux install guide. Read all of section 1.1 carefully:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements