CUDA 9.1 on Ubuntu 16.04 ppc64le - all CUDA examples crash: "remap_4k_pfn called with wrong pfn value"

samu_gabor · May 8, 2018, 1:08am

Configuration →

uname -a

Linux kilenc 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:41:53 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Using CUDA package:

dpkg -i cuda-repo-ubuntu1604_9.1.85-1_ppc64el.deb

Device: Tesla K20c shows up in lspci output

lspci

…
…
0033:01:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Problem: all CUDA samples fail with an "out of memory" error. Here I use vectorAdd as an example:

root@hostname:/usr/local/cuda/samples/0_Simple/vectorAdd# make
/usr/local/cuda-9.1/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd.o -c vectorAdd.cu
/usr/local/cuda-9.1/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd vectorAdd.o
mkdir -p ../../bin/ppc64le/linux/release
cp vectorAdd ../../bin/ppc64le/linux/release

./vectorAdd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!
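
One quick thing to rule out is a build that lacks code for the GPU in question. The following is a minimal Python sketch, not part of the original post, that reads the -gencode flags from the nvcc link line above and checks them against a device's compute capability. The function names `embedded_targets` and `build_covers` are hypothetical helpers; the compatibility rule they encode (native code runs within the same major version at an equal or higher minor version, PTX can be JIT-compiled for any equal or higher capability) is the general CUDA rule, and the Tesla K20c (GK110) is compute capability 3.5.

```python
import re

# The nvcc link line from the build output above (flags copied verbatim).
NVCC_LINE = (
    "/usr/local/cuda-9.1/bin/nvcc -ccbin g++ -m64 "
    "-gencode arch=compute_30,code=sm_30 "
    "-gencode arch=compute_35,code=sm_35 "
    "-gencode arch=compute_37,code=sm_37 "
    "-gencode arch=compute_50,code=sm_50 "
    "-gencode arch=compute_52,code=sm_52 "
    "-gencode arch=compute_60,code=sm_60 "
    "-gencode arch=compute_61,code=sm_61 "
    "-gencode arch=compute_70,code=sm_70 "
    "-gencode arch=compute_70,code=compute_70 "
    "-o vectorAdd vectorAdd.o"
)


def embedded_targets(cmdline):
    """Split the code= targets into native (sm_XY) and PTX (compute_XY) sets."""
    native, ptx = set(), set()
    for kind, digits in re.findall(r"code=(sm|compute)_(\d+)", cmdline):
        cc = (int(digits) // 10, int(digits) % 10)  # "35" -> (3, 5)
        (native if kind == "sm" else ptx).add(cc)
    return native, ptx


def build_covers(device_cc, native, ptx):
    """Return True if a device of compute capability device_cc can load the binary.

    Native code for sm_XY runs on devices with the same major version and an
    equal or higher minor version; PTX for compute_XY can be JIT-compiled on
    any device of equal or higher compute capability.
    """
    major, minor = device_cc
    has_native = any(m == major and n <= minor for m, n in native)
    has_ptx = any(cc <= device_cc for cc in ptx)
    return has_native or has_ptx


if __name__ == "__main__":
    native, ptx = embedded_targets(NVCC_LINE)
    # (3, 5) is the compute capability of the Tesla K20c.
    print("K20c covered by this build:", build_covers((3, 5), native, ptx))
```

Since sm_35 is among the targets, the sample binary does carry native code for the K20c, which suggests the allocation failure comes from somewhere other than a missing -gencode target.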

The system log shows the following trace:

[ 493.376015] remap_4k_pfn called with wrong pfn value
[ 493.376054] ------------[ cut here ]------------
[ 493.376204] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.376205] Modules linked in: nvidia_uvm(POE) idt_89hpesx at24 ofpart opal_prd cmdlinepart powernv_flash uio_pdrv_genirq ipmi_powernv mtd vmx_crypto ibmpowernv uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport autofs4 ses enclosure scsi_transport_sas btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvidia(POE) crct10dif_vpmsum crc32c_vpmsum ipmi_devintf ipmi_msghandler tg3 aacraid
[ 493.376231] CPU: 52 PID: 7000 Comm: vectorAdd Tainted: P W OE 4.13.0-39-generic #44~16.04.1-Ubuntu
[ 493.376233] task: c000000397e92b00 task.stack: c000000397f70000
[ 493.376234] NIP: c00800000aafa8a4 LR: c00800000aafa8a0 CTR: 0000000000000000
[ 493.376235] REGS: c000000397f737a0 TRAP: 0700 Tainted: P W OE (4.13.0-39-generic)
[ 493.376235] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 493.376240] CR: 22042484 XER: 20040000
[ 493.376240] CFAR: c00000000017d700 SOFTE: 1
GPR00: c00800000aafa8a0 c000000397f73a20 c00800000ba15300 0000000000000028
GPR04: 0000000000000000 0000000000000000 c0002003fc09f908 687469772064656c
GPR08: 00000003fdc10000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000335 c00000000faa3c00 000000005c000002 0000000000000000
GPR16: 0000000000000003 0000000000000000 0000000000000008 c000000001791d18
GPR20: 00007b14d8390000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000620c180000000 0000000000002000 0000000000010000 c00020037e0f7000
GPR28: 0000000000000003 00620c1800000000 000620c180000fff c000000399675910
[ 493.376395] NIP [c00800000aafa8a4] nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.376534] LR [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia]
[ 493.376535] Call Trace:
[ 493.376675] [c000000397f73a20] [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia] (unreliable)
[ 493.376815] [c000000397f73ae0] [c00800000aafa940] nvidia_mmap+0x70/0xd0 [nvidia]
[ 493.376953] [c000000397f73b20] [c00800000aaf00ec] nvidia_frontend_mmap+0x5c/0x80 [nvidia]
[ 493.376960] [c000000397f73b40] [c000000000313230] mmap_region+0x490/0x6f0
[ 493.376961] [c000000397f73c20] [c000000000313890] do_mmap+0x400/0x4f0
[ 493.376964] [c000000397f73cb0] [c0000000002e21e4] vm_mmap_pgoff+0x114/0x160
[ 493.376965] [c000000397f73d90] [c0000000003105c4] SyS_mmap_pgoff+0xf4/0x300
[ 493.376968] [c000000397f73e10] [c000000000014d54] SyS_mmap+0x44/0x80
[ 493.376971] [c000000397f73e30] [c00000000000b184] system_call+0x58/0x6c
[ 493.376972] Instruction dump:
[ 493.376974] e8848710 38600004 48004acd 60000000 3860ffea 4bfffcd0 3860ffde 4bfffcc8
[ 493.376978] 3c620000 e8638728 489d4545 e8410018 <0fe00000> 3860fff5 4bfffb9c 3c620000
[ 493.376982] ---[ end trace 3e91fed7b8057e53 ]---
[ 493.399641] remap_4k_pfn called with wrong pfn value
[ 493.399673] ------------[ cut here ]------------
[ 493.399820] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.399821] Modules linked in: nvidia_uvm(POE) idt_89hpesx at24 ofpart opal_prd cmdlinepart powernv_flash uio_pdrv_genirq ipmi_powernv mtd vmx_crypto ibmpowernv uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport autofs4 ses enclosure scsi_transport_sas btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvidia(POE) crct10dif_vpmsum crc32c_vpmsum ipmi_devintf ipmi_msghandler tg3 aacraid
[ 493.399848] CPU: 52 PID: 7000 Comm: vectorAdd Tainted: P W OE 4.13.0-39-generic #44~16.04.1-Ubuntu
[ 493.399850] task: c000000397e92b00 task.stack: c000000397f70000
[ 493.399851] NIP: c00800000aafa8a4 LR: c00800000aafa8a0 CTR: 0000000000000350
[ 493.399852] REGS: c000000397f737a0 TRAP: 0700 Tainted: P W OE (4.13.0-39-generic)
[ 493.399853] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 493.399857] CR: 28042424 XER: 20040000
[ 493.399858] CFAR: c00000000017d700 SOFTE: 1
GPR00: c00800000aafa8a0 c000000397f73a20 c00800000ba15300 0000000000000028
GPR04: 0000000000000000 0000000000000000 c0002003fc0a05a4 687469772064656c
GPR08: 00000003fdc10000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000350 c00000000faa3c00 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 000000000000000c c000000001791d18
GPR20: 00007b14c0010000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000622000fde1000 0000000000002000 0000000000010000 c00020037e0f7000
GPR28: 0000000000000003 00622000fde10000 000622000fde1fff c000000399670af0
[ 493.400011] NIP [c00800000aafa8a4] nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.400151] LR [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia]
[ 493.400152] Call Trace:
[ 493.400292] [c000000397f73a20] [c00800000aafa8a0] nvidia_mmap_helper+0x630/0x660 [nvidia] (unreliable)
[ 493.400432] [c000000397f73ae0] [c00800000aafa940] nvidia_mmap+0x70/0xd0 [nvidia]
[ 493.400570] [c000000397f73b20] [c00800000aaf00ec] nvidia_frontend_mmap+0x5c/0x80 [nvidia]
[ 493.400575] [c000000397f73b40] [c000000000313230] mmap_region+0x490/0x6f0
[ 493.400577] [c000000397f73c20] [c000000000313890] do_mmap+0x400/0x4f0
[ 493.400579] [c000000397f73cb0] [c0000000002e21e4] vm_mmap_pgoff+0x114/0x160
[ 493.400581] [c000000397f73d90] [c0000000003105c4] SyS_mmap_pgoff+0xf4/0x300
[ 493.400584] [c000000397f73e10] [c000000000014d54] SyS_mmap+0x44/0x80
[ 493.400588] [c000000397f73e30] [c00000000000b184] system_call+0x58/0x6c
[ 493.400589] Instruction dump:
[ 493.400591] e8848710 38600004 48004acd 60000000 3860ffea 4bfffcd0 3860ffde 4bfffcc8
[ 493.400595] 3c620000 e8638728 489d4545 e8410018 <0fe00000> 3860fff5 4bfffb9c 3c620000
[ 493.400599] ---[ end trace 3e91fed7b8057e54 ]---
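
Traces like the two above are long and nearly identical, so it can help to reduce them to the lines that identify where the warning fired. Below is a minimal Python sketch, not part of the original post, that extracts the source location, function, and module from each WARNING line and counts repeats. `summarize_warnings` is a hypothetical helper name, and the regular expression assumes the WARNING line layout seen in this log; other kernel versions may format it differently.

```python
import re
from collections import Counter

# The two WARNING lines copied from the trace above.
SAMPLE_LOG = """\
[ 493.376204] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
[ 493.399820] WARNING: CPU: 52 PID: 7000 at ./arch/powerpc/include/asm/book3s/64/hash-64k.h:105 nvidia_mmap_helper+0x634/0x660 [nvidia]
"""

# Matches "WARNING: CPU: N PID: N at <file>:<line> <func>+0x.../0x... [module]".
# The trailing "[module]" is optional because built-in code has none.
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<location>\S+:\d+) "
    r"(?P<function>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_warnings(log_text):
    """Count kernel warnings by (source location, function, module)."""
    counts = Counter()
    for match in WARNING_RE.finditer(log_text):
        module = match.group("module") or "(built-in)"
        key = (match.group("location"), match.group("function"), module)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    for (location, function, module), n in summarize_warnings(SAMPLE_LOG).items():
        print(f"{n}x {function} [{module}] at {location}")
```

For a live system, the same function could be fed the text of `dmesg` instead of `SAMPLE_LOG`. For this log it shows both warnings coming from nvidia_mmap_helper in the nvidia module, at the same check in hash-64k.h.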

Robert_Crovella · May 9, 2018, 12:21am

Whatever you are running with K20c is not a supported configuration on PPC with CUDA 9/9.1.

samu_gabor · May 9, 2018, 2:12pm

Bob, is there a support matrix you could point me to for CUDA, NVIDIA GPUs and ppc64le? Given your feedback, I presume it’s K80 and above that may work on this platform?

Robert_Crovella · May 9, 2018, 2:43pm

The Linux installation guide. Read all of section 1.1 carefully:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements
