cudaFreeHost very slow for buffers > 2GB [newbie here!]

OP indicated (#15) that they are using a custom-built machine, which sort of rules out AWS, I’d say :-)

As for questions on the internet: I don’t need convincing that fewer should be asked. People were able to resolve technical issues before there ever was an internet! But I am afraid that is going against currently prevailing trends (witness Stack Overflow, to name one prominent example), so I just go with the flow.

Hi Robert, all of this is done on my (private) bare-metal machine, as a hobby really. I also work with CUDA professionally, but there we use a standard image on AWS, which of course works.

I just used “strace -tt -T -r”. It clearly shows that the time is indeed spent before the munmap:

23:54:28.995515 (+     0.355269) close(24) = 0 <0.000018>
23:54:28.995592 (+     0.000061) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc62100e30) = 0 <0.005839>
23:54:29.001491 (+     0.005905) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc62100e30) = 0 <0.020055>
23:54:29.021658 (+     0.020174) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc62100e30) = 0 <0.005238>
23:54:29.027013 (+     0.005346) ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc62101a00) = 0 <0.000064>
23:54:29.027110 (+     0.000091) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc621019c0) = 0 <0.000006>
23:54:29.027140 (+     0.000029) ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc62101a00) = 0 <0.002118>
23:54:29.029345 (+     0.002210) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc621019c0) = 0 <0.000008>
23:54:29.029386 (+     0.000035) ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc62101a00) = 0 <0.000073>
23:54:29.029483 (+     0.000096) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc621019c0) = 0 <0.000005>
23:54:29.029519 (+     0.000036) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc62101ad0) = 0 <39.732633>
23:55:08.762222 (+    39.732711) munmap(0x7f6538f4a000, 2147487744) = 0 <0.038809>
23:55:08.801166 (+     0.038946) exit_group(0) = ?
23:55:08.865858 (+     0.064697) +++ exited with 0 +++

The current state is “custom hardware + unsupported kernel”. For a systematic analysis, I’d have to downgrade to a supported version of Linux and redo the experiments to check whether the hardware works.

As for questions on the internet: Believe me, I searched and experimented for several hours before posting here (and tried to produce a minimal example in order to avoid wasting everybody’s time). I also hate the mindless “search+copy+paste+whine” approach that seems to be so prevalent these days.

So thanks for the support, I’m going to keep you posted when I learn something new. At least I didn’t make a total rookie mistake :-)

So the time is spent during the munmap, not before. [see below]

As Njuffa mentioned, munmap is expected to be an expensive call, although maybe not that expensive.

It is probably a good idea to try to reproduce the problem on a supported operating system. Taking an educated guess, you may however be able to take a shortcut by checking whether the problem persists when booting your existing unsupported installation under a supported stock kernel from kernel.org. If it doesn’t, you a) can nail it down further by checking a stock 5.0 kernel, and b) have some sort of workaround for your immediate problem.

Hmm, from the man page (and experimenting with strace), I think that this line

23:54:29.029519 (+     0.000036) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc62101ad0) = 0 <39.732633>

means that it spends 39 seconds in the call before the munmap and
this line here

23:55:08.762222 (+    39.732711) munmap(0x7f6538f4a000, 2147487744) = 0 <0.038809>

means that the call to munmap started with a relative offset of 39 seconds to the previous one (and itself took just 0.038809 seconds).
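That reading of the `-r` and `-T` columns can be checked mechanically. Here is a throwaway Python sketch over the two log lines quoted above (syscall arguments abbreviated); the variable names are just illustrative:

```python
import re

# With "strace -tt -T -r", each line has the shape:
#   <wall clock> (+ <offset relative to previous syscall>) syscall(...) = ret <time spent inside the call>
lines = [
    "23:54:29.029519 (+     0.000036) ioctl(3, ...) = 0 <39.732633>",
    "23:55:08.762222 (+    39.732711) munmap(0x7f6538f4a000, 2147487744) = 0 <0.038809>",
]

pattern = re.compile(r"\(\+\s*([\d.]+)\)\s+(\w+).*<([\d.]+)>$")
(ioctl_rel, _, ioctl_dur), (munmap_rel, _, munmap_dur) = (
    pattern.search(line).groups() for line in lines
)

# The munmap's -r offset (39.732711 s) matches the ioctl's -T duration
# (39.732633 s) almost exactly: the 39 seconds were spent inside the ioctl,
# and munmap itself took only ~0.039 s.
print(float(munmap_rel) - float(ioctl_dur))  # residue on the order of 1e-4 s
```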

Do you think that booting a different major version of the kernel really is an option? I mean, what will this do to all the modules and config files?
You are right, I could/should probably try before completely reinstalling the OS.

Oops, my bad again.

So the time is indeed spent in the driver. That makes it less likely, though still possible, that the problem will disappear with a supported kernel version.

I personally don’t install CUDA on an unsupported system if I am not prepared to install a custom-built, supported kernel and a supported compiler. While this shouldn’t be seen as a recommendation, I’ve had quite a good rate of success with that approach for most of the existence of CUDA.

Edit: Modules and other kernel-version-dependent files are themselves versioned, and installing multiple kernels in parallel is not a problem. The biggest problem in your case may be that all supported kernels are older than the one used by your distribution, so some expected functionality might not be present. Usually distributions are somewhat forgiving in that case, with just some features becoming unavailable.

Thanks, tera, for the advice!

Sorry to bother you again.
I have now downgraded to Ubuntu 18.04, which is supposed to be supported, but the problem remains:

stm@m1:/tmp$ sudo strace -tt -T -r ./2gb
...
...
     0.000104 close(24)                 = 0 <0.000007>
     0.000069 close(25)                 = 0 <0.000005>
     0.000080 clock_gettime(CLOCK_MONOTONIC_RAW, {tv_sec=756, tv_nsec=459358952}) = 0 <0.000007>
     0.000121 ioctl(6, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7ffc81eff6c0) = 0 <0.334532>
     0.334666 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7ffc81eff230) = 0 <0.008765>
     0.008850 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7ffc81eff230) = 0 <0.022592>
     0.022707 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7ffc81eff230) = 0 <0.005447>
     0.005597 ioctl(4, _IOC(0, 0, 0x42, 0), 0x7ffc81effe00) = 0 <0.000173>
     0.000251 ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffc81effdc0) = 0 <0.000009>
     0.000081 ioctl(4, _IOC(0, 0, 0x42, 0), 0x7ffc81effe00) = 0 <0.002126>
     0.002237 ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffc81effdc0) = 0 <0.000015>
     0.000115 ioctl(4, _IOC(0, 0, 0x42, 0), 0x7ffc81effe00) = 0 <0.000055>
     0.000132 ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffc81effdc0) = 0 <0.000006>
     0.000091 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc81effed0) = 0 <34.012242>
    34.012368 munmap(0x7fd2d87c2000, 2147487744) = 0 <0.049574>
     0.049719 exit_group(0)             = ?
     0.079795 +++ exited with 0 +++
stm@m1:/tmp$ uname -a
Linux m1 4.18.0-20-generic #21~18.04.1-Ubuntu SMP Wed May 8 08:43:37 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is my kernel:

stm@m1:/tmp$ uname -r
4.18.0-20-generic

Here is the software that is installed:

stm@m1:/tmp$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-common-430/bionic,bionic,now 430.14-0ubuntu0~gpu18.04.1 all [installed,automatic]
libnvidia-compute-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-decode-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-encode-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-fbc1-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-gl-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
libnvidia-ifr1-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-compute-utils-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-cuda-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-cuda-doc/bionic,bionic,now 9.1.85-3ubuntu1 all [installed,automatic]
nvidia-cuda-gdb/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1 amd64 [installed]
nvidia-dkms-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-driver-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed]
nvidia-kernel-common-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-kernel-source-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-opencl-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-prime/bionic-updates,bionic-updates,now 0.8.8.2 all [installed,automatic]
nvidia-profiler/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-settings/bionic,now 418.56-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-utils-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]
nvidia-visual-profiler/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-430/bionic,now 430.14-0ubuntu0~gpu18.04.1 amd64 [installed,automatic]

So maybe there is something wrong with my hardware/setup?

Do you have any recommendations on how to approach the problem?

As Robert Crovella pointed out in #13, if you can reproduce the issue on a supported platform, it may be a good idea to file a bug. It looks like you haven’t quite matched the officially supported configuration yet (see Linux Installation Guide).

I don’t recall ever coming across this type of issue before, so other than wild guesses, I cannot offer advice. It may be a software issue with the NVIDIA driver stack triggered by specifics of your software environment (I will spare you my standard rant about Ubuntu), it may be some weird interaction between NVIDIA software and AMD processors, it may be demons that need to be exorcised :-)

The reason it is important to use only supported software components as specified in the Linux Installation Guide when submitting a bug is that the first thing NVIDIA has to do is independently reproduce your observations in-house, and they are not equipped to handle random mix-and-match scenarios.

Thanks njuffa for pointing out this guide. I’m going to follow it as closely as I can.

Good news, I suppose:

I just reinstalled everything from scratch and now the problem has disappeared.
It is the same kernel, and the Ubuntu packages, driver version, and toolkit version are all the same.

     0.000086 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe2f8b19f0) = 0 <0.035730>
     0.035852 munmap(0x7fe9981e6000, 2145390592) = 0 <0.047807>

Thanks everyone here.

Sorry to resurrect this old thread, but this problem has reared its head for me, with slightly different symptoms. I’m running Arch Linux on a system with 64 GB of RAM, an AMD Threadripper 2990WX, two 1080 Tis, and a Titan RTX. My driver version is 435.21 and I’m running CUDA 10.1. Running the original test script that was posted here, I get the following:

ALLOC_HOST   SizeMB=100 Duration=0.0294687
FREE_HOST    SizeMB=100 Duration=0.0118426
ALLOC_HOST   SizeMB=350 Duration=0.100762
FREE_HOST    SizeMB=350 Duration=0.040332
ALLOC_HOST   SizeMB=600 Duration=0.172851
FREE_HOST    SizeMB=600 Duration=0.068587
ALLOC_HOST   SizeMB=850 Duration=0.247169
FREE_HOST    SizeMB=850 Duration=0.0968949
ALLOC_HOST   SizeMB=1100 Duration=0.320535
FREE_HOST    SizeMB=1100 Duration=0.124565
ALLOC_HOST   SizeMB=1350 Duration=0.395487
FREE_HOST    SizeMB=1350 Duration=0.151575
ALLOC_HOST   SizeMB=1600 Duration=0.464366
FREE_HOST    SizeMB=1600 Duration=0.179427
ALLOC_HOST   SizeMB=1850 Duration=0.541683
FREE_HOST    SizeMB=1850 Duration=0.211357
ALLOC_HOST   SizeMB=2100 Duration=0.617531
FREE_HOST    SizeMB=2100 Duration=38.8949
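(For anyone without a GPU handy who wants to see the shape of that test: here is a hypothetical Python stand-in, with anonymous `mmap` replacing `cudaHostAlloc`/`cudaFreeHost`. It demonstrates the measurement loop only, not the pinned-memory cost itself; the function name and sizes are just illustrative.)

```python
import mmap
import time

# Hypothetical stand-in for the timing loop behind the ALLOC_HOST/FREE_HOST
# output above: plain anonymous mmap replaces cudaHostAlloc/cudaFreeHost.
def time_alloc_free(size_mb):
    size = size_mb * 1024 * 1024
    t0 = time.perf_counter()
    buf = mmap.mmap(-1, size)   # stand-in for cudaHostAlloc(&ptr, size, 0)
    buf.write(b"\0" * 4096)     # touch at least one page
    t_alloc = time.perf_counter() - t0

    t0 = time.perf_counter()
    buf.close()                 # stand-in for cudaFreeHost(ptr)
    t_free = time.perf_counter() - t0
    return t_alloc, t_free

for size_mb in (100, 350, 2100):
    t_alloc, t_free = time_alloc_free(size_mb)
    print(f"ALLOC_HOST   SizeMB={size_mb} Duration={t_alloc:.6f}")
    print(f"FREE_HOST    SizeMB={size_mb} Duration={t_free:.6f}")
```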

Running the exact same strace command that was referenced earlier in this thread, I got slightly different output:

13:28:42.814088 (+     0.000090) mmap(NULL, 167768064, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f43a4001000 <0.000009>
13:28:42.814119 (+     0.000031) munmap(0x7f43a4001000, 33550336) = 0 <0.000009>
13:28:42.814150 (+     0.000030) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffcf42c7d70) = 0 <0.000156>
13:28:42.814335 (+     0.000184) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffcf42c7790) = 0 <0.000081>
13:28:42.814824 (+     0.000489) brk(0x556fad7ba000) = 0x556fad7ba000 <0.000008>
13:28:42.815126 (+     0.000302) brk(0x556fad7e5000) = 0x556fad7e5000 <0.000008>
13:28:42.815250 (+     0.000123) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffcf42c85d0) = 0 <0.007582>
13:28:42.822882 (+     0.007638) pipe2([28, 29], O_CLOEXEC) = 0 <0.000016>
13:28:42.822938 (+     0.000051) fcntl(28, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000009>
13:28:42.822975 (+     0.000036) fcntl(29, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000008>
13:28:42.823008 (+     0.000032) write(17, "\253", 1) = 1 <0.000009>
13:28:42.823067 (+     0.000059) openat(AT_FDCWD, "/root/.nv/ComputeCache/index", O_RDWR) = 30 <0.002307>
13:28:42.825417 (+     0.002355) fcntl(30, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 <0.000016>
13:28:42.825473 (+     0.000051) fstat(30, {st_mode=S_IFREG|0644, st_size=336, ...}) = 0 <0.000019>
13:28:42.825522 (+     0.000048) lseek(30, 0, SEEK_SET) = 0 <0.000012>
13:28:42.825557 (+     0.000034) read(30, "A\0\0\0\24\0\0\0006l#\0\0\0\0\0`AB\223\1\275\20,\341\7\0\0\0\0\0\0"..., 4096) = 336 <0.000127>
13:28:42.825711 (+     0.000154) openat(AT_FDCWD, "/root/.nv/ComputeCache/f/1/cb033695b3a10f", O_RDONLY) = 31 <0.000228>
13:28:42.825966 (+     0.000255) fstat(31, {st_mode=S_IFREG|0644, st_size=819, ...}) = 0 <0.000008>
13:28:42.825999 (+     0.000033) read(31, "A\0\0\0W\0\0\0\0\0\0\0\300\2\0\0\0\0\0\0\255\247g\350?|n\262Aug "..., 4096) = 819 <0.000118>
13:28:42.826143 (+     0.000143) fcntl(30, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 <0.000007>
13:28:42.826173 (+     0.000030) close(30) = 0 <0.000007>
13:28:42.826199 (+     0.000025) close(31) = 0 <0.000006>
13:28:42.826258 (+     0.000059) ioctl(8, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7ffcf42c8270) = 0 <0.420093>
13:28:43.246424 (+     0.420172) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffcf42c7de0) = 0 <0.004230>
13:28:43.250691 (+     0.004263) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffcf42c7de0) = 0 <0.019436>
13:28:43.270190 (+     0.019507) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffcf42c7de0) = 0 <0.004325>
13:28:43.274671 (+     0.004470) exit_group(0) = ?
13:29:21.686553 (+    38.411895) +++ exited with 0 +++

It doesn’t seem to hang on or at the munmap, but rather when exiting the program (note the 38-second offset between exit_group and the final exit in the trace above). Any suggestions on how to proceed with debugging this?