cudaFreeHost very slow for buffers > 2GB [newbie here!]

Please be aware that I’m a total CUDA newbie and may be doing something completely stupid.

I’m running these experiments on Linux (Ubuntu 19.04) using this CUDA version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

My test program is very simple: it allocates and frees larger and larger amounts of host memory using cudaMallocHost and cudaFreeHost. I noticed that around 2GB, cudaFreeHost gets extremely slow.

My machine is equipped with a GeForce RTX 2070 and runs driver version 418.56.
The behavior is the same when I run the code as the root user and when I raise the ulimit for locked memory.

Any help here would be greatly appreciated!

Thanks a lot
Stefan

Here is my test code:

#include <iostream>
#include <cstdio>   // fprintf
#include <cstdlib>  // exit
#include "cuda_runtime.h"
void checkError(cudaError_t result, char const *const func, const char *const file, int const line) 
{
	if (result != cudaSuccess) {
		// Check error and exit applications.
		// Only use this in test applications.
		fprintf(stderr, "CUDA error at %s:%d code=%d \"%s\" \n", file, line, static_cast<unsigned int>(result), func);
		exit(EXIT_FAILURE);
	}
}
#define checkCudaErrors(val) checkError((val), #val, __FILE__, __LINE__)

int main(void)
{

	for (size_t size_mb = 100; size_mb <= 2000+100; size_mb += 250) {
		size_t const size_kb = size_mb * 1024;
		size_t const size_b  = size_kb * 1024;

		bool use_gpu_host_alloc = true;
		bool do_host_alloc = true;
		bool do_device_alloc = false;

		cudaEvent_t start, stop;
		void * host;
		void * device;
    
		// Initialize timer events and start time (record start timer event).
		checkCudaErrors(cudaEventCreate(&start));
		checkCudaErrors(cudaEventCreate(&stop));
		float elapsed_time = 0.0f;

		if (do_host_alloc) {
			checkCudaErrors(cudaEventRecord(start, NULL));
			checkCudaErrors(cudaEventSynchronize(start));
			if (use_gpu_host_alloc) {
				checkCudaErrors(cudaMallocHost(&host, size_b, cudaHostAllocPortable));
			} else {
				host = malloc(size_b);
			}
			checkCudaErrors(cudaEventRecord(stop, NULL));
			checkCudaErrors(cudaEventSynchronize(stop));
			elapsed_time = 0.0f;
			checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start, stop));
			std::cerr << "ALLOC_HOST   SizeMB=" << size_mb << " Duration=" << elapsed_time / 1000.0
				  << std::endl;

			checkCudaErrors(cudaEventRecord(start, NULL));
			checkCudaErrors(cudaEventSynchronize(start));
			if (use_gpu_host_alloc) {
				checkCudaErrors(cudaFreeHost(host));
			} else {
				free(host);
			}
			checkCudaErrors(cudaEventRecord(stop, NULL));
			checkCudaErrors(cudaEventSynchronize(stop));
			elapsed_time = 0.0f;
			checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start, stop));
			std::cerr << "FREE_HOST    SizeMB=" << size_mb << " Duration=" << elapsed_time / 1000.0
				  << std::endl;

		}
		if (do_device_alloc) {
			checkCudaErrors(cudaEventRecord(start, NULL));
			checkCudaErrors(cudaEventSynchronize(start));
			checkCudaErrors(cudaMalloc(&device, size_b));
			checkCudaErrors(cudaEventRecord(stop, NULL));
			checkCudaErrors(cudaEventSynchronize(stop));
			elapsed_time = 0.0f;
			checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start, stop));
			std::cerr << "ALLOC_DEVICE SizeMB=" << size_mb << " Duration=" << elapsed_time / 1000.0
				  << std::endl;

			checkCudaErrors(cudaEventRecord(start, NULL));
			checkCudaErrors(cudaEventSynchronize(start));
			checkCudaErrors(cudaFree(device));
			checkCudaErrors(cudaEventRecord(stop, NULL));
			checkCudaErrors(cudaEventSynchronize(stop));
			elapsed_time = 0.0f;
			checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start, stop));
			std::cerr << "FREE_DEVICE  SizeMB=" << size_mb << " Duration=" << elapsed_time / 1000.0
				  << std::endl << std::endl;
		}
	} 
    
	checkCudaErrors(cudaDeviceReset());
  
	return 0;
}

And here is the output of the program on my machine:

$ ./a.out 
ALLOC_HOST   SizeMB=100 Duration=0.0283465
FREE_HOST    SizeMB=100 Duration=0.00904362
ALLOC_HOST   SizeMB=350 Duration=0.086401
FREE_HOST    SizeMB=350 Duration=0.0304353
ALLOC_HOST   SizeMB=600 Duration=0.143823
FREE_HOST    SizeMB=600 Duration=0.0518581
ALLOC_HOST   SizeMB=850 Duration=0.205972
FREE_HOST    SizeMB=850 Duration=0.0725446
ALLOC_HOST   SizeMB=1100 Duration=0.263455
FREE_HOST    SizeMB=1100 Duration=0.0935576
ALLOC_HOST   SizeMB=1350 Duration=0.320504
FREE_HOST    SizeMB=1350 Duration=0.114782
ALLOC_HOST   SizeMB=1600 Duration=0.38255
FREE_HOST    SizeMB=1600 Duration=0.134775
ALLOC_HOST   SizeMB=1850 Duration=0.440723
FREE_HOST    SizeMB=1850 Duration=0.15505
ALLOC_HOST   SizeMB=2100 Duration=0.493754
FREE_HOST    SizeMB=2100 Duration=36.0846

Please note the 36 seconds that the last FREE_HOST needs!

If I use native malloc instead of cudaMallocHost, this effect is not observed.
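For reference, the same two calls can also be timed with a plain CPU clock instead of CUDA events. Here is a minimal cross-check sketch using std::chrono (the 2100 MB size is just the last step of the loop above, and the clock choice is an arbitrary assumption):

#include <chrono>
#include <cstdio>
#include "cuda_runtime.h"

int main()
{
	// Cross-check: time one large pinned alloc/free with the host's steady clock.
	size_t const size_b = 2100ull * 1024ull * 1024ull;  // assumed size, matches the last loop step
	void *host = nullptr;

	auto t0 = std::chrono::steady_clock::now();
	cudaError_t err = cudaMallocHost(&host, size_b);
	auto t1 = std::chrono::steady_clock::now();
	if (err != cudaSuccess) { fprintf(stderr, "cudaMallocHost failed: %d\n", (int)err); return 1; }

	auto t2 = std::chrono::steady_clock::now();
	err = cudaFreeHost(host);
	auto t3 = std::chrono::steady_clock::now();
	if (err != cudaSuccess) { fprintf(stderr, "cudaFreeHost failed: %d\n", (int)err); return 1; }

	printf("alloc: %.3f s  free: %.3f s\n",
	       std::chrono::duration<double>(t1 - t0).count(),
	       std::chrono::duration<double>(t3 - t2).count());
	return 0;
}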

Ubuntu 19.04 isn’t a supported environment for CUDA development for any current version of CUDA up through 10.1 Update 1.

How much host memory do you have?

TL;DR: I am not able to reproduce your observation readily.

When I run your code on a RHEL 7 system with 128GB of host memory, 410.79/CUDA 10.0, I get this output:

$ ./t472
ALLOC_HOST   SizeMB=100 Duration=0.0425731
FREE_HOST    SizeMB=100 Duration=0.0158759
ALLOC_HOST   SizeMB=350 Duration=0.147119
FREE_HOST    SizeMB=350 Duration=0.0547848
ALLOC_HOST   SizeMB=600 Duration=0.252625
FREE_HOST    SizeMB=600 Duration=0.0822702
ALLOC_HOST   SizeMB=850 Duration=0.312747
FREE_HOST    SizeMB=850 Duration=0.111664
ALLOC_HOST   SizeMB=1100 Duration=0.406141
FREE_HOST    SizeMB=1100 Duration=0.144883
ALLOC_HOST   SizeMB=1350 Duration=0.498909
FREE_HOST    SizeMB=1350 Duration=0.177601
ALLOC_HOST   SizeMB=1600 Duration=0.59509
FREE_HOST    SizeMB=1600 Duration=0.217494
ALLOC_HOST   SizeMB=1850 Duration=0.68403
FREE_HOST    SizeMB=1850 Duration=0.24336
ALLOC_HOST   SizeMB=2100 Duration=0.777114
FREE_HOST    SizeMB=2100 Duration=0.276707

When I run your code on a Ubuntu 16.04 system with 160GB of host memory, CUDA 10.0, 418.40, I get this output:

$ ./t1
ALLOC_HOST   SizeMB=100 Duration=0.0465029
FREE_HOST    SizeMB=100 Duration=0.0125061
ALLOC_HOST   SizeMB=350 Duration=0.158752
FREE_HOST    SizeMB=350 Duration=0.0422543
ALLOC_HOST   SizeMB=600 Duration=0.269881
FREE_HOST    SizeMB=600 Duration=0.0643942
ALLOC_HOST   SizeMB=850 Duration=0.383459
FREE_HOST    SizeMB=850 Duration=0.0916009
ALLOC_HOST   SizeMB=1100 Duration=0.494459
FREE_HOST    SizeMB=1100 Duration=0.118175
ALLOC_HOST   SizeMB=1350 Duration=0.609286
FREE_HOST    SizeMB=1350 Duration=0.145576
ALLOC_HOST   SizeMB=1600 Duration=0.718573
FREE_HOST    SizeMB=1600 Duration=0.171915
ALLOC_HOST   SizeMB=1850 Duration=0.835
FREE_HOST    SizeMB=1850 Duration=0.199279
ALLOC_HOST   SizeMB=2100 Duration=0.94283
FREE_HOST    SizeMB=2100 Duration=0.226092

Thanks Robert for your response.
My machine has 64GB of RAM:

$ cat /proc/meminfo 
MemTotal:       65951172 kB
MemFree:        59964384 kB
MemAvailable:   61924804 kB
Buffers:          239684 kB
Cached:          2561100 kB
SwapCached:            0 kB
Active:          3676756 kB
Inactive:        1439800 kB
Active(anon):    2324468 kB
Inactive(anon):   352796 kB
Active(file):    1352288 kB
Inactive(file):  1087004 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:               276 kB
Writeback:             0 kB
AnonPages:       2316108 kB
Mapped:           817952 kB
Shmem:            361488 kB
KReclaimable:     226280 kB
Slab:             456892 kB
SReclaimable:     226280 kB
SUnreclaim:       230612 kB
KernelStack:       16880 kB
PageTables:        40580 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    35072732 kB
Committed_AS:   10709996 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:            13184 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     2780312 kB
DirectMap2M:     6576128 kB
DirectMap1G:    58720256 kB

So host RAM should be plenty for this test.

Do you think the problem is more hardware- or software-related?

There is no involvement with hardware other than host memory here, that I know of. I’m reasonably sure that whatever the issue is, it is software related.

Thanks Robert! That helps a lot.

I read somewhere that cudaMallocHost uses mmap under the hood.
Do you happen to know where the mapped file is?
Maybe there is a configuration problem on my machine.

Do you have any other ideas, or should I just try to (somehow) downgrade my machine? I think there is an Ubuntu 18.04 image on AWS that seems to work with CUDA.

Plus the host CPU, which manipulates the host memory control structures, mostly in serial, non-parallelized fashion. Therefore, this code should run faster on a faster CPU, in particular one with higher single-thread performance. The speed of the system memory could be a secondary influence (if memory serves, most of the control structures are cacheable, though).

cudaMallocHost() is a thin wrapper around OS API calls. So as far as software is concerned, performance boils down to the operating system, its configuration, and its current state (so an apples-to-apples comparison might want to use idling freshly-booted systems, for example). I have no experience with the impact of virtualization environments on these operations, but would expect measurable negative impact.

strace or a similar system-level tracing facility should be able to show you the OS API calls that result from a call to cudaMallocHost().
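As a rough illustration only (this is not the actual driver implementation, just an OS-level approximation), pinning host memory comes down to primitives along the lines of mmap plus mlock, and releasing it to munlock plus munmap. A minimal sketch that times those primitives directly, assuming the locked-memory ulimit is large enough:

#include <sys/mman.h>
#include <chrono>
#include <cstdio>

int main()
{
        // Approximate a pinned allocation at the OS level: anonymous mapping + mlock.
        // Needs a sufficiently large locked-memory limit ("ulimit -l") or root.
        size_t const size_b = 2100ull * 1024ull * 1024ull;

        auto t0 = std::chrono::steady_clock::now();
        void *p = mmap(nullptr, size_b, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        if (mlock(p, size_b) != 0) { perror("mlock"); return 1; }  // faults in and pins the pages
        auto t1 = std::chrono::steady_clock::now();

        auto t2 = std::chrono::steady_clock::now();
        if (munlock(p, size_b) != 0) { perror("munlock"); return 1; }
        if (munmap(p, size_b) != 0) { perror("munmap"); return 1; }
        auto t3 = std::chrono::steady_clock::now();

        printf("map+lock: %.3f s  unlock+unmap: %.3f s\n",
               std::chrono::duration<double>(t1 - t0).count(),
               std::chrono::duration<double>(t3 - t2).count());
        return 0;
}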

Thanks for your help!
I still could not solve the problem.
Here are two minimal versions of the test.

This one allocates slightly less than 2GB of RAM and works fine:

#include <cstdio>
#include <cstdlib>  // exit
#include "cuda_runtime.h"
#include <sys/mman.h>
void checkError(cudaError_t result, char const *const func, const char *const file, int const line) 
{
        if (result != cudaSuccess) {
                // Check error and exit applications.
                // Only use this in test applications.
                fprintf(stderr, "CUDA error at %s:%d code=%d \"%s\" \n", file, line, static_cast<unsigned int>(result), func);
                exit(EXIT_FAILURE);
        }
}
#define checkCudaErrors(val) checkError((val), #val, __FILE__, __LINE__)

int main(void)
{
        size_t memsize = 2*1023ull*1024ull*1024ull; 
        void * p = malloc(memsize);
        if (p==nullptr) {
                perror("malloc failed");
        }
        checkCudaErrors(cudaHostRegister(p, memsize, 0));
        return 0;
}

This one hangs for 35 seconds:

#include <cstdio>
#include <cstdlib>  // exit
#include "cuda_runtime.h"
#include <sys/mman.h>
void checkError(cudaError_t result, char const *const func, const char *const file, int const line) 
{
        if (result != cudaSuccess) {
                // Check error and exit applications.
                // Only use this in test applications.
                fprintf(stderr, "CUDA error at %s:%d code=%d \"%s\" \n", file, line, static_cast<unsigned int>(result), func);
                exit(EXIT_FAILURE);
        }
}
#define checkCudaErrors(val) checkError((val), #val, __FILE__, __LINE__)

int main(void)
{
        size_t memsize = 2*1024ull*1024ull*1024ull; // hangs for 35 seconds
        void * p = malloc(memsize);
        if (p==nullptr) {
                perror("malloc failed");
        }
        checkCudaErrors(cudaHostRegister(p, memsize, 0));
        return 0;
}

Here is the strace output for the version with less than 2GB: https://pastebin.com/77s6mXPy

And here for the version that hangs: https://pastebin.com/UUVvv3ju

BTW, I performed these experiments after a reboot.

Run the code under strace -tt so we can see where the time is spent

Thanks for the hint, and sorry for the late reply (the forum was not accessible for a while).
Anyhow, here are the last few lines of the strace -tt output:

...
...
...
22:38:47.923818 close(24)               = 0
22:38:47.923842 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
22:38:47.923872 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 24
22:38:47.923903 fcntl(24, F_SETFD, FD_CLOEXEC) = 0
22:38:47.923928 ioctl(6, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7ffc2197bde0) = 0
22:38:48.260125 close(24)               = 0
22:38:48.260218 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc2197b950) = 0
22:38:48.265659 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc2197b950) = 0
22:38:48.285635 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc2197b950) = 0
22:38:48.291116 exit_group(0)           = ?
22:39:24.083613 +++ exited with 0 +++

So it appears that the 36 seconds are spent after the call to exit. Hmm, I’m puzzled.
Could it be a driver problem?

When I explicitly call cudaHostUnregister, on the other hand, like this:

int main(void)
{
        size_t memsize = 2*1024ull*1024ull*1024ull; // hangs for 35 seconds
        void * p = malloc(memsize);
        if (p==nullptr) {
                perror("malloc failed");
        }
        checkCudaErrors(cudaHostRegister(p, memsize, 0));
        checkCudaErrors(cudaHostUnregister(p));
        free(p);
        return 0;
}

the strace output shows where the time is spent:

...
...
...
22:48:05.797338 close(24)               = 0
22:48:05.797413 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc3effc3b0) = 0
22:48:05.803038 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc3effc3b0) = 0
22:48:05.823039 ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc3effc3b0) = 0
22:48:05.830216 ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc3effcf80) = 0
22:48:05.830321 ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc3effcf40) = 0
22:48:05.830358 ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc3effcf80) = 0
22:48:05.832869 ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc3effcf40) = 0
22:48:05.832902 ioctl(4, _IOC(_IOC_NONE, 0, 0x42, 0), 0x7ffc3effcf80) = 0
22:48:05.832990 ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc3effcf40) = 0
22:48:05.833025 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc3effd050) = 0
22:48:41.335864 munmap(0x7f2e47317000, 2147487744) = 0
22:48:41.377335 exit_group(0)           = ?
22:48:41.423861 +++ exited with 0 +++

I am not sure what to make of that snippet. Maybe Robert Crovella can spot something. This may be beside the point, but are you running the persistence daemon (https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon)?

Can you tell us a bit about the machine specifications? Is this a home-brew system, or a system you bought fully configured from a vendor? If the latter, what vendor and model? If the former, single or dual socket, what kind of CPU, what kind of system memory (# DRAM channels, DRAM speed). What kind of mass storage is used: HDD or SSD? If the latter, NVMe?

munmap is an OS call. So your OS is taking a long time to unmap the previous (2147487744 byte) allocation. This also explains why the previous test showed the time consumption “at the end”.

If I’m reading that strace output correctly, that is unlikely to be a NVIDIA issue, driver or otherwise. That’s just my opinion, of course.

You’re welcome to file a bug according to the instructions linked in a sticky post at the top of this forum. If you’re still working on Ubuntu 19.04 then that is potentially an issue.

Which file does descriptor 3 belong to? It would be interesting if it belongs to the NVIDIA driver.

Thanks for so much help, you are really great!

Robert: I think the last strace output shows that the time is actually spent before the call to munmap.

tera: If I call cudaHostUnregister() before, then it hangs there (see the last strace output above … it hangs before the munmap).

njuffa: It is a custom (consumer) system. The specs are as follows:

  • AMD Ryzen 7 2700X Eight-Core Processor
  • 4 DRAM Channels (Size: 16384 MB each)
  • DRAM speed is interesting: the configured speed seems to be only half of the possible speed:
$ sudo dmidecode --type 17
(just one output)
Handle 0x0035, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x002E
	Error Information Handle: 0x0034
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_A1
	Bank Locator: BANK 0
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2133 MT/s
	Manufacturer: Corsair
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMW64GX4M4A2666C16  
	Rank: 2
	Configured Memory Speed: 1067 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
  • It is an NVMe SSD

tera: I think the file descriptor 3 at the end is created by this call:

openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 3

Trawling the internet tells me: (1) munmap is slow, much slower than mmap; (2) munmap slowness increases [linearly?] with the size of the allocation; (3) slowness increases with the size of the machine, in particular the number of CPUs, as TLBs need to be invalidated; (4) at least on older Linux versions, munmap causes a “memory mapping” lock to be taken for a long time, which can cause system-wide performance issues.

One way forward here may be to set up a small mmap / munmap benchmark app, and discuss any noteworthy findings in a Linux-centric forum. While the makers of Ubuntu often “Think different” from the makers of other distros, I can find no evidence of Ubuntu-specific munmap performance issues in my searches, so I would expect such problems to be generic.
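A minimal sketch of such a benchmark might look like this (the size sweep and the first-touch memset are arbitrary choices; the memset is there so that munmap has populated pages to tear down):

#include <sys/mman.h>
#include <chrono>
#include <cstdio>
#include <cstring>

int main()
{
        // Time mmap, first touch, and munmap for growing anonymous mappings (no CUDA involved).
        for (size_t size_mb = 512; size_mb <= 4096; size_mb *= 2) {
                size_t const size_b = size_mb * 1024ull * 1024ull;

                auto t0 = std::chrono::steady_clock::now();
                void *p = mmap(nullptr, size_b, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                auto t1 = std::chrono::steady_clock::now();

                memset(p, 1, size_b);  // touch every page so the mapping is actually populated
                auto t2 = std::chrono::steady_clock::now();

                if (munmap(p, size_b) != 0) { perror("munmap"); return 1; }
                auto t3 = std::chrono::steady_clock::now();

                printf("SizeMB=%zu mmap=%.4fs touch=%.4fs munmap=%.4fs\n", size_mb,
                       std::chrono::duration<double>(t1 - t0).count(),
                       std::chrono::duration<double>(t2 - t1).count(),
                       std::chrono::duration<double>(t3 - t2).count());
        }
        return 0;
}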

I don’t agree with that, but I’ve been proven wrong before. If you’d like to prove me wrong, I suggest studying the strace manpage and considering use of the -r switch instead of or in addition to the -tt switch.

https://linux.die.net/man/1/strace

The -t switch timestamps the printout. The printout is not generated (AFAIK) until the function returns, because the function return value (= 0) is included in the printout.

It’s possible I’m wrong, of course, but the -r or other strace switch may help to clarify.

Yeah, sorry for the wrong flag. I believe what we actually want is “strace -tt -T”.

For whatever reason, my tests of the provided code don’t show anything anywhere near the 36-second behavior. So I think what’s appropriate here is just a basic QA process: find out how to reproduce the observation, then do careful diff analysis to figure out what the key difference is that triggers this behavior.

Anyone can do that of course, with sufficient time and patience. Repetitive asking of questions on the internet is not the way to tackle this IMO, but that may be just a statement of my own personal style rather than anything that has merit. I’m not going to argue the point.

The first thing I would try, given more time, would be to exactly duplicate the observation. So far no luck, but I haven’t even tried Ubuntu 19.04 yet.

Coupled with that, from an end user’s perspective, the fact that 19.04 isn’t officially supported ought to strongly motivate the OP to go and try a supported config before doing anything else. That’s probably a good idea even if they weren’t having this issue. But again, just my $0.02. Bug reports are always welcomed.

By the way, is this testing being done on AWS? I noticed a mention of AWS earlier, but also a mention of testing with a Geforce RTX card, and I couldn’t align the two comments. I have assumed so far that this is all being done on a bare metal scenario.