cudaFreeHost consistently 20x slower than free/cudaFree (full runnable example code available)

The code file is attached, comments about the code at the end.

I’m making a program to compute the pairwise correlation of the rows of a matrix.
Input: Matrix A size MxN
Output: Matrix res size MxM, where res[i,j] = Correlation(A[i,:], A[j,:])

On my NVIDIA A30, the benchmark results are:

  1. Using pinned memory:
    CPU total cost = 9640.08
    Malloc cost = 0.404802ms
    CPU to GPU cost = 9.9302ms
    Kernel cost = 61.7759ms
    GPU to CPU cost = 54.9704ms
    Free memory cost = 18.4563ms
    GPU total cost = 145.62

  2. Using normal memory:
    CPU total cost = 9693.36
    Malloc cost = 0.524011ms
    CPU to GPU cost = 10.5241ms
    Kernel cost = 61.9264ms
    GPU to CPU cost = 81.8329ms
    Free memory cost = 0.720419ms
    GPU total cost = 155.602

Host/device data transfer is faster with pinned memory, as expected. However, cudaFreeHost always costs around 18.25ms, which takes up a noticeable 12% of the total cost. Why is it so slow, and how can I improve it?


  1. Build command: nvcc -o main -std=c++17 -O3 -gencode=arch=compute_80,code=sm_80
  2. The lines that contain cudaHostAlloc and cudaFreeHost are at lines 260-264 and 282-283. Just uncomment the version you want to test.
  3. I know that using vector<vector<float>> to represent a matrix is bad, and I should use a continuous buffer in memory. But the API constraints require me to return a vector<vector<float>>, so ignore that.

Edit: the forum only allows new users to attach 1 file. So here is “my_timer.h”

#pragma once
#include <chrono>

class MyTimer {
    std::chrono::time_point<std::chrono::system_clock> start;

public:
    void startCounter() {
        start = std::chrono::system_clock::now();
    }

    int64_t getCounterNs() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count();
    }

    int64_t getCounterMs() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
    }

    double getCounterMsPrecise() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count()
                / 1000000.0;
    }
};

Someone from NVIDIA may be able to give an authoritative answer. Here is my understanding of the situation:

cudaHostAlloc needs to provide page-locked (pinned) pages for the allocation, which takes a lot of effort on the part of the operating system, especially for large allocations.

As far as I am aware (my knowledge is limited), the time spent in cudaHostAlloc is almost entirely spent in system calls like mmap and ioctl. You could try and use a system-level tracing facility like strace to look at the details of that.

I am not clear on why freeing the allocated memory is also slow. My hypothesis is that this uses munmap calls that are just as expensive as mmap, for reasons unknown to me but likely known to operating-system experts.

Generally speaking, dynamic allocation and de-allocation of memory should be minimized. The cost inside CUDA has been consistently high over the years, even though there have been indications that NVIDIA has spent some effort trying to reduce the overhead. You may want to examine whether pinned allocations are necessary in the context of your use case. While they can provide higher host<->device transfer throughput, I would expect this speedup to be moderate for modern workstation and server platforms (which have decent system memory throughput compared to the platforms available early in the evolution of CUDA).

Generally speaking, the performance of CUDA APIs that are heavily dependent on operating-system calls like these scales largely with the single-thread performance of the CPU. I would recommend using CPUs with a base frequency of 3.5 GHz or more.

These are screenshots from nsight-systems for cudaMallocHost, malloc, as well as std::vector. Your original code does not compile because “my_timer.h” is missing.

Reported device-to-host transfer speeds are 13.1 GiB/s, 2 GiB/s, and 11 GiB/s, respectively.
I added the test with std::vector because the transfer speed with plain malloc was unreasonably low. I suspect that in the case of std::vector the memory is initialized so the page tables for the buffer can be set up while the kernel is running (note the gap between kernel launch and cudaDeviceSynchronize). With malloc this probably happens during the transfer.

Oh, I forgot to attach that. I’ve added “my_timer.h” code to the post because it doesn’t allow me to upload another file

cudaFreeHost is expected to be slower than free. It is doing extra work. A full description of the extra work is not provided by NVIDIA, but can be inspected using a utility like strace. Among other things, it is manipulating CPU page tables and requesting the OS to change parameters of the pages that were previously mapped.

There isn’t anything you can do to make cudaFreeHost run faster. If 18.25ms is 12% of total cost, then you’re indicating the total cost is on the order of 100ms. If that is the sum total of all the work you intend to do on the GPU, you are probably wasting your time using CUDA.

However if that 100ms is part of some larger workflow, then you may wish to do the following:

  • don’t do allocation/free at each step. Instead, allocate buffers once, at the beginning of your application, and free them once, at the end.

  • to the extent possible, re-use a pinned buffer. You already know that pinning improves transfer time. As much as possible, re-use the buffers for different data items, rather than providing a buffer for each.

  • reduce the size of the buffers to the minimum necessary/practical to achieve a particular benefit. In my experience the time cost of cudaHostAlloc and cudaFreeHost has a linear component related to the size of the allocation.
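As an illustration of the first two bullets, a minimal sketch of an allocate-once, reuse-everywhere pinned staging buffer (illustrative names; not code from the attached file, and error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// One pinned staging buffer, allocated once at startup and reused for
// every host<->device transfer, instead of paying for
// cudaHostAlloc/cudaFreeHost on every step.
struct PinnedStaging {
    float*      host = nullptr;
    std::size_t capacity = 0;  // in elements

    void init(std::size_t maxElems) {  // call once at application startup
        capacity = maxElems;
        cudaHostAlloc(&host, capacity * sizeof(float), cudaHostAllocDefault);
    }
    void destroy() {                   // call once at application shutdown
        cudaFreeHost(host);
        host = nullptr;
    }
};

// Example step: the same pinned buffer serves every device-to-host copy.
// Caller guarantees elems <= stage.capacity; no allocation happens here.
void copyResultToHost(PinnedStaging& stage, const float* dRes, std::size_t elems) {
    cudaMemcpy(stage.host, dRes, elems * sizeof(float), cudaMemcpyDeviceToHost);
}
```

This moves the expensive pin/unpin work out of the per-step path entirely; each step only pays for the memcpy itself.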

It’s certainly possible that none of these suggestions are useful for you. There isn’t anything you can do to reduce the cost of a specific cudaFreeHost call.

Hmm I guess the only solution is to use a custom memory allocator like your first suggestion.
Thanks for your help!

Anyway, ~100ms is the cost on the GPU; on the CPU the task actually takes 12 seconds. It's a really big improvement.