The code file is attached; comments about the code are at the end.
I’m making a program to compute the pairwise correlation of the rows of a matrix.
Input: matrix A of size MxN.
Output: matrix res of size MxM, where res[i,j] = Correlation(A[i,:], A[j,:]).
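For concreteness, here is a minimal CPU sketch of the per-pair computation, assuming "Correlation" means the Pearson correlation coefficient (the function name and exact formulation are illustrative, not taken from the attached file):

```cpp
#include <cmath>
#include <vector>

// Pearson correlation of two rows x and y of length N (assumed definition).
float correlation(const std::vector<float>& x, const std::vector<float>& y) {
    const size_t n = x.size();
    float mean_x = 0.f, mean_y = 0.f;
    for (size_t k = 0; k < n; ++k) { mean_x += x[k]; mean_y += y[k]; }
    mean_x /= n; mean_y /= n;
    float cov = 0.f, var_x = 0.f, var_y = 0.f;
    for (size_t k = 0; k < n; ++k) {
        const float dx = x[k] - mean_x, dy = y[k] - mean_y;
        cov += dx * dy; var_x += dx * dx; var_y += dy * dy;
    }
    return cov / std::sqrt(var_x * var_y);  // res[i][j] for rows i and j
}
```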
On my NVIDIA A30, the benchmark results are:
Using pinned memory:
CPU total cost = 9640.08ms
Malloc cost = 0.404802ms
CPU to GPU cost = 9.9302ms
Kernel cost = 61.7759ms
GPU to CPU cost = 54.9704ms
Free memory cost = 18.4563ms
GPU total cost = 145.62ms
Using normal memory:
CPU total cost = 9693.36ms
Malloc cost = 0.524011ms
CPU to GPU cost = 10.5241ms
Kernel cost = 61.9264ms
GPU to CPU cost = 81.8329ms
Free memory cost = 0.720419ms
GPU total cost = 155.602ms
With pinned memory, GPU/CPU data transfer is faster, as expected. However, cudaFreeHost always costs around 18.25ms, which takes up a noticeable 12% of the total cost. Why is it so slow, and how can I improve it?
Note:
Build command: nvcc -o main main.cu -std=c++17 -O3 -gencode=arch=compute_80,code=sm_80
The lines that contain cudaHostAlloc and cudaFreeHost are at lines 260->264 and 282->283. Just uncomment the version you want to test.
I know that using vector<vector<float>> to represent a matrix is bad, and that I should use a contiguous buffer in memory. But the API constraints require me to return a vector<vector<float>>, so ignore that.
correlation_gpu.cu (11.5 KB)
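For readers without the attachment, a hypothetical sketch of the two variants being toggled at those lines might look like this (names and structure are assumptions about the attached file, error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Version 1: pinned host memory for the result buffer.
void copy_back_pinned(const float* d_res, float*& h_res, size_t bytes) {
    cudaHostAlloc((void**)&h_res, bytes, cudaHostAllocDefault);  // pinned
    cudaMemcpy(h_res, d_res, bytes, cudaMemcpyDeviceToHost);
    // ... copy h_res into the vector<vector<float>> result ...
    cudaFreeHost(h_res);   // the ~18 ms call being asked about
}

// Version 2: pageable host memory for the result buffer.
void copy_back_pageable(const float* d_res, float*& h_res, size_t bytes) {
    h_res = (float*)malloc(bytes);                               // pageable
    cudaMemcpy(h_res, d_res, bytes, cudaMemcpyDeviceToHost);
    // ... copy h_res into the vector<vector<float>> result ...
    free(h_res);
}
```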
Edit: the forum only allows new users to attach 1 file. So here is “my_timer.h”
Someone from NVIDIA may be able to give an authoritative answer. Here is my understanding of the situation:
cudaHostAlloc needs to provide page-locked (pinned) host memory for the allocation, which takes a lot of effort on the part of the operating system, especially for large allocations.
As far as I am aware (my knowledge is limited), the time spent in cudaHostAlloc is almost entirely spent in system calls like mmap and ioctl. You could try using a system-level tracing facility like strace to look at the details of that.
I am not clear on why freeing the allocated memory is also slow. My hypothesis is that this uses munmap calls that are just as expensive as mmap, for reasons unknown to me but likely known to operating-system experts.
Generally speaking, dynamic allocation and de-allocation of memory should be minimized. The cost inside CUDA has been consistently high over the years, even though there have been indications that NVIDIA has spent some effort trying to reduce the overhead. You may want to examine whether pinned allocations are necessary in the context of your use case. While they can provide higher host<->device transfer throughput, I would expect this speedup to be moderate on modern workstation and server platforms (which have decent system memory throughput compared to the platforms available early in the evolution of CUDA).
Generally speaking, the performance of CUDA APIs that depend heavily on operating-system calls like these scales largely with the single-thread performance of the CPU. I would recommend using CPUs with a base frequency of 3.5 GHz or more.
These are screenshots from nsight-systems for cudaMallocHost, malloc, and std::vector. Your original code does not compile because “my_timer.h” is missing.
Reported device-to-host transfer speeds are 13.1 GiB/s, 2 GiB/s, and 11 GiB/s, respectively.
I added the test with std::vector because the transfer speed with plain malloc was unreasonably low. I suspect that in the case of std::vector the memory is initialized so the page tables for the buffer can be set up while the kernel is running (note the gap between kernel launch and cudaDeviceSynchronize). With malloc this probably happens during the transfer.
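To illustrate that hypothesis, here is a sketch of pre-touching a plain malloc'd buffer so its pages are populated before the copy. This is an assumption about what would help, not something the original code does:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Allocate a pageable buffer and fault in all of its pages immediately,
// e.g. while the kernel is still running, so the later device-to-host
// copy does not pay the page-population cost (similar to what a
// value-initializing std::vector does implicitly).
float* alloc_pretouched(size_t bytes) {
    float* h_buf = (float*)malloc(bytes);
    memset(h_buf, 0, bytes);   // touch every page now ...
    return h_buf;              // ... then later: cudaMemcpy(h_buf, d_res, bytes, cudaMemcpyDeviceToHost)
}
```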
cudaFreeHost is expected to be slower than free. It is doing extra work. A full description of the extra work is not provided by NVIDIA, but can be inspected using a utility like strace. Among other things, it is manipulating CPU page tables and requesting the OS to change parameters of the pages that were previously mapped.
There isn’t anything you can do to make cudaFreeHost run faster. If 18.25ms is 12% of total cost, then you’re indicating the total cost is on the order of 100ms. If that is the sum total of all the work you intend to do on the GPU, you are probably wasting your time using CUDA.
However, if that 100ms is part of some larger workflow, you may wish to do the following:
- Don’t allocate/free at each step. Instead, allocate buffers once, at the beginning of your application, and free them once, at the end (see the sketch after this list).
- To the extent possible, re-use a pinned buffer. You already know that pinning improves transfer time. As much as possible, re-use the buffers for different data items, rather than providing a separate buffer for each.
- Reduce the size of the buffers to the minimum necessary/practical to achieve a particular benefit. In my experience, the time cost of cudaHostAlloc and cudaFreeHost has a linear component related to the size of the allocation.
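As a sketch of the allocate-once/re-use pattern from the first two points (names are illustrative, assuming repeated device-to-host transfers of the same size):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// One pinned staging buffer, created at application start and reused for
// every transfer, so the cudaHostAlloc/cudaFreeHost cost is paid exactly once.
struct PinnedScratch {
    float* host = nullptr;
    size_t bytes = 0;
    explicit PinnedScratch(size_t b) : bytes(b) {
        cudaHostAlloc((void**)&host, bytes, cudaHostAllocDefault);  // pay pinning cost once
    }
    ~PinnedScratch() { cudaFreeHost(host); }                        // free once, at the end
};

// Re-use the same pinned buffer for each device-to-host copy instead of
// allocating and freeing a new buffer per step.
void fetch_result(const float* d_res, PinnedScratch& scratch) {
    cudaMemcpy(scratch.host, d_res, scratch.bytes, cudaMemcpyDeviceToHost);
}
```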
It’s certainly possible that none of these suggestions are useful for you. There isn’t anything you can do to reduce the cost of a specific cudaFreeHost call.