Is cudaMallocHost allocated physical memory?

_JJANG · July 2, 2020, 10:22am

int* pMem;
cudaMallocHost((void**)&pMem, sizeof(int));

Hello
I have a question about cudaMallocHost.

Is pMem virtual memory? Or is it physical memory?
cudaMallocHost allocates pinned (page-locked) memory on the host. Can this memory (pMem) be used by the GPU and the CPU together?
Or do I have to call cudaMemcpy to use it on the GPU?

maiconfaria · July 3, 2020, 11:51am

pMem will be accessible by both the GPU and the CPU if you allocate it with cudaMallocManaged (CUDA Unified Memory). With cudaMallocHost you have to cudaMemcpy to access it from the GPU.

striker159 · July 3, 2020, 4:05pm

Memory allocated via cudaMallocHost is directly accessible from GPUs without the need of explicit memcpy.

See documentation: CUDA Runtime API :: CUDA Toolkit Documentation

Description

Allocates size bytes of host memory that is page-locked and accessible to the device.

maiconfaria · July 8, 2020, 6:17pm

"Directly" means that you can access the memory asynchronously with respect to the CPU. The device pointer should not be the same, and you still have to copy explicitly with cudaMemcpy or cudaMemcpyAsync.

striker159 · July 9, 2020, 5:17am

No. You can use the pointer returned by cudaMallocHost in a kernel, and do not need explicit transfer. See this toy example:

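#include <cstdio>

// Adds 5 to each of the first n elements, reading and writing the
// pinned host buffer directly from the GPU.
__global__ void kernel(int* data, int n){
    if(threadIdx.x < n){
        data[threadIdx.x] = data[threadIdx.x] + 5;
    }
}

int main(){
    int* h_ptr;
    // Page-locked host allocation; no cudaMalloc or cudaMemcpy is used below.
    cudaMallocHost(&h_ptr, sizeof(int) * 3);
    h_ptr[0] = 1;
    h_ptr[1] = 2;
    h_ptr[2] = 3;
    // The host pointer is passed straight to the kernel.
    kernel<<<1,32>>>(h_ptr, 3);
    // Wait for the kernel to finish before reading the results on the CPU.
    cudaDeviceSynchronize();
    printf("%d %d %d\n", h_ptr[0], h_ptr[1], h_ptr[2]);
    cudaFreeHost(h_ptr);
    return 0;
}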

Compiled with nvcc -arch=sm_61 main.cu -o main
Output: 6 7 8
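
For anyone who wants to check the pointer question on their own machine, here is a minimal sketch that is not from the posts above. The value returned by cudaMallocHost is a virtual address in the process, backed by page-locked physical pages. The sketch asks the runtime how it classifies that address (cudaPointerGetAttributes) and what device-side alias it reports (cudaHostGetDevicePointer). It assumes CUDA 10 or newer, where cudaPointerAttributes has a `type` field, and a 64-bit system with unified virtual addressing; the helper name pinned_pointer_is_shared is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the runtime reports the allocation as
// pinned host memory and the device-side alias equals the host pointer.
bool pinned_pointer_is_shared() {
    int* h_ptr = nullptr;
    if (cudaMallocHost(&h_ptr, sizeof(int)) != cudaSuccess) return false;

    // Ask the runtime how it classifies this address.
    cudaPointerAttributes attr;
    bool ok = cudaPointerGetAttributes(&attr, h_ptr) == cudaSuccess;

    // Ask for the address the GPU should use for the same memory.
    void* d_alias = nullptr;
    ok = ok && cudaHostGetDevicePointer(&d_alias, h_ptr, 0) == cudaSuccess;

    if (ok) {
        printf("host pointer:   %p\n", (void*)h_ptr);
        printf("device alias:   %p\n", d_alias);
        printf("is host memory: %d\n", attr.type == cudaMemoryTypeHost);
        // Assumption: with unified virtual addressing both addresses match.
        ok = (attr.type == cudaMemoryTypeHost) && (d_alias == (void*)h_ptr);
    }
    cudaFreeHost(h_ptr);
    return ok;
}

int main() {
    printf("shared pointer: %s\n", pinned_pointer_is_shared() ? "yes" : "no");
    return 0;
}
```

If the two addresses match, the same pointer value is valid on both sides, which is why the toy example can pass h_ptr to the kernel unchanged.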

maiconfaria · July 15, 2020, 2:38pm

Good to know. Thanks!

teamvraz · July 15, 2020, 10:55pm

While cudaMallocHost memory can be accessed by both the CPU and the GPU, at least in my latest project (Turing, compute capability 7.5, CUDA 11.0) the profiler shows it "living" in CPU/system memory, and there was a device performance impact in my case.

When I used it for an input buffer (written once by the CPU, then read sequentially by the GPU in 32-bit reads), it was noticeably slower than allocating device memory and using cudaMemcpy to transfer from host to device. It is possible that more optimized reads would hide the impact. (It is also possible that the cost is merely shifted to the cudaMemcpy rather than paid during kernel runtime, but cudaMemcpy is highly optimized.)

I continue to use cudaMallocHost for output buffers (I cannot beat the convenience) and for very small input buffers (a few KB or less). I was surprised when a 256 KB input buffer showed such an impact.
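
A rough way to reproduce this kind of comparison is sketched below. It is not the code from the project described above; the kernel, the helper names (sum_kernel, time_sum, compare) and the 256 KB buffer size are assumptions chosen for illustration, and error checking is left out for brevity. It times the same kernel twice with CUDA events: once reading the pinned host buffer directly, once reading a device copy made with cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one 32-bit element and adds it to a device-side total.
__global__ void sum_kernel(const int* in, int n, unsigned long long* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (unsigned long long)in[i]);
}

// Runs sum_kernel on `in`, stores the total in *result, returns kernel time in ms.
float time_sum(const int* in, int n, unsigned long long* result) {
    unsigned long long* d_out;
    cudaMalloc(&d_out, sizeof(*d_out));
    cudaMemset(d_out, 0, sizeof(*d_out));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    sum_kernel<<<(n + 255) / 256, 256>>>(in, n, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(result, d_out, sizeof(*d_out), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

// Times the pinned-host path and the device-copy path; true if the sums agree.
bool compare(int n, float* ms_pinned, float* ms_device) {
    int* h_in;
    cudaMallocHost(&h_in, n * sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int* d_in;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);  // not timed

    unsigned long long s_warm = 0, s_pinned = 0, s_device = 0;
    time_sum(d_in, n, &s_warm);                 // warm-up launch, result ignored
    *ms_pinned = time_sum(h_in, n, &s_pinned);  // GPU reads host memory directly
    *ms_device = time_sum(d_in, n, &s_device);  // GPU reads device memory

    cudaFree(d_in);
    cudaFreeHost(h_in);
    return s_pinned == s_device && s_pinned == (unsigned long long)n;
}

int main() {
    const int n = 64 * 1024;  // 64K ints = 256 KB
    float ms_pinned = 0.0f, ms_device = 0.0f;
    bool ok = compare(n, &ms_pinned, &ms_device);
    printf("pinned host input: %.3f ms\n", ms_pinned);
    printf("device input:      %.3f ms\n", ms_device);
    printf("results match:     %s\n", ok ? "yes" : "no");
    return 0;
}
```

Note that the device-path number excludes the cudaMemcpy itself, so a fair end-to-end comparison would time the copy as well. The figures will vary with the GPU, the bus, and the access pattern.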
