Does cudaMallocHost allocate physical memory?

int* pMem;
cudaMallocHost((void**)&pMem, sizeof(int));

I have a question about cudaMallocHost.

  1. Is pMem virtual memory? Or is it physical memory?

  2. cudaMallocHost allocates pinned (page-locked) host memory. Can this memory (pMem) be used by both the GPU and the CPU?
    Or do I have to cudaMemcpy to use it on the GPU?

pMem will be accessible by both GPU and CPU if you allocate it with cudaMallocManaged (CUDA Unified Memory). With cudaMallocHost you have to cudaMemcpy to access it from the GPU.
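A minimal sketch of the cudaMallocManaged pattern mentioned above (the kernel and values here are illustrative, not from the thread):

```cuda
#include <cstdio>

__global__ void increment(int* p) {
    *p += 1;
}

int main() {
    int* pMem;
    // Managed memory: a single pointer valid on both host and device.
    cudaMallocManaged(&pMem, sizeof(int));
    *pMem = 41;                 // CPU write
    increment<<<1, 1>>>(pMem);  // GPU uses the same pointer
    cudaDeviceSynchronize();    // wait for the kernel before the CPU reads
    printf("%d\n", *pMem);      // prints 42
    cudaFree(pMem);
    return 0;
}
```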

Memory allocated via cudaMallocHost is directly accessible from GPUs without the need of explicit memcpy.

See documentation:


Allocates size bytes of host memory that is page-locked and accessible to the device.

"Directly" means that the device can access the memory asynchronously with respect to the CPU. The device's pointer may not be the same as the host's, and you still have to copy explicitly with cudaMemcpy or cudaMemcpyAsync.
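For reference, the device-side alias of a mapped pinned buffer can be queried with cudaHostGetDevicePointer; a small sketch (using cudaHostAlloc with the cudaHostAllocMapped flag, which is an assumption here, not from the thread):

```cuda
#include <cstdio>

int main() {
    int* h_ptr;
    // Pinned host allocation, explicitly mapped into the device address space.
    cudaHostAlloc(&h_ptr, sizeof(int), cudaHostAllocMapped);

    int* d_ptr;
    // On platforms without unified virtual addressing, the device-side
    // alias of a mapped host buffer must be queried explicitly:
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);

    // On UVA systems d_ptr equals h_ptr; elsewhere the two may differ.
    printf("host %p, device %p\n", (void*)h_ptr, (void*)d_ptr);
    cudaFreeHost(h_ptr);
    return 0;
}
```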

No. You can use the pointer returned by cudaMallocHost in a kernel, and do not need explicit transfer. See this toy example:

#include <cstdio>

__global__ void kernel(int* data, int n){
    if(threadIdx.x < n){
        data[threadIdx.x] = data[threadIdx.x] + 5;
    }
}

int main(){
    int* h_ptr;
    cudaMallocHost(&h_ptr, sizeof(int) * 3);
    h_ptr[0] = 1;
    h_ptr[1] = 2;
    h_ptr[2] = 3;
    kernel<<<1,32>>>(h_ptr, 3);
    cudaDeviceSynchronize();  // kernel launch is asynchronous; wait before reading
    printf("%d %d %d\n", h_ptr[0], h_ptr[1], h_ptr[2]);
    cudaFreeHost(h_ptr);
    return 0;
}
Compiled with nvcc -arch=sm_61 -o main
Output: 6 7 8


good to know. Thanks!

While cudaMallocHost memory can be accessed by both the CPU and the GPU, at least in my latest project (Turing, compute capability 7.5, CUDA 11.0) the profiler shows it "living" in CPU/system memory, and there was a device-side performance impact in my case.

When using it for an input buffer (CPU writes once, GPU does sequential 32-bit reads), it was noticeably slower than allocating device memory and using cudaMemcpy to transfer from host to device. It is possible that more optimized read patterns would hide the impact. (Or the cost may simply shift into the cudaMemcpy rather than the kernel runtime, but the former is highly optimized.)
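The explicit-transfer alternative described above can be sketched as follows; the kernel, sizes, and names are illustrative assumptions, not the poster's code (1 << 16 ints is the ~256 KB case mentioned below):

```cuda
#include <cstdio>

__global__ void consume(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

int main() {
    const int n = 1 << 16;           // 64K ints = 256 KB
    size_t bytes = n * sizeof(int);

    int* h_in;
    cudaMallocHost(&h_in, bytes);    // pinned staging buffer on the host
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // Explicit copy: the kernel then reads fast device memory
    // instead of pulling each word over the PCIe bus.
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    consume<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
```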

I continue using cudaMallocHost for output buffers (can't beat the convenience) and very small input buffers (a few KB or less). I was surprised that a 256 KB input buffer showed such an impact.