Pinned (page-locked) memory and CUDA arrays problem


I do frequent transfers between host and device. Previously I was using ordinary (pageable) host memory.
Now I have moved to page-locked memory and started getting problems.

When I try to copy it using cudaMemcpy3D (copying to a 3D CUDA array), it crashes somewhere inside the driver code (on the 3rd call to it).

In the spec I found this, under "Overlap of Data Transfer and Kernel Execution":
Some devices of compute capability 1.1 and higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Applications may query this capability by calling cudaGetDeviceProperties() and checking the deviceOverlap property. This capability is currently supported only for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch() (see Section 3.2.1).

OK, but this is about asynchronous copies; I am trying to do a synchronous one. Is that supported?


I found that a linear cudaMemcpy also fails. The problem seems to be unrelated to CUDA arrays, but related to page-locked memory.

I’m allocating about 8 page-locked buffers, each 16k in size. That doesn’t look like too much, does it?
I use exactly the same code to allocate each of them, but for some reason reading from the 3rd one fails and crashes inside the driver code.
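
For reference, a minimal single-threaded sketch of the setup described above. Synchronous cudaMemcpy from page-locked memory is supported, so something like this should work on its own. The function name copy_pinned_buffers, the constants, and the reading of "16k" as 16 KiB are assumptions for illustration, not the original code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Sizes taken from the post; "16k" is assumed to mean 16 KiB.
static const int    kNumBuffers = 8;
static const size_t kBufBytes   = 16 * 1024;

// Allocates the pinned buffers, copies each one to the device with a
// synchronous cudaMemcpy, and returns how many copies succeeded.
int copy_pinned_buffers() {
    void* host[kNumBuffers] = {};
    void* dev = nullptr;
    int copied = 0;

    if (cudaMalloc(&dev, kBufBytes) != cudaSuccess) return 0;

    // Allocate and fill every page-locked buffer, checking each call.
    for (int i = 0; i < kNumBuffers; ++i) {
        cudaError_t err = cudaMallocHost(&host[i], kBufBytes);
        if (err != cudaSuccess) {
            std::printf("alloc %d failed: %s\n", i, cudaGetErrorString(err));
            host[i] = nullptr;
            break;
        }
        std::memset(host[i], i, kBufBytes);
    }

    // Synchronous host-to-device copy from each pinned buffer.
    for (int i = 0; i < kNumBuffers && host[i] != nullptr; ++i) {
        cudaError_t err = cudaMemcpy(dev, host[i], kBufBytes,
                                     cudaMemcpyHostToDevice);
        if (err != cudaSuccess) {
            std::printf("copy %d failed: %s\n", i, cudaGetErrorString(err));
        } else {
            ++copied;
        }
    }

    for (int i = 0; i < kNumBuffers; ++i) {
        if (host[i] != nullptr) cudaFreeHost(host[i]);
    }
    cudaFree(dev);
    return copied;
}
```

If this returns 8 when everything runs in one thread, the allocation size and the copy call are probably not the cause, and the failure comes from somewhere else in the program.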

OK, I found the problem. It will probably be useful for someone.

The problem is related to multithreading. What I was doing was:

  1. Allocate and fill page locked memory in thread 1.
  2. Read and delete page locked memory in thread 2.

But a CUDA page-locked allocation is bound to the thread that made it, so when thread 1 finishes, the memory is deallocated automatically and is no longer valid in thread 2.

I think your deduction is wrong… I use it all the time.

You probably have to pass the cudaHostAllocPortable flag to cudaHostAlloc.
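
A rough sketch of what that could look like with the two-thread pattern from the earlier post: one thread allocates and fills the buffer with cudaHostAllocPortable, then exits, and a second thread copies from it and frees it. The function name portable_roundtrip_ok, the 16 KiB size, and the use of std::thread are assumptions for illustration. On recent CUDA versions, where host threads share one context per device, this may well work without the flag too, so treat it as a demonstration of the call rather than proof of the diagnosis.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <thread>
#include <vector>

static const size_t kBytes = 16 * 1024;  // assumed buffer size (16 KiB)

// Returns true if a buffer allocated and filled in one thread can be copied
// to the device and back from a second thread after the first has exited.
bool portable_roundtrip_ok() {
    void* host = nullptr;
    cudaError_t allocErr = cudaErrorUnknown;

    // Thread 1: allocate page-locked memory as portable and fill it.
    std::thread producer([&] {
        allocErr = cudaHostAlloc(&host, kBytes, cudaHostAllocPortable);
        if (allocErr == cudaSuccess) std::memset(host, 0x5A, kBytes);
    });
    producer.join();  // thread 1 is gone before thread 2 touches the buffer
    if (allocErr != cudaSuccess) return false;

    bool ok = false;
    // Thread 2: read the buffer (copy to device and back), then delete it.
    std::thread consumer([&] {
        void* dev = nullptr;
        std::vector<unsigned char> back(kBytes, 0);
        if (cudaMalloc(&dev, kBytes) == cudaSuccess) {
            ok = cudaMemcpy(dev, host, kBytes,
                            cudaMemcpyHostToDevice) == cudaSuccess &&
                 cudaMemcpy(back.data(), dev, kBytes,
                            cudaMemcpyDeviceToHost) == cudaSuccess &&
                 std::memcmp(back.data(), host, kBytes) == 0;
            cudaFree(dev);
        }
        cudaFreeHost(host);
    });
    consumer.join();
    return ok;
}
```

If the same test fails with plain cudaMallocHost but passes with the portable flag, that would support the explanation that the allocation was tied to the first thread.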