cudaHostRegister returns cudaErrorInvalidValue

I have opened a file in read-only mode and mapped it into host memory using mmap(…):

uint8_t *data_ptr = (uint8_t *) mmap(NULL, NumOfBytes, PROT_READ, MAP_PRIVATE, file_descriptor, 0);

Now I want to pin the memory using cudaHostRegister(…) so that I can use it with the CUDA API cudaMemcpyAsync(…).


The mmap() doesn’t return an error, but the cudaHostRegister() does: error code 11 (cudaErrorInvalidValue).

The description of cudaErrorInvalidValue says the following:

This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values

I came to understand that the mapped file is not backed by physical pages and therefore cannot be used as pinned memory, so I did the following:

/* This ptr will hold a physical (malloc'ed) copy of the file */
    uint8_t *ptr = (uint8_t *) malloc(NumOfBytes);

/* Virtual address of the mapped file */
    uint8_t *tmp_ptr = (uint8_t *) mmap(NULL, NumOfBytes, PROT_READ, MAP_PRIVATE, file_descriptor, 0);

/* Copy the contents of the file into the staging buffer */
    memcpy(ptr, tmp_ptr, NumOfBytes);

/* Unmap the file */
    munmap(tmp_ptr, NumOfBytes);

/* Register (pin) the staging buffer */
    cudaHostRegister(ptr, NumOfBytes, cudaHostRegisterDefault);

This technique worked but there are two issues with this approach.

  1. memcpy takes time for big files.

  2. memcpy fails (segmentation fault) for files of ~4 GB, even though I have ~10 GB of free memory.

  1. I haven't tried this myself but you probably need to add the MAP_LOCKED flag to the mmap() call. Then you shouldn't need the copy to a separate location.
  2. Make sure you pass a size_t to all relevant calls and do not allow it to be truncated to an (unsigned) int anywhere. You may also need to raise the limit for locked memory (ulimit -l in bash).

You are right about this. I used mlock() on the pointer and it worked. Thank you.

After trying this myself, it turned out you don’t even need to pass the MAP_LOCKED flag to mmap(). There are a few other caveats however listed in the documentation for cudaHostRegister().

Here is the current scenario, which works, but I am skeptical about why it works.

I mapped the file (without pinning/locking, i.e., without using mlock or cudaHostRegister) and used it directly in cudaMemcpyAsync. I am curious why cudaMemcpyAsync didn’t complain that the host memory isn’t locked!

The cudaMemcpyAsync description says the following:

It (cudaMemcpyAsync) <b>only works on page-locked host memory</b> and returns an error if a pointer to pageable memory is passed as input

[s]If I recall correctly, for historical reasons of backwards compatibility, cudaMemcpyAsync() will silently convert to a synchronous copy if the memory passed in is pageable rather than page-locked.

Here is my vague recollection: The cudaMemcpyAsync() description states the intended behavior as it was designed. However, in some very early CUDA versions there was a bug in the implementation. By the time it was discovered, correcting the bug would have meant breaking existing applications. Not a good idea if you are trying to grow mindshare for a new parallel programming environment. The silent fall-back to cudaMemcpy() when passed a pointer to pageable memory was a way to get out from between a rock and a hard place.

My memory may be faulty after all this time, so I am inviting Robert Crovella to refute this account if I got it wrong.[/s]

On modern systems with copious system memory bandwidth, copies to / from pageable memory are quite fast, as system memory bandwidth easily exceeds PCIe bandwidth by a non-trivial factor (e.g. ~12.5 GB/sec for PCIe vs ~70 GB/sec for system memory). But the synchronous nature of the copy in the fall-back case could cause performance artifacts (e.g. by interfering with stream operation).

I am using two streams, each with its respective callbacks. Here are the slices from the Visual Profiler for the locked and unlocked versions of my module, respectively.

Both of these figures show that simultaneous copy and kernel execution takes place.

No, it does not say that. Not in any reputable source. That is a completely false statement.

Here is a reputable source for a description of cudaMemcpyAsync:

The reference to its behavior under various scenarios is given in this statement:

“This function exhibits asynchronous behavior for most use cases.”

where the embedded hyperlink takes you to this page, which describes its behavior in various scenarios of pageable vs. pinned memory:

What Robert Crovella said. Apparently my memory is poor :-(

It seems fine to me. Your statement: “The silent fall-back to cudaMemcpy() when passed a pointer to pageable memory” is a very concise and AFAIK accurate description of the behavior (and equivalently points out the flaw in OP’s claim). I don’t think any of the links I’ve provided indicate anything different. I don’t have the historical background that you have, so can’t really comment on the rest of it. It may be entirely accurate for all I know.

Again, just as you said, “cudaMemcpyAsync() will silently convert to a synchronous copy if the memory passed in is pageable rather than page-locked”

I had actually read this description from here

As you just said, it’s a non-reputable source, so we will just ignore what they say.

In my case, I am using pageable memory, and the profiler shows that copy and kernel execution take place asynchronously. Does that mean the driver first pins a portion of memory and then executes cudaMemcpyAsync()?

Actually, I want to avoid pinning the mapped region for large files because it takes time on the host.

Actually, that link appears to point to a copy of the official CUDA 4.0 documentation (from about 8 years ago), which does contain the sentence you quoted earlier. So maybe my recollection of the behavior in newer versions of CUDA differing from this description due to a bug in the early days of CUDA was correct after all …

The key thing to keep in mind is that APIs can change, or their description may be corrected and clarified over time. It is therefore best to consult the documentation for the CUDA version one is actually using. Which is likely to be CUDA 8.x, 9.x, or 10 at this time.

The way cudaMemcpy() performs host -> device copies is by copying from the original host source to a page-locked buffer maintained by the driver, and from there the contents are transferred to the GPU by DMA (“copy engine”). So it works pretty much like you imagined. How that process is reflected in the output of the CUDA profiler, I could not say off the top of my head. For small host -> device copies the driver may also elect to send down the data with the command stream, to reduce latency compared to the two-step process.