cudaHostRegister returns cudaErrorInvalidValue

tajiknomi · December 5, 2018, 7:24am

I have opened a file in READONLY mode. Have mapped it in the host memory using mmap(…)

uint8_t *data_ptr = (uint8_t *) mmap(NULL,NumOfBytes,PROT_READ,MAP_PRIVATE, file_descriptor, 0);

Now i want to lock the memory using cudaHostRegister(…) so that i can use this in cuda API cudaMemcpyAsync(…)

cudaHostRegister(data_ptr,NumOfBytes,cudaHostRegisterDefault);

The mmap doesn’t return an error but the cudaHostRegister does i.e. error code 0x11 (cudaErrorInvalidValue).

cudaErrorInvalidValue descriptions says the following:

This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values

tajiknomi · December 10, 2018, 9:36am

I came to understood that the mapped file(s) are not backed by physical addresses, therefor, they can’t be used as a pinned memory, so i did the following.

/* This ptr will hold the physical location of the file */
    ptr = malloc(size)

/* Virtual address of mapped file */
    tmp_ptr = mmap(file)

/* Copy the contents of file to the ptr */
    memcpy(ptr,tmp_ptr,size)

/* unmapping the file */
    munmap(tmp_ptr,..)

/* Register the ptr */
    cudaHostRegister(ptr,size,..)

This technique worked but there are two issues with this approach.

memcpy takes time for big files.
memcpy fails (segmentation fault) for files ~4GB.

Though i have memory free space available ~10GB.

tera · December 10, 2018, 2:02pm

I haven't tried this myself but you probably need to add the MAP_LOCKED flag to the mmap() call. Then you shouldn't need the copy to a separate location.
Make sure you pass in a size_t to all relevant calls and do not allow it to be truncated to (unsigned) integers anywhere. You may also need to increase the limit for locked memory (ulimit -m in bash).

tajiknomi · December 12, 2018, 9:06am

You are right about this, i used mlock using one pointer and it worked. thank you.

tera · December 13, 2018, 9:47pm

After trying this myself, it turned out you don’t even need to pass the MAP_LOCKED flag to mmap(). There are a few other caveats however listed in the documentation for cudaHostRegister().

tajiknomi · December 17, 2018, 5:53am

The current scenario which is working but i am skeptical why this works.

I had Mapped the file (without Pinning/locking i.e. Not using mlock or cudaHostRegister) and used it directly in cudaMemcpyAsync. I am curious why cudaMemcpyAsync didn’t complained that the host memory isn’t locked !

The cudaMemcpyAsync description says the following

It (cudaMemcpyAsync) <b>only works on page-locked host memory</b> and returns an error if a pointer to pageable memory is passed as input

njuffa · December 17, 2018, 7:06am

~~If I recall correctly, for historical reasons of backwards compatibility, cudaMemcpyAsync() will silent convert to a synchronous copy if the memory passed in is pageable rather than page-locked.~~

Here is my vague recollection: The cudaMemcpyAsync() description states the intended behavior as it was designed. However, in some very early CUDA versions there was a bug in the implementation. By the time it was discovered, correcting the bug would have meant breaking existing applications. Not a good idea if you are trying to grow mindshare for a new parallel programming environment. The silent fall-back to cudaMemcpy() when passed a pointer to pageable memory was a way to get out from between a rock and a hard place.

My memory maybe faulty after all this time, so I am inviting Robert Crovella to refute this account if I got it wrong.

On modern systems with copious system memory bandwidth, copies to / from pageable memory are quite fast, as system memory bandwidth easily exceeds PCIe bandwidth by a non-trivial factor, e.g. 12.5 GB/sec vs 70 GB/sec. But the synchronous nature of the copy in the fall-back case could cause performance artifacts (e.g. by interfering with stream operation).

tajiknomi · December 17, 2018, 9:29am

I am using two streams along with its respective callbacks. Here is the slice from visual-profiler for the lock and unlock version of my module respectively.

[url]Dropbox - pinned.png - Simplify your life

[url]Dropbox - Pageable.png - Simplify your life

Both these figures shows that simultaneous copy-kernel execution takes place.

Robert_Crovella · December 17, 2018, 2:33pm

No, it does not say that. Not in any reputable source. That is a completely false statement.

Here is a reputable source for a description of cudaMemcpyAsync:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

The reference to its behavior under various scenarios is given in this statement:

“This function exhibits asynchronous behavior for most use cases.”

where the embedded hyperlink takes you to this page, which describes its behavior in various scenarios of pageable vs. pinned memory:

https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

njuffa · December 17, 2018, 6:01pm

What Robert Crovella said. Apparently my memory is poor :-(

Robert_Crovella · December 17, 2018, 9:49pm

It seems fine to me. Your statement: “The silent fall-back to cudaMemcpy() when passed a pointer to pageable memory” is a very concise and AFAIK accurate description of the behavior (and equivalently points out the flaw in OP’s claim). I don’t think any of the links I’ve provided indicate anything different. I don’t have the historical background that you have, so can’t really comment on the rest of it. It may be entirely accurate for all I know.

Again, just as you said, “cudaMemcpyAsync() will silently convert to a synchronous copy if the memory passed in is pageable rather than page-locked”

tajiknomi · December 18, 2018, 4:16am

I had actually read this description from here http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g732efed5ab5cb184c920a21eb36e8ce4.html

As you had just said, its non-reputable source, so we will just ignore what they say.

In my case, i am using pageable memory and the profiler shows that the copy-kernel execution take place asynchronously. Does it means that the driver first pin the portion of memory and then execute cudaMemcpyAsync() ?

Actually i want to avoid pinning the mapped region for large files because it take time on the host.

njuffa · December 18, 2018, 4:31am

Actually, that link appears to point to a copy of the official CUDA 4.0 documentation (from about 8 years ago), which does contain the sentence you quoted earlier. So maybe my recollection of the behavior in newer versions of CUDA differing from this description due to a bug in the early days of CUDA was correct after all …

The key thing to keep in mind is that APIs can change, or their description may be corrected and clarified over time. It is therefore best to consult the documentation for the CUDA version one is actually using. Which is likely to be CUDA 8.x, 9.x, or 10 at this time.

The way cudaMemcpy() performs host → device copies is by copying from the original host source to a page-locked buffer maintained by the driver, and from there the contents is transferred to the GPU by DMA (“copy engine”). So pretty much like you imagined it to work. How that process reflects in the output of the CUDA profiler, I could not say off the top of my head. For small host → device copies the driver may also elect to send down the data with the command stream, to reduce latency compared to the two-step process.

kshitij.lakhani · January 27, 2021, 11:53pm

@tajiknomi So how did you get about this problem then ?
@njuffa @Robert_Crovella Could you chime in as well?
I have mmap’d a file the same way OP had and now wish to use it with cudaMemcpyAsync() H2D. Is the correct method to just page-lock and register using cudaHostRegister() after mmap() has been performed ?

I ask because cudaHostRegister() page-locks the memory as per the documentation however, cudaMemcpyAsync() requires pinned host memory. My understanding is that pinned and page-locked are not the same .
Looking for correct guidance. TIA

Robert_Crovella · January 28, 2021, 3:07am

I have no idea. Perhaps OP will be able to help. I wouldn’t consider this a typical or tested methodology. Did you try any of the instructions listed here? OP seems to indicate things are working. I don’t know of any NVIDIA guidance in this area.