NvMapReserveOp error

NvMapReserveOp 0x80000002 failed [12]
This occurred in a situation where memory resources may have been stressed, but the call where it arises doesn’t have anything obvious to do with allocating or freeing memory. We’ve had trouble replicating it because it occurs rarely (though not rarely enough to avoid causing us problems), and we are still working through investigations of our own. So I’m not yet at the point of providing example code, or a definite example of “when I do this, this happens”.

I’ve been following this thread as well, and AastaLLL does say that the team is looking into it.

However, I was wondering if any clarity could be provided about this message. NvMapReserveOp doesn’t appear in any manuals I’ve come across, and I can’t find it mentioned in any of the CUDA libraries. Is it some low-level operation? Where does it come from? Does the 0x80000002 following it refer to a register? Other threads show 0 and 1, but ours is the first I’ve seen with 2. What does the number in the [...] mean? The other threads mentioning this error also show different numbers there, but if it is a cudaError code, 12 (cudaErrorInvalidPitchValue) doesn’t seem quite right.

Hi,

The error is related to memory allocation.
Since it is a low-level issue, it won’t be included in our documentation.

The original thread reproduces it via cuFFT with a large 4 GB+ buffer.
Do you use cuFFT and create a large allocation as well?

Thanks.

That all depends on what you consider a large allocation. We allocate ~17 MB of data at a time, but reuse these buffers; near program start, we allocate something like 46 additional ~1 MB blocks and reuse them too. In total, that puts us around 100 MB used (rough estimates). We do a cuFFT operation, but the point where the error code pops up is much later than the point where we did the FFT. So far, it seems to happen when we try to compute a squared L2 norm on a single complex float of data… and it absolutely doesn’t make sense to me right now how that operation could throw a memory allocation error.

I’m monitoring memory usage right now and trying to gather more information on how it happens.

On one occurrence, however, I did catch another process triggering an OOM error and crashing, which seemed to produce this same NvMapReserveOp error in the exact same place.

I’ll post more as I find out.

I should mention what I’m working with here: CUDA 10.0 on the TX2i, and I think JetPack 4.2.1(?). (I’ll have to check that.)

Hi,

There is some follow-up in the thread “NvMapReserveOp 0x80000000 failed [22] when running cuFFT plans”.
The error is caused by a failure when the system tries to allocate a temporary memory workspace.
In that case, the failure is caused by a single large allocation of more than 4 GB.

It seems that you don’t use such a big buffer.
So we will need to reproduce this internally for a deeper investigation.
Have you found a way to reproduce it?

Thanks.

I’ve been running our code constantly for over a week as a stress test, in two different sessions. No show. I’m still trying to find a good way to replicate it myself.

So I’m pretty sure I know what is happening in our software to produce this. We have a broker service that had been consuming more and more memory by queuing messages on a queue without a high watermark (ZMQ). Because of this, it approached a point where memory allocation in another service (data processing) became… well, difficult. The first time it happened in this state, that service wasn’t allocating a huge amount of memory (maybe doing something with 2048 integer values or comparable), and it threw an NvMapReserveOp error. The next few times it happened, over the following several minutes, it was because the service was allocating (relatively) a lot of memory (~17 MB) at once. Once the OS kicked in and the OOM killer sent the broker service a SIGKILL, services restarted and things became all peachy again.

It may take a little while for me to put together some code that replicates this behavior, but I don’t believe that any of this is an NVIDIA bug. It was unexpected to get the NvMapReserveOp error because previous cases suggested it took a huge allocation, but it seems like the error simply pops up whenever memory resources are (unexpectedly?) unavailable.

(I also see that somehow the above account ‘orrinjelo1’ got created. That’s me as well.)