Help!! cuMemHostAlloc() keeps rebooting my machine !!

Hello all,

As some of you may recall (most of you won’t), I’ve been working on this rather extensive project that will eventually (hopefully) be able to use and manage many NVidia GPUs, using a proprietary algorithm that took me a little more than 2 years to develop.

The good news is that I’ve finished the coding for this project. Yay!

So now I’m at the debugging stage, and I’ve encountered what can only be described as a rather nasty surprise. Bit of a nightmare, actually.

For reasons that I won’t go into here, I’m using the NVidia Driver API to communicate with any and all of the NVidia GPU(s) in the system, which so far has been working just great.

However

The first time the program calls the NVidia Driver API function cuMemHostAlloc(), it requests 415,248,384 bytes of mapped, pinned Host memory. That call works, and I get the memory.

The second time the program calls cuMemHostAlloc(), however, it requests 1,660,993,536 bytes of mapped, pinned Host memory.

That’s when the unexpected happens. The function call never returns. It just reboots my entire system!

In fact, if I break up the call to cuMemHostAlloc() into four separate calls of 415,248,384 bytes each (4 X 415,248,384 = 1,660,993,536), one of the calls (not the first) will always reboot my system, whether I’m running under the debugger or not! I know at this point that I should be able to tell you which one of the calls reboots the system, but hey, I’m still in shock that any call to an NVidia Driver API function will reboot my system! Besides, I can only discern which call blows up the OS if I run the program under the debugger, and the debugger, as someone pointed out in another memory-related thread, might be aggravating the problem, so that wouldn’t tell me much anyway…
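One way to find out which of the calls triggers the reboot without running under the debugger is to leave a breadcrumb on disk immediately before each call. The following is a minimal sketch and not code from the program described here: `log_step`, the `SYNC_TO_DISK` macro, and the log file name in the usage note are hypothetical. It assumes that forcing the file to disk (via `_commit()` on Windows or `fsync()` elsewhere) is enough for the last line to survive the restart, which is likely but not guaranteed.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#endif

#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <io.h>        /* _commit() */
#define SYNC_TO_DISK(f) _commit(_fileno(f))
#else
#include <unistd.h>    /* fsync() */
#define SYNC_TO_DISK(f) fsync(fileno(f))
#endif

/* Hypothetical helper: appends one line to the log at `path` and asks the
   OS to write it to disk before returning.
   Returns 0 on success, -1 on failure. */
int log_step(const char *path, const char *msg)
{
    FILE *f;
    int ok;

    assert(path != NULL && msg != NULL);

    f = fopen(path, "a");
    if (f == NULL)
        return -1;

    ok = fprintf(f, "%s\n", msg) >= 0;
    ok = (fflush(f) == 0) && ok;        /* C library buffer -> OS cache */
    ok = (SYNC_TO_DISK(f) == 0) && ok;  /* OS cache -> disk             */
    ok = (fclose(f) == 0) && ok;
    return ok ? 0 : -1;
}
```

Calling something like `log_step("alloc_trace.log", "before call 2")` ahead of each cuMemHostAlloc() call would leave the number of the fatal call as the last line of the file after the machine comes back up.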

So now for some of the details. The program is a 32-bit program compiled using Visual C++ 2010 Express, and running under a 64-bit Windows 7 OS. And for the record, it’s compiled with the ‘Enable C++ Exceptions’ set to ‘No’, but since the NVidia Driver API is a ‘C’ language interface, that shouldn’t matter…(right?)

The hardware is a Dell Inspiron N7110 Laptop with eight (8) gigabytes of installed RAM, an Intel 2.4 GHz Dual-Core CPU (four logical processors), and a single NVidia GeForce GT 525M GPU with 1,073,545,216 bytes of Adapter RAM, and two (2) streaming multiprocessors (96 CUDA Cores, I think).

The program is being linked with the /LARGEADDRESSAWARE linker flag, and the Windows 7 Boot Loader’s ‘IncreaseUserVa’ setting, as reported by bcdedit, is properly set to 3072, so the program should be able to allocate up to three (3) gigabytes of Windows memory.

When the latest tests were run, the Windows ‘System Information’ utility reported that there were 5.90 gigabytes of ‘Available Physical Memory’, so the memory was definitely there.
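To put the numbers side by side: the two requests together come to 2,076,241,920 bytes. Against the 3 GB figure above that leaves 1,144,983,552 bytes of user address space, and against the default 2 GB of a 32-bit process without /LARGEADDRESSAWARE only 71,241,728 bytes. The short C sketch below just reproduces that arithmetic; the function and macro names are made up for illustration, and the limits are assumptions about address space only (they ignore everything else the process maps, such as code, DLLs, and heaps). As far as I know, a large-address-aware 32-bit process on 64-bit Windows may get up to 4 GB, so the 3 GB line is the post's figure rather than a hard fact about this machine.

```c
#include <assert.h>
#include <stdint.h>

/* Request sizes quoted in the post. */
#define FIRST_REQUEST_BYTES   UINT64_C(415248384)
#define SECOND_REQUEST_BYTES  UINT64_C(1660993536)

/* Address-space limits used for comparison (binary gigabytes). */
#define GIB                   UINT64_C(1073741824)
#define DEFAULT_USER_VA_BYTES (2 * GIB)  /* 32-bit process without /LARGEADDRESSAWARE */
#define POST_USER_VA_BYTES    (3 * GIB)  /* the 3072 MB figure given in the post      */

/* Checks the split described in the post: four equal pieces make up
   the second request. */
void check_split(void)
{
    assert(4 * FIRST_REQUEST_BYTES == SECOND_REQUEST_BYTES);
}

/* Sum of both pinned-memory requests. */
uint64_t total_pinned_request(void)
{
    return FIRST_REQUEST_BYTES + SECOND_REQUEST_BYTES;
}

/* Bytes of user address space left after both requests,
   or 0 if they do not fit at all. */
uint64_t headroom_after_requests(uint64_t user_va_bytes)
{
    uint64_t total = total_pinned_request();
    return user_va_bytes > total ? user_va_bytes - total : 0;
}
```

Available physical memory (5.90 GB here) is a separate budget; a 32-bit process runs into the address-space limit first.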

Also, more for the record than anything else, the first call to the cuMemHostAlloc() function, which works fine, uses the following flags:

CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP | CU_MEMHOSTALLOC_WRITECOMBINED

The second call, and/or the next four calls, only use these:

CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP

More grist for the mill: On my system, the Windows ‘Startup and Recovery’ option for ‘System failure’ is set to ‘Automatically restart’, so that’s probably what’s happening. But why is the cuMemHostAlloc() function causing a system failure??

Lastly, the version of the NVidia Driver that the program links to is 307.21 (‘File’ version: 8.17.13.721).

So can anyone help? Any ideas? Anything?

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__MEM_g572ca4011bfcb25034888a14d4e035b9.html
“Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.”

My understanding is that cuMemHostAlloc() is a thin wrapper around the underlying native operating system calls. It is not clear whether the behavior you observe is related to anything that CUDA does or does not do, but it would be advisable to file a bug via the registered developer website so the driver team can have a look at this issue.

It is impossible to state how much memory can be pinned safely without degrading OS operation; my best guess is that for a Win32 platform with 3 GB of user memory it is probably a few hundred MB. See also this discussion on Stack Overflow:

http://stackoverflow.com/questions/12439807/pinned-memory-in-cuda
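The “staging areas” wording in the documentation quoted above suggests keeping the pinned buffer small and streaming the data through it in pieces. Here is a minimal sketch of that pattern in plain C, under stated assumptions: `staged_upload`, `upload_fn`, and `upload_to_host_array` are hypothetical names, the staging buffer is ordinary memory rather than something returned by cuMemHostAlloc(), and a `memcpy` into a host array stands in for the host-to-device copy (for example cuMemcpyHtoD()) so that the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: sends n bytes from the staging buffer to
   byte offset dst_offset of the destination. In a real program this is
   where a host-to-device copy such as cuMemcpyHtoD() would go.
   Returns 0 on success. */
typedef int (*upload_fn)(size_t dst_offset, const void *staging,
                         size_t n, void *ctx);

/* Copies `total` bytes from `src` through a fixed-size staging buffer,
   one chunk at a time. Returns 0 on success, or the first non-zero
   value returned by `upload`. */
int staged_upload(const void *src, size_t total,
                  void *staging, size_t staging_size,
                  upload_fn upload, void *ctx)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t done = 0;

    assert(staging_size > 0);

    while (done < total) {
        size_t n = total - done;
        int rc;

        if (n > staging_size)
            n = staging_size;        /* never exceed the staging buffer */

        memcpy(staging, p + done, n);            /* source -> staging  */
        rc = upload(done, staging, n, ctx);      /* staging -> "device" */
        if (rc != 0)
            return rc;
        done += n;
    }
    return 0;
}

/* Stand-in for the device: `ctx` is a plain host array that receives
   the data (assumption made so the sketch needs no GPU). */
int upload_to_host_array(size_t dst_offset, const void *staging,
                         size_t n, void *ctx)
{
    memcpy((unsigned char *)ctx + dst_offset, staging, n);
    return 0;
}
```

With this shape, only `staging_size` bytes ever need to be pinned, however large the source data is; the cost is one extra host-side copy per chunk.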

I think you need to re-evaluate your approach to memory handling. The pinned memory API was not designed (or advertised) to support memory allocations in the range of gigabytes.

Also, it’s very unusual that you reached “code complete” status before realizing that your approach won’t work as is. Did you skip the proof-of-concept phase?

I found a thread on Stack Overflow about some of the limitations of cuMemHostAlloc().

You could try using the TCC (Tesla Compute Cluster) drivers with patched INI files to support your GPU hardware. These drivers handle memory somewhat differently, and some driver restrictions (those applying to the Vista and Windows 7 WDDM graphics driver model) are much more relaxed.

Circumventing code signing restrictions might be a problem on your 64-bit Windows machine. Unfortunately, messing with the INI file of the TCC driver will invalidate the driver signatures.

I found a thread talking about such hacks, guaranteed to be dangerous ;)
https://devtalk.nvidia.com/default/topic/489965/cuda-programming-and-performance/gtx480-to-c2050-hack-or-unlocking-tcc-mode-on-geforce/#entry1210250

Thanks to all for your timely input.

njuffa: I’m not a ‘registered developer’ (yet?), but your suggestion to ‘file a bug’ was, I think, a good one. However, I won’t be doing that anytime soon because I’m not a ‘registered developer’. On the other hand, since my understanding is that you’re reasonably well connected to the NVidia, um, administrative infrastructure (for lack of better phraseology), perhaps you could pass along to the development team there that it would probably behoove their efforts to look into using the ‘LARGEADDRESSAWARE’ linker flag in their next version of the nvcuda32 DLL (if there is to be one at all). Ultimately, the lack of same was what caused the reboot…

cbuchnert: Last I looked, the ‘proof of concept’ stage in any development project does not imply having to test out every function in an API to see whether or not it’s going to reboot the operating system when called… Also, I’m nowhere near ‘realizing that my approach won’t work’. My ‘approach’ is perfectly sound, but is, of necessity, comparatively resource intensive. Your comments, in the aggregate, would seem to strongly imply that you are operating under the assumption that any application which attempts to utilize the gigabytes of memory available on modern microcomputers is, somehow, ‘not well thought out’. Suffice it to say that I do not share that opinion. On the contrary, I would suggest that you revisit your rationale for that opinion. On an obviously less confrontational note, thanks very much for the link - it was indeed highly informative…

The problem has been solved (at least for now), but I had to patch the ‘nvcuda32.dll’ file to do it.

Thanks again to all.

Generally speaking, the forums provide a platform for a community of GPU developers; they are not designed as a support channel.

Occasionally I recommend the filing of bugs when, based on my experience, this appears to be the most advantageous course of action. Becoming a registered developer is not difficult; the process has been streamlined considerably (I signed up as a registered developer myself under both the old and new systems). Approval typically occurs within one business day of a request. To start, visit https://developer.nvidia.com/user

Thanks njuffa.

I’ve submitted my ‘application’ to be a ‘registered developer’, so it’s being ‘processed’ right now…

Honestly though, I don’t see the point of such a demarcation. Can’t NVidia just assume that anyone who uses their development software is a ‘developer’? Just saying…

I apologize if my comments sounded too confrontational. I am happy that you were able to solve this problem!

And indeed, page-locking several gigabytes of physical memory on a 64-bit system will be somewhat less disastrous than on a 32-bit system. Still, the operating system will have much less breathing room afterwards. If you expect your CUDA application to pretty much fully utilize/consume the system while running, then this approach may be acceptable.

Sorry for the inconvenience and thank you for filing the bug.

Well, this is embarrassing for me, but that never stopped me before, so here goes:

It turns out that I was quite wrong to assert that patching the ‘nvcuda32.dll’ file solved the problem. I debugged with the wrong data. When I debugged with the right data, the cuMemHostAlloc() function still caused a reboot.

So I’m actually in the middle of creating a bug report for same, now that I can.

Also, since the last posting in this thread, I’ve tried replacing the cuMemHostAlloc() call with a call to the following function:

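CUresult CUDAAPI XpndedHostAlloc( void **pp, size_t bytesize, unsigned int Flags )
{
    // Reserve and commit ordinary (pageable), uncached memory from Windows first...
    *pp = VirtualAllocEx( GetCurrentProcess(), NULL, bytesize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE | PAGE_NOCACHE );
    if ( *pp != NULL )
         // ...then ask the driver to page-lock that range. Note: if registration
         // fails, the block is not released here; the caller still owns *pp.
         return( (*cuMemHostRegister)( *pp, bytesize, Flags ) );

    // VirtualAllocEx() failed, so there is nothing to register.
    return( CUDA_ERROR_OUT_OF_MEMORY );
}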

However, when I ran the above under the debugger, I found that although the call to the VirtualAllocEx() function succeeded, the call to the cuMemHostRegister() function rebooted the OS…

I also left out a fairly relevant piece of information in my original post (not on purpose though), which I think is highly indicative of the possibility that what’s going on ‘underneath the hood’ so to speak, has a lot to do with the /LARGEADDRESSAWARE linker flag.

When I link my application without the /LARGEADDRESSAWARE linker flag, there is no reboot. However, if I do that, my App is left with about 70 megabytes of ‘missing’ memory, which my App is actually smart enough to try to allocate as an array in the PTX kernel itself. However, when that happens, the PTX compiler refuses to compile the kernel, and the Error Log that it returns is a zero-length string…

So I’m kind of in a ‘damned if I do, and damned if I don’t’ situation right now…

So anyway, y’all have a great day. I’m sure I’ll figure something out eventually…

P.S. njuffa: I don’t know what you mean by my ‘inconvenience’, but in any event, I don’t see that you have any reason to apologize for anything…

P.P.S. cbuchnert: Quite right about the app. It was, and is, designed from the ground up to be run as a dedicated app on a dedicated machine. A machine that will allocate as little as possible to Windows (within reason), and as much as possible to the App itself.