Memory allocation problems in CUDA 3.0 final vs. beta

I’m using CUDPP 1.1 to radix-sort integer key-value arrays of size 1,000,000 to 10,000,000.

In order to generate the keys I need additional CUDA memory of approximately 4 times the size of the key-value arrays and must map some D3D10 buffers.

With CUDA 3.0 final I get a hard crash in the cudppPlan(…) call (no return value, since debugging stops immediately) when I use more than 6,000,000 elements. I assume this is related to the amount of allocated memory, because it doesn’t happen for smaller data sizes, or when some of the CUDA allocations before the cudppPlan(…) call are skipped.

With CUDA 3.0 beta the code runs just fine.

Has anybody else experienced similar problems with the final release?

While this may be a CUDA bug rather than a CUDPP bug, it would be good to make sure. We are getting a CUDPP 1.1.1 release ready to support CUDA 3.0 and Fermi. Could you do us a favor and try it out? Since the release is not finished yet, you’ll need to get 1.1.1 from SVN at …/branches/1.1.1

You can use the command

svn checkout cudpp-1.1.1

to get it.

If the problem still exists with the newer CUDPP, can you please provide a simple repro app?


I’ve tried cudpp-1.1.1. Same problem with CUDA 3.0 final (but works with 3.0 beta).

I will try to get a repro app running.

PS: It would be great if you could keep the VS settings for x64 up to date (includes, libs, code generation, files excluded from the build).

No repro app yet, but perhaps the following could give a hint:

I’ve tried the final drivers and final SDK with the beta toolkit (cudatoolkit_3.0-beta1_win_64.exe). This works with the smaller datasets (those cause no problems with any version), but crashes for the big dataset. Interestingly, it crashes at a different position: instead of during initialization (cudppPlan), it crashes later at runtime in a cudaMemcpy.

Let’s see if I can reproduce this with a smaller app…

There were some stability changes to memory allocation under WDDM that result in being able to allocate slightly less memory in 3.0 final versus beta, so if you’re not checking for errors somewhere, that is likely the cause.
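In case it helps, the usual pattern for catching exactly this kind of silent failure is to wrap every runtime call in an error check. A minimal sketch (the CHECK_CUDA macro name is my own, not part of CUDA or CUDPP):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro -- a common idiom, not a library API:
// abort with file/line info if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",          \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main()
{
    // With slightly less memory available under WDDM in 3.0 final,
    // a large allocation that succeeded with the beta may now fail;
    // without a check, the bad pointer is only noticed later
    // (e.g. inside cudppPlan or in a cudaMemcpy).
    int*   d_keys = 0;
    size_t n      = 10 * 1000 * 1000;  // 10M elements
    CHECK_CUDA(cudaMalloc((void**)&d_keys, n * sizeof(int)));

    CHECK_CUDA(cudaMemset(d_keys, 0, n * sizeof(int)));
    CHECK_CUDA(cudaFree(d_keys));
    return 0;
}
```

That would turn the hard crash at 6M+ elements into an immediate, descriptive error at the first failing allocation.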


Sounds interesting. Could you give me details on this?

I wasn’t able to build a small repro app, but I’ve managed to get the full app running with a debug build of CUDPP 1.1.1, and it indeed crashes at a cudaMalloc with error code cudaErrorMemoryAllocation.

I’ve got it running with CUDA 3.0 final by allocating less GPU memory via CUDA (now about 500 MB, which negatively affects streaming in my app).

With ~700 MB of memory allocations (3 × 200 MB plus some small stuff) it still crashes with cudaErrorMemoryAllocation, which is rather frustrating on a Quadro FX 5800 with 4 GB.
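One thing that might help narrow this down is to query how much memory the runtime actually considers free right before the failing allocation; under WDDM the OS reserves part of the board, and fragmentation can make one large contiguous allocation fail even when the total free amount looks sufficient. A sketch, assuming cudaMemGetInfo is available in your toolkit version (otherwise the driver API’s cuMemGetInfo reports the same numbers):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print free/total device memory as the runtime sees it. Useful
// right before a large cudaMalloc to check whether the request
// even fits, or whether WDDM has reserved the rest of the board.
static void printMemInfo(const char* tag)
{
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess) {
        printf("%s: %zu MB free of %zu MB total\n",
               tag, freeBytes >> 20, totalBytes >> 20);
    }
}

int main()
{
    printMemInfo("before");

    // Try the ~200 MB chunks mentioned above; stop at the first failure.
    void*  bufs[3] = { 0, 0, 0 };
    size_t chunk   = (size_t)200 << 20;  // 200 MB
    for (int i = 0; i < 3; ++i) {
        cudaError_t err = cudaMalloc(&bufs[i], chunk);
        printf("chunk %d: %s\n", i, cudaGetErrorString(err));
        if (err != cudaSuccess)
            break;
        printMemInfo("after chunk");
    }

    for (int i = 0; i < 3; ++i)
        if (bufs[i]) cudaFree(bufs[i]);
    return 0;
}
```

If the reported free memory is far below 4 GB, or drops sharply between chunks, that would point at the WDDM reservation rather than anything in CUDPP.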

It would be great to see some details on how to make use of the entire onboard memory, but thanks for the info you’ve given me already.