CUDA 4.0 cudaHostAlloc

Hi folks,

here’s my problem:

tracedata = new char[(size_t)NumberOfTraces * NUMOFPOINTS]; // (char*)malloc(NumberOfTraces*NUMOFPOINTS*100);

if ((error = cudaHostAlloc((void**)&tracedata, (size_t)NumberOfTraces * NUMOFPOINTS,
                           cudaHostAllocPortable | cudaHostAllocWriteCombined)) != cudaSuccess)
{
	matrixsize = (NumberOfTraces / 1024) * (NUMOFPOINTS / 1024);

	printf("Error 2: Cannot reserve the approx. %.0f MB of memory\n", (double)matrixsize);
	printf("when trying to reserve memory for %lu traces\n", (unsigned long)NumberOfTraces);
	printf("with %lu data points. Reduce your data size?\n", (unsigned long)NUMOFPOINTS);
	printf("Cuda says: %s\n", cudaGetErrorString(error));

	MessageBox(0, "...just to read the outputs", "Pause here...", 0);
	exit(2); // Error 2: Out of Mem
}


generates an “out of memory” error.

The system has 48 GB of host RAM (around 30GB required for this array). Rebooting didn’t help (my first idea was heavy memory fragmentation).

OS is Win7 x64, CUDA 4.0 is used.

It just seems as if Cuda doesn’t allow allocation of such a large array using cudaHostAlloc.

Any ideas what to try?

My plan was

  1. “Load all data to Hostmem”

  2. Then distribute smaller chunks to 4 Teslas and all CPUs

in other words: get rid of HDD accesses to reload stuff.

But step 1 requires a large array, and step 2 requires pinned memory for reasonable speed.
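For what it’s worth, the two-step plan could be sketched roughly like this: keep the huge array pageable, and use a modest pinned staging buffer per GPU for the async transfers. This is only a hedged sketch under made-up assumptions — `NUM_GPUS`, `CHUNK_SIZE`, and the one-chunk-per-GPU split are placeholders, not values from the original code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical sizes -- placeholders, not the poster's actual values.
#define NUM_GPUS   4
#define CHUNK_SIZE ((size_t)256 * 1024 * 1024)  // 256 MB per transfer

int main(void)
{
    // Step 1: keep the bulk of the data in ordinary pageable host memory,
    // avoiding the need to pin the whole ~30 GB array.
    char *alldata = (char*)malloc((size_t)NUM_GPUS * CHUNK_SIZE);
    if (!alldata) { fprintf(stderr, "host malloc failed\n"); return 1; }

    // Step 2: a modest pinned staging buffer per GPU, so cudaMemcpyAsync
    // can still run at full speed and overlap with CPU work.
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);

        char *staging, *d_buf;
        cudaHostAlloc((void**)&staging, CHUNK_SIZE, cudaHostAllocPortable);
        cudaMalloc((void**)&d_buf, CHUNK_SIZE);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copy this GPU's chunk into the pinned buffer, then ship it async.
        memcpy(staging, alldata + (size_t)dev * CHUNK_SIZE, CHUNK_SIZE);
        cudaMemcpyAsync(d_buf, staging, CHUNK_SIZE,
                        cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(staging);
    }
    free(alldata);
    return 0;
}
```

In a real pipeline the `memcpy` into the staging buffer and the `cudaMemcpyAsync` out of it would be double-buffered so they overlap; this sketch just shows the basic shape.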



Ok, by now I realized that searching for “pinned memory” restrictions is much more informative than searching for “large array” + CUDA buzzwords.

So I have to reformulate my question:
What can I do to make this as fast as possible?
I have 4 6GB Teslas to fill.

But (if I understood correctly) without pinned memory I can neither copy asynchronously nor transfer at full 16x speed.
Any pain relief available? Are there ways to increase the amount of pinned memory available?

Are you using the TCC driver?

Yes. At least I think so. Maybe =)

I’m using the devdriver_4.0_winvista-win7_64_270.32_general.exe published with CUDA4 RC.

Furthermore I’ve set AdapterType to 2 for all 4 Teslas.

I’m not 100% sure if this already counts as “TCC driver” or if it is just another “close, but no banana” setting.

I’m using the CUDA4RC driver, as I’m not sure about compatibility of other drivers (from the normal website) with CUDA4.

hm, what if you don’t allocate as write combined? (generally you should never be doing that anyway)

I will test that and report back. I’m not sure whether I manage to test this over the weekend.

Do you mean I should never use WC, or that I should never use memory that is not WC?

I decided for WC as I will fill the MEM by CPUs only a few times and then will consume by GPUs only.
So it’s more or less an input buffer.
Honestly I’m not sure whether to make it mapped and write combined or to make it non-WC only pinned.
Option 3 would be “to use CUDA 4.0 interface” and register it and so on. But some intuition tells me that the hand-tuned options should be better suited for the actual expected access patterns than the “universal” choice implemented using some hidden mojo.
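For reference, the “CUDA 4.0 interface” option mentioned above would presumably be `cudaHostRegister`: allocate normally, then pin the pages afterwards. A minimal sketch, assuming the portable flag is wanted for all four Teslas (the 1 GB size is just an example, not from the thread):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    // "Option 3" sketch: allocate with plain malloc, then pin the pages
    // afterwards via the CUDA 4.0 cudaHostRegister API.
    size_t bytes = (size_t)1 << 30;  // 1 GB, arbitrary example size
    char *buf = (char*)malloc(bytes);
    if (!buf) return 1;

    // cudaHostRegisterPortable makes the pinned region usable from all
    // CUDA contexts, analogous to cudaHostAllocPortable.
    cudaError_t err = cudaHostRegister(buf, bytes, cudaHostRegisterPortable);
    if (err != cudaSuccess) { free(buf); return 2; }

    /* ... buf now behaves as pinned memory for cudaMemcpyAsync ... */

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```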

Thanks for your help TM,

with non-WC memory I don’t get the “out of mem” error anymore and it works fine.
So for now I will go this way.
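Concretely, that means dropping the write-combined flag from the allocation in the first post, roughly like this (same `tracedata`, `NumberOfTraces`, and `NUMOFPOINTS` as in the original snippet):

```cuda
// Same allocation as before, but without cudaHostAllocWriteCombined --
// this is the variant that succeeded for the large array.
cudaError_t error = cudaHostAlloc((void**)&tracedata,
                                  (size_t)NumberOfTraces * NUMOFPOINTS,
                                  cudaHostAllocPortable);
```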


I’m facing the same problem as you. I want to allocate pinned memory as 4 × 512 MB so that I can use functions like cudaMemcpyAsync, but the fourth cudaHostAlloc call fails.

I’ve searched a lot on Google but still can’t find a solution. Is it a bug in NVIDIA’s CUDA? Have you fixed it?

Sorry, I hadn’t seen your reply earlier:
Yes, for me everything was solved once I removed the evil write combining.

I eventually found out that my GPU (Quadro FX 4800) can allocate at most about 1.5 GB of pinned memory; allocating more returns a CUDA error.