CUDA 4.0 cudaHostAlloc

MKasper · March 31, 2011, 6:36pm

Hi folks,

here’s my problem:

tracedata= new char[(size_t)NumberOfTraces*NUMOFPOINTS];//(char*)malloc(NumberOfTraces*NUMOFPOINTS*100);

Works!

if((error=cudaHostAlloc((void**) &tracedata, (size_t) NumberOfTraces * NUMOFPOINTS, cudaHostAllocPortable|cudaHostAllocWriteCombined))!=cudaSuccess)
{
	// Compute the size in floating point; integer division by 1024 would truncate each factor first.
	matrixsize=((double) NumberOfTraces * NUMOFPOINTS) / (1024.0 * 1024.0);
	printf("Error 2: Cannot reserve the approx. %.0f MB of memory\n", matrixsize);
	printf("when trying to reserve memory for %lu traces\n", (unsigned long) NumberOfTraces);
	printf("with %lu data points. Reduce your data size?\n", (unsigned long) NUMOFPOINTS);
	printf("Cuda says: %s\n",cudaGetErrorString(error));
	cudaThreadExit();
	MessageBox(0,"...just to read the outputs","Pause here...",0);
	exit(2); //Error 2: Out of Mem
}

generates an “out of memory” error.

The system has 48 GB of host RAM (around 30GB required for this array). Rebooting didn’t help (my first idea was heavy memory fragmentation).
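
For a request of this size, it can help to compute the byte count once, check it for overflow, and report it in floating point. The following is only a minimal sketch: the helper names traceBufferBytes and bytesToMiB are hypothetical and not part of the original program, and one byte per data point is assumed, as in the char array above.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: bytes needed for numTraces * numPoints one-byte samples.
// Returns false if the product does not fit into size_t.
bool traceBufferBytes(std::uint64_t numTraces, std::uint64_t numPoints,
                      std::size_t* outBytes)
{
    const std::uint64_t maxBytes = std::numeric_limits<std::size_t>::max();
    if (numPoints != 0 && numTraces > maxBytes / numPoints) {
        return false;  // the multiplication would overflow
    }
    *outBytes = static_cast<std::size_t>(numTraces * numPoints);
    return true;
}

// Hypothetical helper: converts a byte count to MiB without integer truncation.
double bytesToMiB(std::size_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With 1500 traces of 1500 points each, bytesToMiB reports about 2.15 MiB, whereas dividing each factor by 1024 in integer arithmetic first would report 1.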

OS is Win7 x64, CUDA 4.0 is used.

It just seems as if CUDA doesn’t allow allocating such a large array with cudaHostAlloc.

Any ideas what to try?

My plan was:

1. Load all data into host memory.
2. Distribute smaller chunks to the 4 Teslas and all CPUs.

In other words: get rid of HDD accesses to reload stuff.

But step 1 requires a large array, and step 2 requires pinned memory for reasonable speed.

Thx,

Markus
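
One possible way to reconcile the two steps is to keep the large array in ordinary pageable memory and move it to the GPUs in smaller chunks, each of which could pass through a much smaller pinned staging buffer. The sketch below only plans the chunks; it is host-only C++ and makes no CUDA calls. The names Chunk and planChunks and the round-robin device assignment are assumptions made for illustration, not something from the original program.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one piece of the large host array.
struct Chunk {
    std::size_t offset;  // byte offset into the large host array
    std::size_t bytes;   // length of this chunk in bytes
    int device;          // index of the GPU that would receive the chunk
};

// Splits totalBytes into chunks of at most chunkBytes and assigns them
// to numDevices GPUs in round-robin order.
std::vector<Chunk> planChunks(std::size_t totalBytes, std::size_t chunkBytes,
                              int numDevices)
{
    std::vector<Chunk> plan;
    if (chunkBytes == 0 || numDevices <= 0) {
        return plan;  // nothing sensible to plan
    }
    int device = 0;
    std::size_t offset = 0;
    while (offset < totalBytes) {
        // The last chunk may be shorter than chunkBytes.
        const std::size_t bytes = std::min(chunkBytes, totalBytes - offset);
        plan.push_back({offset, bytes, device});
        offset += bytes;
        device = (device + 1) % numDevices;
    }
    return plan;
}
```

Each planned chunk would then be copied into a pinned staging buffer and sent on with cudaMemcpyAsync; that part depends on the CUDA runtime and is left out here.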

MKasper · April 1, 2011, 7:33am

Ok, by now I have realized that searching for “pinned memory” restrictions is much more informative than searching for “large array” + CUDA buzzwords.

So I have to reformulate my question:
What can I do to make this as fast as possible?
I have four 6 GB Teslas to fill.

But (if I understood correctly) without pinned memory I can neither copy asynchronously nor transfer at full PCIe x16 speed.
Any pain relief available? Are there ways to extend the amount of pinned memory available?
Markus

tmurray · April 1, 2011, 4:12pm

Are you using the TCC driver?

MKasper · April 1, 2011, 4:24pm

Yes. At least I think so. Maybe =)

I’m using the devdriver_4.0_winvista-win7_64_270.32_general.exe published with CUDA4 RC.

Furthermore I’ve set AdapterType to 2 for all 4 Teslas.

I’m not 100% sure if this already counts as “TCC driver” or if it is just another “close, but no banana” setting.

I’m using the CUDA4RC driver, as I’m not sure about compatibility of other drivers (from the normal website) with CUDA4.

tmurray · April 1, 2011, 5:52pm

hm, what if you don’t allocate as write combined? (generally you should never be doing that anyway)

MKasper · April 2, 2011, 9:23am

I will test that and report back. I’m not sure whether I’ll manage to test this over the weekend.

Do you mean I should never use WC, or that I should never use memory that is not WC?

I chose WC because the CPUs will fill the memory only a few times, and after that only the GPUs will read it.
So it’s more or less an input buffer.
Honestly, I’m not sure whether to make it mapped and write-combined, or pinned only without WC.
Option 3 would be to use the CUDA 4.0 interface and register the memory, and so on. But some intuition tells me that the hand-tuned options should suit the expected access patterns better than the “universal” choice implemented with some hidden mojo.

Thanks for your help TM,
markus

MKasper · April 4, 2011, 6:56am

Hi,
with non-WC memory I no longer get the “out of memory” error, and it works fine.
So for now I will go this way.
Thx.

Markus
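
If write combining is still wanted where it happens to work, one option is to request it first and retry without it on failure. The following is a rough sketch under stated assumptions: the TryAlloc callback is hypothetical and stands in for a wrapper around cudaHostAlloc that keeps the resulting pointer, and the two constants mirror the values of cudaHostAllocPortable and cudaHostAllocWriteCombined in the CUDA runtime headers so that the sketch compiles without CUDA.

```cpp
#include <cstddef>
#include <functional>

// Assumed to match the CUDA runtime header values.
constexpr unsigned kHostAllocPortable = 0x01;       // cudaHostAllocPortable
constexpr unsigned kHostAllocWriteCombined = 0x04;  // cudaHostAllocWriteCombined

// Hypothetical callback: returns true if a pinned allocation of `bytes`
// with the given flags succeeds (the callback keeps the pointer itself).
using TryAlloc = std::function<bool(std::size_t bytes, unsigned flags)>;

// Tries the requested flags first. If that fails and write combining was
// requested, retries once without it. On success, stores the flags that
// worked in *usedFlags and returns true.
bool allocPreferringWC(const TryAlloc& tryAlloc, std::size_t bytes,
                       unsigned flags, unsigned* usedFlags)
{
    if (tryAlloc(bytes, flags)) {
        *usedFlags = flags;
        return true;
    }
    if (flags & kHostAllocWriteCombined) {
        const unsigned plain = flags & ~kHostAllocWriteCombined;
        if (tryAlloc(bytes, plain)) {
            *usedFlags = plain;
            return true;
        }
    }
    return false;
}
```

Whether the fallback is worth the extra code depends on how the buffer is used; in this thread, plain pinned memory turned out to be sufficient.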

RobinTang · April 25, 2011, 2:28pm

I’m facing the same problem as you. I want to allocate four pinned buffers of 512 MB each so that I can use functions like cudaMemcpyAsync. But cudaHostAlloc fails on the 4th buffer.

I’ve searched a lot on Google, but still can’t find a solution. Is this a bug in NVIDIA’s CUDA? Have you fixed it?

MKasper · May 19, 2011, 12:06pm

Sorry, I didn’t see your reply earlier:
Yes, for me everything was solved once I removed the evil write combining.

RobinTang · June 12, 2011, 12:21pm

I eventually found out that my GPU (Quadro FX 4800) can allocate about 1.5 GB of pinned memory at most. Allocating more pinned memory returns a CUDA error.
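
A limit like this can be found by probing rather than by trial and error. The sketch below is a host-only binary search over a hypothetical tryAlloc callback. In a real program the callback might call cudaHostAlloc and, on success, release the memory again with cudaFreeHost. The name largestPinnedSize is an assumption for illustration, and the search assumes that if a request of some size fails, every larger request fails too.

```cpp
#include <cstddef>
#include <functional>

// Returns the largest multiple of `step` (at most maxBytes) for which
// tryAlloc reports success. Assumes success is monotonic in the size.
std::size_t largestPinnedSize(const std::function<bool(std::size_t)>& tryAlloc,
                              std::size_t maxBytes, std::size_t step)
{
    if (step == 0) {
        return 0;
    }
    std::size_t lo = 0;                // largest step count known to succeed
    std::size_t hi = maxBytes / step;  // largest step count worth trying
    while (lo < hi) {
        // Round up so the loop always makes progress.
        const std::size_t mid = lo + (hi - lo + 1) / 2;
        if (tryAlloc(mid * step)) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return lo * step;
}
```

With a stand-in callback that accepts up to 1536 MiB, probing up to 4096 MiB in 64 MiB steps returns 1536 MiB after a handful of attempts.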
