I’m trying to allocate pinned memory from within my main program in C using cudaMallocHost to return a pointer to the allocated memory. The purpose is to allocate all the pinned memory I will need just once at the start of the program in order to minimise allocation overhead.

Inside C:

float* pinned = cudaMallocHostFloat( ARRAY_MEM );

Inside allocation function:

float* cudaMallocHostFloat( const size_t SIZE )
{
	float* tmp = NULL;

	/* SIZE is a byte count; the returned buffer is page-locked host memory */
	cudaMallocHost( (void**) &tmp, SIZE );

	return tmp;
}


However, when I pass this pointer to a memory transfer speed test function, the transfer speed is the same as for pageable memory allocated with malloc. So my question is: can you return pointers to pinned memory, or does the memory become pageable once the pointer has left the scope of the function that called cudaMallocHost?
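One way to test this directly is to time the same host-to-device copy from a pageable buffer and from a pinned buffer that was returned by a helper function. The following is only a sketch: the names alloc_pinned_floats and h2d_bandwidth_gbps are hypothetical, and it assumes a single host thread and a single device.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper: returns a pinned host buffer of `bytes` bytes,
   or NULL if cudaMallocHost reports an error. */
float *alloc_pinned_floats(size_t bytes)
{
    float *p = NULL;
    if (cudaMallocHost((void **)&p, bytes) != cudaSuccess)
        return NULL;
    return p;
}

/* Hypothetical helper: times `reps` host-to-device copies of `bytes` bytes
   from `host` and returns the rate in GB/s, or -1.0 on any error. */
double h2d_bandwidth_gbps(const void *host, size_t bytes, int reps)
{
    void *dev = NULL;
    cudaEvent_t start, stop;
    float ms = 0.0f;
    double result = -1.0;
    int ok = 1;
    int i;

    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return -1.0;
    if (cudaEventCreate(&start) != cudaSuccess) {
        cudaFree(dev);
        return -1.0;
    }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaFree(dev);
        return -1.0;
    }

    cudaEventRecord(start, 0);
    for (i = 0; i < reps; ++i) {
        if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) != cudaSuccess)
            ok = 0;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    /* elapsed time is in milliseconds: bytes / (ms * 1e-3 s) / 1e9 = bytes / (ms * 1e6) */
    if (ok && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess && ms > 0.0f)
        result = ((double)bytes * (double)reps) / ((double)ms * 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return result;
}
```

Calling h2d_bandwidth_gbps once with a buffer from malloc and once with a buffer from alloc_pinned_floats (released afterwards with cudaFreeHost) should show whether the returned pointer still behaves as pinned memory on a given machine.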

Pointers to pinned memory don’t have any function scoping limits. What absolute transfer speed are you getting with pinned memory, and what kind of system are you testing this on? Previous benchmarks have indicated that systems based on the Intel X58 chipset with a triple-channel system memory configuration can transfer pageable memory at very nearly the same speed as pinned memory.

Edit: To check if this is the case, you can run the bandwidthTest application in the CUDA SDK with and without the -memory=pinned flag.

Thank you for the reply.

I have used the bandwidthTest in the past, and iirc the results were around 2.5 GB/s for pageable and 5 GB/s for pinned on my machine (GeForce 210, can’t remember the rest of the specs off the top of my head), so there is definitely a difference. My own timing subroutine gives the same results when I allocate the memory ‘in-situ’, i.e. directly in the code that does the timing.

Which leaves me confused as to why I’m not seeing that when I allocate the pinned memory in a function and return the pointer, which then gets sent to the timing function.

To be honest, I just got around this by re-structuring my code, but it’d still be nice to find out what’s going on…

Are you using the pointer from a different CPU thread than the one that allocated it? By default, pinned memory is only known to the CUDA context that created it. CUDA 4.0 now offers a way to mark memory as pinned for all host threads, so you might be able to fix this.
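If the pointer does cross host threads, one option is cudaHostAlloc with the cudaHostAllocPortable flag, which the runtime documents as making the allocation count as pinned for all CUDA contexts rather than only the one that allocated it. A minimal sketch follows; the helper name alloc_portable_pinned_floats is hypothetical, and whether this explains the slowdown described above is an assumption that would need to be checked.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper: returns a pinned host buffer of `bytes` bytes that is
   treated as pinned by every CUDA context, or NULL if the call fails. */
float *alloc_portable_pinned_floats(size_t bytes)
{
    float *p = NULL;
    if (cudaHostAlloc((void **)&p, bytes, cudaHostAllocPortable) != cudaSuccess)
        return NULL;
    return p;
}
```

As with cudaMallocHost, the buffer is released with cudaFreeHost, and the returned pointer can be passed between functions like any other host pointer.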