I ran some performance tests in which I allocated host memory (for my input data) with “cudaMallocHost()” (PINNED mode) and “malloc()” (PAGED mode). I tested different memory sizes (1, 50, 100, 1000 and 10000 bytes), and it seems that the first time a program allocates memory with “cudaMallocHost()”, the call takes around 90 milliseconds.
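For reference, a minimal sketch of that kind of comparison could look like the following (the sizes and the std::chrono-based timing are illustrative, not the exact benchmark used above):

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time a single pinned (page-locked) host allocation of `bytes` bytes.
static double time_pinned_alloc(size_t bytes)
{
    void* p = nullptr;
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMallocHost(&p, bytes);          // PINNED mode
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaFreeHost(p);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Time a single ordinary pageable host allocation of `bytes` bytes.
static double time_paged_alloc(size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    void* p = malloc(bytes);            // PAGED mode
    auto t1 = std::chrono::high_resolution_clock::now();
    free(p);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t sizes[] = {1, 50, 100, 1000, 10000};
    for (size_t s : sizes)
        printf("%6zu bytes: pinned %.3f ms, paged %.3f ms\n",
               s, time_pinned_alloc(s), time_paged_alloc(s));
    return 0;
}
```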
This could be due to the startup overhead of the first use of the graphics card. If you first call a ‘dummy’ kernel and only then start your timing, the results will probably be better.
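For example, a warm-up along these lines (the kernel name is just illustrative; calling any CUDA function first has the same effect) pushes the one-time startup cost outside the timed region:

```cpp
#include <cuda_runtime.h>

// Empty kernel used only to force context creation before timing starts.
__global__ void warmup_kernel() {}

void warm_up_device()
{
    warmup_kernel<<<1, 1>>>();   // first CUDA call triggers initialization
    cudaDeviceSynchronize();     // wait until that initialization is done
}
```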
This is expected: pinning memory means the operating system may no longer relocate or page out those pages, so it has to do quite a bit of reorganization to guarantee that. This almost certainly requires locking to avoid race conditions, which makes it expensive. Subsequent allocations are faster because the OS probably took the opportunity to reserve more memory the first time the process requested some.
Also, the documentation states that CUDA is initialized the first time a CUDA function is called. So if “cudaMallocHost()” is the first CUDA function you call, you are probably including the CUDA initialization time in its first-call timing.