How slow is constant memory host-device transfer? The transfer is 25 times slower than my heavy kernel


I’m having an issue with transferring data to constant memory from the host side. The actual transfer works, it’s just dead slow. Here are my rough numbers:

Transfer 128 bytes to constant memory = 25 ms
Transfer 60 KB of data from the host into regular global memory = 1 ms

The weird thing is that the time only shows up when I use regular CPU timers in the code on the host side, but not in the Visual Profiler. The problem hence seems to be a lag on the host side. Has anyone seen similar behavior? The code compiles, runs, and computes the correct answer, and the on-chip access seems as fast as it should be. It’s just that initiating the transfer takes a relatively long time.
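For reference, a minimal sketch of the kind of host-side measurement described above. The names (`coeffs`, the 128-byte size) are assumptions for illustration, not the original benchmark code:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// 32 floats = 128 bytes of constant memory, matching the size in the post.
__constant__ float coeffs[32];

int main() {
    float host[32] = {0};

    // Time the host-to-constant transfer with a plain CPU timer,
    // as opposed to the profiler's device-side timing.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyToSymbol(coeffs, host, sizeof(host));
    auto t1 = std::chrono::high_resolution_clock::now();

    printf("constant copy: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```

A CPU timer like this measures everything on the host side, including any one-time driver or context work hidden inside the first runtime call, which is exactly why it can disagree with the profiler.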

Any comments or insight would be greatly appreciated.

/ Ian


Try calling any CUDA function before the timer starts. For example, you can get the number of CUDA devices; that way you initialize the device. Maybe your first transfer is doing that initialization, and that is why it takes so long.
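A minimal sketch of what that warm-up could look like (any runtime call works; `cudaFree(0)` is a common idiom for forcing lazy context creation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    // Either of these calls forces CUDA context initialization
    // before any timed code runs.
    cudaGetDeviceCount(&deviceCount);
    cudaFree(0);

    printf("CUDA devices found: %d\n", deviceCount);

    // ... start host timers and do the real transfers here ...
    return 0;
}
```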

Please, tell us your results.


Thanks for your advice. Sadly it did not work; there was already some CUDA initialization code. But your suggestion inspired something that did work!

The code now has a write-to-constant initialization function. It consists of writing one float (the value 42, in fact :rolleyes: ) into constant memory. This first write still takes roughly 25 ms, but when the second constant write is done, it goes as fast as it should! As the code is a simple benchmark, the early initialization is no problem for me, though in a “real world” application it would still be an issue.
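A sketch of such a warm-up function, assuming a dedicated dummy constant symbol (`warmupValue` is a made-up name, not from the original code):

```cuda
#include <cuda_runtime.h>

// Small constant-memory symbol used only for the warm-up write.
__constant__ float warmupValue;

// Call once at startup so later cudaMemcpyToSymbol calls are not
// charged the one-time first-transfer cost described in the post.
void warmUpConstantMemory() {
    const float v = 42.0f;  // any value works; 42 as in the post
    cudaMemcpyToSymbol(warmupValue, &v, sizeof(v));
    cudaDeviceSynchronize();  // ensure the transfer has actually finished
}
```

The `cudaDeviceSynchronize` makes sure the warm-up cost is fully paid before any timed code runs, rather than overlapping with it.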

So my problem is fixed! Why this warm-up of constant memory exists is still a mystery, though.

Any new thoughts?

Does anyone have an idea of why this “warm-up” is required? I used to think this would be taken care of in the cudaSetDevice call.