How slow is constant memory host-device transfer? The transfer is 25 times slower than my heavy kernel


I’m having an issue with transferring data to constant memory from the host side. The actual transfer works, it’s just dead slow. Here are my rough numbers:

Transfer 128 bytes to constant memory = 25 ms
Transfer 60 KB of data from the host into regular global memory = 1 ms

The weird thing is that the time only shows up when I use regular CPU timers in the code on the host side, but not in the Visual Profiler. The problem hence seems to be a lag on the host side. Has anyone seen similar behavior? The code compiles, runs, and computes the correct answer, and the on-chip access seems as fast as it should be. It’s just that initiating the transfer takes a relatively long time.
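For reference, a minimal sketch of the kind of host-side measurement described above. The names (`coeffs`, the 128-byte size) are assumptions for illustration, not the original benchmark code:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// 32 floats = 128 bytes of constant memory, matching the size in the post.
__constant__ float coeffs[32];

int main() {
    float host[32] = {0};

    // Time the host-to-constant transfer with a plain CPU timer,
    // as opposed to the profiler's device-side timing.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyToSymbol(coeffs, host, sizeof(host));
    auto t1 = std::chrono::high_resolution_clock::now();

    printf("constant copy: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```

A CPU timer like this measures everything on the host side, including any one-time driver or context work hidden inside the first runtime call, which is exactly why it can disagree with the profiler.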

Any comments or insight would be greatly appreciated.

/ Ian


Try calling any CUDA function before the timer starts. For example, you can get the number of CUDA devices; that way you initialize the device. Maybe your first transfer is doing that initialization, and that is why it takes so long.
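A minimal sketch of what that warm-up could look like (any runtime call works; `cudaFree(0)` is a common idiom for forcing lazy context creation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    // Either of these calls forces CUDA context initialization
    // before any timed code runs.
    cudaGetDeviceCount(&deviceCount);
    cudaFree(0);

    printf("CUDA devices found: %d\n", deviceCount);

    // ... start host timers and do the real transfers here ...
    return 0;
}
```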

Please, tell us your results.


Thanks for your advice. Sadly it did not work; there was already some CUDA initialization code. But your suggestion inspired something that did work!

The code now has a write-to-constant initialization function. It consists of writing one float (the value 42, in fact :rolleyes: ) into constant memory. This first write still takes roughly 25 ms, but when the second constant write is done, it goes as fast as it should! As the code is a simple benchmark, the early initialization is no problem for me, though in a “real world” application it would still be an issue.
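A sketch of such a warm-up function, assuming a dedicated dummy constant symbol (`warmupValue` is a made-up name, not from the original code):

```cuda
#include <cuda_runtime.h>

// Small constant-memory symbol used only for the warm-up write.
__constant__ float warmupValue;

// Call once at startup so later cudaMemcpyToSymbol calls are not
// charged the one-time first-transfer cost described in the post.
void warmUpConstantMemory() {
    const float v = 42.0f;  // any value works; 42 as in the post
    cudaMemcpyToSymbol(warmupValue, &v, sizeof(v));
    cudaDeviceSynchronize();  // ensure the transfer has actually finished
}
```

The `cudaDeviceSynchronize` makes sure the warm-up cost is fully paid before any timed code runs, rather than overlapping with it.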

So my problem is fixed! Why this warm-up of constant memory exists is still a mystery, though.

Any new thoughts?

Does anyone have an idea of why this “warm-up” is required? I used to think this would be taken care of in the cudaSetDevice call.