I’m having an issue with transferring data to constant memory from the host side. The transfer itself works, it's just dead slow. Here are my rough numbers:
Transfer 128 bytes to constant memory = 25 ms
Transfer 60 KB of data from the host into regular global memory = 1 ms
The weird thing is that the time only shows up when I use regular CPU timers in the host code, but not in the Visual Profiler. The problem hence seems to be a lag on the host side. Has anyone seen similar behavior? The code compiles, runs, and computes the correct answer, and the on-chip access seems as fast as it should be. It's just that initiating the transfer takes a relatively long time.
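For reference, the measurement looks roughly like the sketch below. This is not my actual code: the symbol `c_params`, the helper names `time_constant_copy` and `time_global_copy`, the `CHECK` macro, and the buffer contents are all made up for illustration, and I'm assuming a host-side `std::chrono` timer wrapped around `cudaMemcpyToSymbol` and `cudaMemcpy`. Each copy is timed a few times in a row so that any one-off cost on the first call can be told apart from the per-call cost.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 128-byte constant buffer (32 floats), standing in for the real data.
__constant__ float c_params[32];

// Illustrative error check: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s\n",                    \
                         cudaGetErrorString(err_));                     \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed on the host since t0.
static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Host-timed copy of `bytes` bytes (at most 128) into constant memory.
double time_constant_copy(const float* host, size_t bytes) {
    Clock::time_point t0 = Clock::now();
    CHECK(cudaMemcpyToSymbol(c_params, host, bytes));
    CHECK(cudaDeviceSynchronize());
    return ms_since(t0);
}

// Host-timed copy of `bytes` bytes into a global-memory buffer.
double time_global_copy(float* dev, const float* host, size_t bytes) {
    Clock::time_point t0 = Clock::now();
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());
    return ms_since(t0);
}

int main() {
    std::vector<float> small_buf(32, 1.0f);                      // 128 bytes
    std::vector<float> large_buf(60 * 1024 / sizeof(float), 2.0f); // 60 KB
    const size_t small_bytes = small_buf.size() * sizeof(float);
    const size_t large_bytes = large_buf.size() * sizeof(float);

    float* d_large = nullptr;
    CHECK(cudaMalloc(&d_large, large_bytes));

    // Repeat each measurement so first-call and later-call times are both visible.
    for (int i = 0; i < 3; ++i) {
        double t_const  = time_constant_copy(small_buf.data(), small_bytes);
        double t_global = time_global_copy(d_large, large_buf.data(), large_bytes);
        std::printf("run %d: constant %.3f ms, global %.3f ms\n",
                    i, t_const, t_global);
    }

    CHECK(cudaFree(d_large));
    return 0;
}
```

This should build with something like `nvcc -o copy_timing copy_timing.cu` (the file name is arbitrary). The numbers it prints will depend on the driver, the device, and the order of the calls.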
Any comments or insight would be greatly appreciated.