My application needs a lot of constant memory, more than the available 64KB. Therefore I'd like to change the contents of constant memory between kernel calls. This should happen as fast as possible, so I want to avoid copying data back and forth between host and device. Is it possible to copy all the data - say 256KB - to global memory during startup, and then later designate a certain region of global memory as my constant memory? Or is copying from host to device the only way to change constant memory between kernel calls? (I'd rather not use texture memory, for several reasons.)
You can use the cudaMemcpyToSymbol function with a cudaMemcpyDeviceToDevice argument. This should be fast, although I haven’t timed it. (Let me know if you do.)
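A minimal sketch of that approach (the symbol name `c_data`, the bank size, and the staging layout are my own assumptions, not from the original post): stage all the data in global memory once at startup, then refresh the `__constant__` symbol between kernel calls with a device-to-device `cudaMemcpyToSymbol`.

```cuda
#include <cuda_runtime.h>

// 64 KB of constant memory, viewed here as 16K floats (assumed layout).
__constant__ float c_data[16384];

// Staging buffer in global memory, filled once at startup
// (e.g. 256 KB = 4 banks of 64 KB each).
float *d_staging;

void init(const float *h_all, size_t totalBytes)
{
    cudaMalloc(&d_staging, totalBytes);
    cudaMemcpy(d_staging, h_all, totalBytes, cudaMemcpyHostToDevice);
}

// Between kernel calls: swap bank i into constant memory
// without touching the host.
void selectBank(int i)
{
    cudaMemcpyToSymbol(c_data,
                       d_staging + i * 16384,  // source is device memory
                       sizeof(c_data),
                       0,                      // offset into the symbol
                       cudaMemcpyDeviceToDevice);
}
```

Error checking on the CUDA calls is omitted for brevity; in real code you'd want to check each return value.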
Thanks for the hint. I did a little benchmark with both copy kinds. cudaMemcpyDeviceToDevice is indeed a lot faster, as expected. For the test I copied a random 64 KB array for 10000 iterations. Here are my results on a GTX 280 with an Intel QuadCore Q9450 2.66GHz running Ubuntu 8.04:
HOST TO DEVICE COPY (cudaMemcpy)
Total time : 544.484009 msec
Average: 0.054448 msec for one copy
HOST TO DEVICE COPY (cudaMemcpyToSymbol)
Total time : 551.583008 msec
Average: 0.055158 msec for one copy
DEVICE TO DEVICE COPY (cudaMemcpyToSymbol)
Total time : 60.993999 msec
Average: 0.006099 msec for one copy
Conclusion: cudaMemcpyDeviceToDevice is roughly 9 times faster.
As a reference, the Host to Device test from the SDK bandwidthTest reaches 1864.5 MB/s on my setup.
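For reference, the timing loop I used looks roughly like this (a sketch, using CUDA events for timing; the symbol name and sizes are placeholders matching the 64 KB / 10000-iteration setup above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_data[16384];   // 64 KB of constant memory

int main()
{
    const int iters = 10000;

    // Device-side source buffer for the device-to-device case.
    float *d_src;
    cudaMalloc(&d_src, sizeof(c_data));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyToSymbol(c_data, d_src, sizeof(c_data), 0,
                           cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Total time : %f msec\n", ms);
    printf("Average: %f msec for one copy\n", ms / iters);
    return 0;
}
```

The host-to-device variants are the same loop with a host buffer as source and cudaMemcpy / cudaMemcpyToSymbol with cudaMemcpyHostToDevice.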