I’m trying to use

CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_A, h_A, mem_size_A) );

to load data from host to constant memory. But the time it takes is much larger than when I use

CUDA_SAFE_CALL( cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );

to load the same data from host to device.

Am I doing anything wrong here, or is it supposed to be this way?
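For what it's worth, here is a minimal sketch of how I would time the two copies back to back with CUDA events, with a warm-up call first so context creation doesn't get charged to the first copy (names like c_A and time_copies are made up for illustration; assumes mem_size_A fits in the 64 KB constant space):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_A[16384];   // 64 KB of constant memory

void time_copies(const float* h_A, float* d_A, size_t mem_size_A)
{
    cudaFree(0);                 // warm-up: force context creation before timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    cudaEventRecord(start);
    cudaMemcpyToSymbol(c_A, h_A, mem_size_A);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy to constant: %f ms\n", ms);

    cudaEventRecord(start);
    cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy to global:   %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```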


If you manage to get the memory address of the constant, you can just use the second form. But last time I tried, that wasn't possible with the CUDA runtime, only with the driver API directly :/
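Actually, the runtime API does expose cudaGetSymbolAddress, which returns the device address of a __constant__ variable; with that address an ordinary cudaMemcpy works. A sketch (the variable and function names here are made up):

```cuda
#include <cuda_runtime.h>

__constant__ float const_d_a[256];

// Fetch the device address of the constant, then copy to it directly.
void upload_via_address(const float* h_a, size_t bytes)
{
    void* dev_ptr = NULL;
    cudaGetSymbolAddress(&dev_ptr, const_d_a);
    cudaMemcpy(dev_ptr, h_a, bytes, cudaMemcpyHostToDevice);
}
```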

My main question is: Is writing to constant memory slower than writing to global memory?

What do you mean by slower?

It could be that the memCpyToSymbol is the first CUDA call in your program, so the card still needs to be initialized; context creation can cause a large slowdown. I'm using cudaMemcpyToSymbol like this:

CUDA_SAFE_CALL(cudaMemcpyToSymbol("const_d_a", h_a, sizeof(h_a)));

Don't pin me down on it, I'm writing this from memory.
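Side note for anyone finding this later: the string form of the symbol argument shown above was deprecated and later removed (in CUDA 5.0); with newer runtimes you pass the __constant__ variable itself. A sketch, assuming a made-up array name:

```cuda
#include <cuda_runtime.h>

__constant__ float const_d_a[256];

void upload(const float* h_a)
{
    // Pass the variable, not the string "const_d_a".
    cudaMemcpyToSymbol(const_d_a, h_a, sizeof(const_d_a));
}
```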

I know accessing global memory is a little slower than constant memory.

But I don’t know about initializing.

It should be the same speed…

Constant memory is limited to 64 KB, so I don't see how a slower memcpy could affect overall performance. And I don't see why copying to constant space should be slower than copying to global space: constant memory is just global memory with some caching.
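The 64 KB figure can be confirmed at runtime by querying the device properties (totalConstMem); a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Reports 65536 (64 KB) on most devices.
    printf("constant memory: %zu bytes\n", (size_t)prop.totalConstMem);
    return 0;
}
```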