Slow local memory, feigned constant memory. coalesced? global?

Dear Cygnus X1,

Thank you for your helpful comments.

I will follow your path.

============================

  1. They are serial, and they do almost the same work because this is a Monte Carlo implementation. But the code I posted here is not the whole program; there are some other calculations going on as well.

============================

  1. That makes sense! Same as const in C++. So in order to place the data in constant memory I declare
__constant__ double constant_a[16][16];
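As far as I understand, the declaration alone only reserves the space, and the host still has to fill it. Here is a minimal sketch of how that might look with cudaMemcpyToSymbol; the host array h_a and the kernel row_sums are hypothetical names made up for this illustration, not part of my program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ double constant_a[16][16];

// Hypothetical kernel: each thread sums one row of the constant matrix.
__global__ void row_sums(double* out)
{
    int row = threadIdx.x;
    if (row < 16) {
        double s = 0.0;
        for (int j = 0; j < 16; ++j)
            s += constant_a[row][j];   // read-only access to constant memory
        out[row] = s;
    }
}

int main()
{
    // Hypothetical host data, filled with simple test values.
    double h_a[16][16];
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            h_a[i][j] = i + j;

    // Copy the host array into the __constant__ symbol.
    cudaMemcpyToSymbol(constant_a, h_a, sizeof(h_a));

    double* d_out;
    cudaMalloc((void**)&d_out, 16 * sizeof(double));
    row_sums<<<1, 16>>>(d_out);

    double h_out[16];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("row 0 sum = %f\n", h_out[0]);
    return 0;
}
```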

============================

  1. This was taken from the NVIDIA CUDA Programming Guide, where they do
float* d_A;
cudaMalloc((void**)&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

and it works fine for me. Sure, if the data is in constant memory, we first have to get its device address. Thank you for mentioning that.
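If I read this correctly, getting the address of a constant symbol would go through cudaGetSymbolAddress, after which the plain cudaMemcpy from the guide can be used. A short sketch under that assumption; h_a and d_const_ptr are hypothetical names chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ double constant_a[16][16];

int main()
{
    // Hypothetical host data.
    double h_a[16][16];
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            h_a[i][j] = 1.0;

    // Ask the runtime for the device address of the constant symbol.
    void* d_const_ptr = NULL;
    cudaError_t err = cudaGetSymbolAddress(&d_const_ptr, constant_a);
    if (err != cudaSuccess) {
        printf("cudaGetSymbolAddress failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the address in hand, an ordinary cudaMemcpy fills constant memory.
    err = cudaMemcpy(d_const_ptr, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("constant_a filled\n");
    return 0;
}
```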

It’s the first time I have heard that.

============================

Sure, it is in local memory :) I have no doubt about that, also because the indexing is not static and the loops are not unrolled.

Yes, all I do is copy the data from constant memory space to local memory, and then I do some operations on that local data (solving the system, to be exact; the function is given in my second post here).
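To make the pattern concrete, here is a stripped-down sketch of what I mean; it is not my actual solver. The kernel per_thread_copy and the pivot parameter are hypothetical and only stand in for the real run-time indexing. Compiling with nvcc --ptxas-options=-v should show whether the per-thread array really ends up in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 16

__constant__ double constant_a[N][N];

// Hypothetical kernel: each thread takes a private copy of the matrix,
// modifies one row chosen at run time, and writes one result back.
__global__ void per_thread_copy(double* out, int pivot)
{
    double a[N][N];   // per-thread array; likely placed in local memory

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = constant_a[i][j];

    // pivot is only known at run time, so this indexing is not static.
    for (int j = 0; j < N; ++j)
        a[pivot][j] *= 2.0;

    double s = 0.0;
    for (int j = 0; j < N; ++j)
        s += a[pivot][j];

    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}

int main()
{
    double h_a[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            h_a[i][j] = 1.0;
    cudaMemcpyToSymbol(constant_a, h_a, sizeof(h_a));

    const int threads = 32;
    double* d_out;
    cudaMalloc((void**)&d_out, threads * sizeof(double));

    per_thread_copy<<<1, threads>>>(d_out, 3);   // pivot must be < N

    double h_out[threads];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("thread 0 result = %f\n", h_out[0]);
    return 0;
}
```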