Some things to point out:
All of your code examples are completely serial: every thread performs exactly the same work, which is kind of pointless. Are you aware of that?
__constant__ double* constant_a;
This means that a pointer to double resides in constant memory, but the actual data does not! As a result, your line
volatile_a[i][j] = constant_a[i * 16 + j];
will be compiled to something like this:
offset1 := i * 16 + j;
ptr := constant_a;                 // the pointer is read from constant memory
volatile_a[i][j] := ptr[offset1];  // the data itself is read from global memory!
If you want a whole array to reside in constant memory, declare it like this instead (16 * 16 inferred from your indexing):
__constant__ double constant_a[16 * 16];
error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);
constant_a (as it is declared so far) has a value which resides on the GPU, and thus cannot be read directly on the host side. I am surprised the compiler did not complain about this. Unless you are running in device emulation mode, you will read garbage here, and so you end up copying host_constant_a to some random location in GPU memory.
Similarly, &constant_a is invalid on the host side. Use cudaGetSymbolAddress to obtain the device address of a __device__ or __constant__ variable. Alternatively, use cudaMemcpyToSymbol to store data into constant memory directly. Constant variables behave like global variables: they are allocated upon declaration, so there is no need to call cudaMalloc for them.
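Putting the pieces together, a minimal sketch of the correct pattern (the name host_constant_a is yours; the 16 * 16 size and the verification kernel are my assumptions):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// The whole array lives in constant memory, not just a pointer to it.
__constant__ double constant_a[16 * 16];

__global__ void read_kernel(double* out) {
    int idx = threadIdx.x;       // one element per thread
    out[idx] = constant_a[idx];  // served by the constant cache
}

int main() {
    double host_constant_a[16 * 16];
    for (int k = 0; k < 16 * 16; ++k) host_constant_a[k] = (double)k;

    // Copy into constant memory via the symbol, not via a host-side address.
    cudaMemcpyToSymbol(constant_a, host_constant_a, sizeof(host_constant_a));

    double* dev_out;
    cudaMalloc(&dev_out, 16 * 16 * sizeof(double));
    read_kernel<<<1, 16 * 16>>>(dev_out);

    double host_out[16 * 16];
    cudaMemcpy(host_out, dev_out, sizeof(host_out), cudaMemcpyDeviceToHost);
    printf("%f\n", host_out[42]);  // expect 42.0 if everything worked
    cudaFree(dev_out);
    return 0;
}
```

Note there is no cudaMalloc for constant_a itself, and no attempt to take its address on the host.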
This declares a separate array for each thread. Since it is big, it cannot fit into the 16 (or 32) available 32-bit registers, so it is stored in local memory. Physically, local memory resides in global memory. As a result, all that loop does is copy data from one place in global memory to another place in global memory!
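If the intent was a fast on-chip staging buffer, __shared__ memory is the usual fix: one copy per block in on-chip memory, cooperatively filled by the threads. A hedged sketch, assuming a 16x16 tile and a 16x16 thread block:

```cuda
__constant__ double constant_a[16 * 16];

__global__ void staged_kernel() {
    // One on-chip copy per block, instead of one local-memory copy per thread.
    __shared__ double tile[16][16];

    int i = threadIdx.y;
    int j = threadIdx.x;
    tile[i][j] = constant_a[i * 16 + j];  // each thread loads one element
    __syncthreads();                      // tile is now fully populated

    // ... use tile[i][j] here ...
}
```

This way each element crosses the memory bus once, and subsequent accesses hit shared memory.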