Constant memory seems to be the right place to put global parameters for the kernel threads.
So I placed everything in a struct, of which I keep two copies, one in CPU memory and one in GPU memory:
struct Context {
    int n;
    float *ptr;
    ... many other parameters ...
};

static Context h_context;                    // host copy, filled on the CPU
__device__ __constant__ Context d_context;   // device copy, lives in constant memory
Then, my typical kernel code is:
__global__ void myKernel()
{
    if( threadIdx.x < d_context.n )
        d_context.ptr[ threadIdx.x ] = 0.0f;
    ...
}
When I call that (stupidly) simple and very short kernel with:
h_context.n = 200;
cudaMalloc(&h_context.ptr, h_context.n * sizeof(float));
cudaMemcpyToSymbol(d_context, &h_context, sizeof h_context);
myKernel<<<256,256>>>();
the program blocks for 5 seconds, then the Windows watchdog times out and I get a launch error…
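(For the record, the failure shows up with the usual error-check pattern around the launch; this is just a minimal sketch, assuming the standard runtime calls and the usual includes:)

myKernel<<< 256, 256 >>>();
cudaError_t err = cudaGetLastError();     // catches launch-time errors
if( err == cudaSuccess )
    err = cudaThreadSynchronize();        // this is where the watchdog timeout surfaces
if( err != cudaSuccess )
    printf( "kernel failed : %s\n", cudaGetErrorString( err ) );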
Further investigation shows that:

1. My program is correct: if I just remove __constant__ in the code above, everything works; slowly, but fine.
2. Slowly, because nvcc uses many, many registers, most probably to copy the whole struct into registers (call it a poor man's caching policy), so I end up with a low number of threads per multiprocessor.
3. If I copy the struct myself into shared memory (from the device copy) or into local memory, it is still slow: nvcc still mirrors the whole shared/local copy in registers (see the sketch at the end of this post).
4. If I use -maxrregcount to limit the registers used, it is worse: the cubin output shows things end up in local memory.
5. :) But I found a kludge: putting a __syncthreads() as the first instruction of my kernel fixes the problem (see the sketch right after this list)…
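Concretely, the kludge in point 5 is nothing more than this (same kernel as above, only the barrier added):

__global__ void myKernel()
{
    __syncthreads();   // the kludge: with this barrier as the first instruction, the launch no longer times out
    if( threadIdx.x < d_context.n )
        d_context.ptr[ threadIdx.x ] = 0.0f;
    // ... rest of the kernel unchanged ...
}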
I do suspect a driver/compiler bug…
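For completeness, the shared-memory staging I mention in point 3 looks roughly like this (just a sketch, with a made-up kernel name; one thread copies the struct out of constant memory, then the whole block reads the local copy):

__global__ void myKernelSharedCopy()
{
    __shared__ Context s_context;       // per-block copy of the parameters
    if( threadIdx.x == 0 )
        s_context = d_context;          // stage the struct from constant memory once per block
    __syncthreads();                    // make the copy visible to every thread of the block
    if( threadIdx.x < s_context.n )
        s_context.ptr[ threadIdx.x ] = 0.0f;
}

Even written like that, nvcc still mirrors the copy in registers, so it stays slow.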