__constant__ memory failed to load.

Constant memory seems to be the right place to put global parameters for the kernel threads.

So I placed everything in a struct, of which I keep two copies, one in CPU memory and one in GPU memory:

struct Context {
    int n;
    float *ptr;
    /* ... many other parameters ... */
};

static Context h_context;
__device__ __constant__ Context d_context;

Then, my typical kernel code is:

__global__ void myKernel()
{
    if (threadIdx.x < d_context.n)
        d_context.ptr[threadIdx.x] = 0.0f;
    ...
}

When I call that (stupidly) simple and very short kernel with:

h_context.n = 200;
cudaMalloc(&h_context.ptr, h_context.n * sizeof(float));
cudaMemcpyToSymbol(d_context, &h_context, sizeof h_context);
myKernel<<<256, 256>>>();

The program blocks for 5 seconds, then the Windows watchdog times out and I get a launch error…
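For completeness, this is roughly how the failure shows up (a minimal sketch; the error checks are just the standard runtime calls, the rest is the setup shown above):

cudaMemcpyToSymbol(d_context, &h_context, sizeof h_context);
myKernel<<<256, 256>>>();
cudaError_t err = cudaThreadSynchronize();   // blocks ~5 seconds, then the watchdog kills the launch
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));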

Further investigation shows that:

1. My program is correct: if I just remove the __constant__ qualifier in the code above, everything works, slowly, but fine.

2. Slowly, because nvcc uses many, many registers, most probably to copy the whole struct into registers (call it a poor man's caching policy), so I end up with a low number of threads per multiprocessor.

3. If I copy the struct myself into shared memory (from the device copy), or into local memory, it is still slow: nvcc still caches the whole shared/local copy in registers (see the sketch just after this list).

4. If I use -maxrregcount to minimise the number of registers used, it is worse: the cubin output shows that things end up in local memory.

5. :) But I found a kludge: putting a __syncthreads() as the first instruction of my kernel fixes the problem (sketched further below)…
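For point 3, the shared-memory variant I tried looks roughly like this (a sketch, not my exact code):

__global__ void myKernelSharedCopy()
{
    __shared__ Context s_context;
    if (threadIdx.x == 0)
        s_context = d_context;       // one thread stages the whole struct into shared memory
    __syncthreads();                 // make the copy visible to the whole block

    if (threadIdx.x < s_context.n)
        s_context.ptr[threadIdx.x] = 0.0f;
}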

I do suspect a driver/compiler bug…
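For reference, the kludge from point 5 is literally just one extra line at the top of the kernel (sketch):

__global__ void myKernel()
{
    __syncthreads();                 // the kludge: with this, the launch no longer fails

    if (threadIdx.x < d_context.n)
        d_context.ptr[threadIdx.x] = 0.0f;
    ...
}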

You seem to be misunderstanding what __constant__ does. __constant__ memory can only be modified from the host side, not from the device side. I'm surprised that your kernel compiles, as it shouldn't be allowed to write to d_context. Removing the __constant__ qualifier would fix the problem, but what you really want to do is use cudaMemcpyToSymbol() from your host code to load the appropriate data into d_context. Your kernel can then read that data, but not write back to it. If you want your kernel to be able to write back, you can't use the __constant__ qualifier.
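To illustrate the pattern I mean (a minimal made-up sketch, not your code; the names are invented):

__constant__ float d_coeffs[16];                       // only the host can fill this

__global__ void scaleByCoeffs(float *out)
{
    out[threadIdx.x] *= d_coeffs[threadIdx.x % 16];    // device code only reads it
}

// host side: the only legal way to modify __constant__ data
float h_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));
scaleByCoeffs<<<1, 16>>>(out_dev);                     // out_dev: a cudaMalloc()'ed buffer of 16 floats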

I would hope that the newer compiler will catch this error and fail the compile. I can’t say I’ve tried to do so though.

The kernel is not writing to d_context!

d_context.ptr is a pointer that is only read, to find the base address of a cudaMalloc()'ed block sitting in global memory. That is where the write actually occurs.

I still hit this random bug repeatedly, with several kernels and with every CUDA-capable driver I have found. My interpretation is that kernel execution might start too early, before the constant cache is ready… Compiler bug? Driver bug? Chip bug? Or just a bad dream???

PS. Running CUDA 1.0 on an 8800GTX / Windows / Driver 162.15.

Well… my guess is: just do not use a struct. If you don't want many memcpys, use one big int array (or char array) and manually cast everything.
I have seen nvcc break down when handling complicated structs, so I ended up not using them. Maybe it mistakenly thinks you're writing to the struct and copies it.
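Something like this is what I have in mind (a rough sketch; names, offsets and buffer size are invented, and you must keep each offset properly aligned for its type):

// one flat __constant__ buffer instead of a struct
__device__ __constant__ __align__(8) char d_params[256];

__global__ void myKernel()
{
    int    n   = *(int *)   (d_params + 0);   // cast the offsets back out by hand
    float *ptr = *(float **)(d_params + 8);
    if (threadIdx.x < n)
        ptr[threadIdx.x] = 0.0f;
}

// host side: build the same layout in a plain buffer, then one cudaMemcpyToSymbol()
char h_params[256];
*(int *)   (h_params + 0) = 200;        // n
*(float **)(h_params + 8) = devPtr;     // pointer returned by cudaMalloc()
cudaMemcpyToSymbol(d_params, h_params, sizeof(h_params));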