Using Constant Memory to Hold Parameters

Hello! I have a CUDA program where I call a few kernels many thousands of times. These kernels take struct arguments that contain a few dozen data members. I understand that every time a kernel is launched its arguments must be copied to the GPU (I think into constant memory). Since most of these arguments do not change between launches, I thought I could store them directly in constant memory myself and avoid that small per-launch overhead.
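
To give a rough idea, the current pattern is something like the sketch below (the struct, member names, and kernel are simplified placeholders, not my real code):

// Simplified stand-in for my real parameter struct: in practice it has a
// few dozen scalars/flags plus some array pointers.
struct Params {
    int   n;      // problem size (placeholder)
    float scale;  // some constant coefficient (placeholder)
    int*  a;      // array allocated on the GPU
};

__global__ void my_kernel(const Params p, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        out[i] = p.scale * p.a[i];  // 'p' is passed by value at every launch
}

// Launched many thousands of times with a mostly unchanged struct:
// my_kernel<<<blocks, threads>>>(data_cpu, out_dev);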

The first question is: is there any real performance gain to be had from this switch to constant memory?

The second question is: how? It looks like one cannot simply put a struct in constant memory (CUDA C Programming Guide v4.2, section D.2.1.1). I therefore decided to painfully put each data member into constant memory individually (see the sketch after the list below for the scalar members), but somewhere my logic or implementation goes wrong:
1- Some of the members are arrays.
2- I do not want to store the arrays themselves in constant memory.
3- I only need a pointer to each array in constant memory.
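
For the plain scalar and flag members the idea seems straightforward; a minimal sketch of what I mean, reusing the placeholder Params struct from above (names made up):

// One __constant__ variable per scalar member (placeholder names).
__constant__ int   c_n;
__constant__ float c_scale;

void upload_scalars(const Params& data_cpu)
{
    // Copy the rarely changing scalars into constant memory once,
    // instead of passing them as kernel arguments at every launch.
    cudaMemcpyToSymbol(c_n,     &data_cpu.n,     sizeof(data_cpu.n));
    cudaMemcpyToSymbol(c_scale, &data_cpu.scale, sizeof(data_cpu.scale));
}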

I am trying the following:
I create the struct on the CPU; its array members are allocated on the GPU, and the remaining members are constants or flags. I then try to cudaMemcpyToSymbol the GPU array's pointer into a __constant__ pointer-to-pointer, as in the sample below:

__constant__ int** my_struct_a;
...
    cudaError_t error1_a = cudaMemcpyToSymbol(my_struct_a, &(data_cpu.a), sizeof(int**));

where ‘data_cpu’ is the struct and the data member ‘a’ is an int* allocated on the GPU.
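
For context, ‘a’ is set up roughly like this beforehand (simplified; ‘n’ and ‘host_a’ stand in for my real sizes and host data):

int* dev_a = 0;
cudaMalloc((void**)&dev_a, n * sizeof(int));                         // array lives in global memory
cudaMemcpy(dev_a, host_a, n * sizeof(int), cudaMemcpyHostToDevice);
data_cpu.a = dev_a;   // struct stays on the CPU, this member points into GPU memory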

Although ‘error1_a’ reports success, the kernel fails when it tries to access the actual data. Debugging the kernel shows ‘my_struct_a’ as ‘???’, or simply “failed to read memory”.

How can I fix this? If you need any details to clarify things, I will be happy to provide them.

Any suggestions, please?

There was some discussion of constant memory on this list before.
It turns out that constant memory can have high overheads, which can make
it less attractive than it first appears. For example, if the threads of a
warp read different words they cannot do so in parallel; they must wait for
one another. Also, constant memory is not really on-chip memory at all, but
a small read-only cache in front of a 64KB region of global memory.
See section 8.4 of “Creating and Debugging Performance CUDA C”, W. B. Langdon.
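
To make the "different words" point concrete, a rough sketch (the table name and size are made up): when every thread of a warp reads the same constant word the value is broadcast in a single transaction, but per-thread indices are serialised within the warp.

__constant__ float coeff[256];   // made-up lookup table in constant memory

__global__ void broadcast_read(float* out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in the warp reads the same word coeff[k]:
    // the constant cache broadcasts it, so this is cheap.
    out[i] = coeff[k] * i;
}

__global__ void divergent_read(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread reads a different word of coeff: the accesses are
    // serialised within the warp, and on a cache miss they fall back
    // to the 64KB backing store in global memory.
    out[i] = coeff[i % 256];
}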