Use constant memory to pass kernel parameters as a struct

I need to use the constant memory to pass a bunch of parameters as a struct to a kernel.

To do so, in my .cl file, I define the struct:

typedef struct KernelParams {
  float4 ps,c0;
  float4 maxidx;
  uint4  dimlen,cp0,cp1;
  uint2  cachebox;
  float  minstep;
  float  twin0,twin1,tmax;
  float  oneoverc0;
  unsigned int isrowmajor,save2pt,doreflect,dorefint,savedet;
  float  Rtstep;
  float  minenergy;
  float  skipradius2;
  float  minaccumtime;
  unsigned int maxdetphoton;
  unsigned int maxmedia;
  unsigned int detnum;
  unsigned int idx1dorig;
  unsigned int mediaidorig;
} KParam __attribute__ ((aligned (16)));


__kernel void main_loop( ..., __constant KParam gcfg[]){


   if(gcfg->doreflect && ...){




In my host code, I define exactly the same struct in a header file:


KParam param={...};
cl_mem gparam;

mcx_assess(clSetKernelArg(kernel,15, sizeof(cl_mem), (void*)&gparam));
mcx_assess(clEnqueueNDRangeKernel(commands,kernel,1,NULL,mcgrid,mcblock, 0, NULL, NULL));


When I run this code on an ATI card, it works and the values are passed into the kernel correctly. However, when I run it on an NVIDIA card with CUDA 3.2, the program build phase fails with a segfault. If I replace “if(gcfg->doreflect && …)” with “if(0)”, the program compiles but the results are incorrect.

Did I do anything obviously wrong here? I am particularly interested in whether my alignment settings and the order of the struct members are OK.

Any comments are welcome! Thanks in advance.

By the way, I am aware that the maximum number of constant parameters to a kernel is 8; my kernel has only 4 __constant parameters, and gcfg is the last one.

We often had the same kind of problem …
Check that your struct is aligned on at least a 128-bit boundary (= a multiple of 4 ints).
Print sizeof(KernelParams) on both your host and your device; the two values should be the same.
If they differ, or the size is not a multiple of 128 bits, add padding with a few dummy ints.

Thanks. I printed sizeof(KParam) on the host and on the device, and the sizes are different: 180 on the host and 192 on the device. I then prefixed all member types with cl_ in the host code (for example, float4 -> cl_float4), and the sizes now match.

But I am still having difficulty running this code on my GTX 470, and the results appear to be random: about 80% of the time I get an “Out of resources” error, and the remaining 20% of the time I get results, but they look strange. I am now commenting out the code line by line to debug it. My impression is that NVIDIA’s OpenCL compiler is buggy. If any of the developers are interested in testing my code, please email me.

I took a crude approach to alignment problems. I returned all the parameters from the constant structure in an output buffer, which I read back and compared on the CPU.
Where the results stopped matching, I added padding. Eventually it all worked. I am sure I should not have to do this.