Variable number of kernel parameters: address of kernel parameter stack?

I would like to pass a variable number of parameters to a kernel, with the first parameter holding the number of parameters that follow. My thought is to push the parameter count, then push each parameter, and inside the kernel access the parameters via an offset from the address of the first parameter (after the count parameter). However, while the parameters can be passed in successfully, accessing them via this offset method fails (it retrieves zero rather than the correct value).

For example, here’s a kernel that takes the address of the first parameter in the variable list, ‘first_param’, and uses offsets from it to retrieve subsequent parameters. The problem is that while it correctly reads the first parameter (params[0]), it fails on the second (params[1]). (params[0] is likely working only because the compiler smooths out the ‘&’ and array reference.) To simplify things, I launch it with one thread in one block via cudaConfigureCall(1, 1) and push each parameter via cudaSetupArgument(). Here’s the kernel:

__global__ void kernel(int *d_data, int nparams, int first_param)
{
    int *params = &first_param;
    d_data[0] = nparams;
    d_data[1] = params[0];
    d_data[2] = params[1];
}

I then thought that maybe it wasn’t even allocating space for the second parameter, so I tried another kernel that had second_param in the argument list. Directly using the variable’s name (‘second_param’) worked but using the offset reference (params[1]) failed again. Here’s this second kernel:

__global__ void kernel(int *d_data, int nparams, int first_param, int second_param)
{
    int *params = &first_param;
    d_data[0] = nparams;
    d_data[1] = params[0];
    d_data[2] = params[1];
    d_data[3] = second_param;
}

Below I’ve pasted the full test file for the second case. It retrieves the first parameter just fine, but returns zero for the second parameter.

#include <stdlib.h>
#include <stdio.h>

__global__ void kernel(int *d_data, int nparams, int first_param, int second_param)
{
    int *params = &first_param;
    d_data[0] = nparams;
    d_data[1] = params[0];
    d_data[2] = params[1];
    d_data[3] = second_param;
}

int main(void)
{
    int nparams = 2, params[] = { 10, 11 };

    /* zero out device memory */
    size_t sz = (nparams+2) * sizeof(params[0]);
    int *d_data, *h_data = (int *)malloc(sz);
    cudaMalloc((void**)&d_data, sz);
    cudaMemset(d_data, 0, sz);

    /* push parameters on stack and launch */
    cudaConfigureCall(1, 1);
    size_t off = 0;
    cudaSetupArgument(d_data, off);
    off += sizeof(d_data);
    cudaSetupArgument(nparams, off);
    off += sizeof(nparams);
    for (int i = 0; i < nparams; i++) {
        cudaSetupArgument(params[i], off);
        off += sizeof(params[i]);
    }
    cudaLaunch(kernel);

    /* pull back results and display */
    cudaMemcpy(h_data, d_data, sz, cudaMemcpyDeviceToHost);
    for (int i = 0; i < nparams + 2; i++)
        printf("h_data[%d] %d\n", i, h_data[i]);

    return 0;
}

The output (below) correctly prints the number of parameters (2) and the first parameter (10), but the offset method wrongly returns zero for the second parameter. Accessing the second parameter directly by name correctly returns 11.

h_data[0] 2
h_data[1] 10
h_data[2] 0
h_data[3] 11
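
As a side note for readers on newer toolkits: the cudaConfigureCall()/cudaSetupArgument()/cudaLaunch() sequence used above was later deprecated, and the same one-thread launch can be sketched with cudaLaunchKernel(), which takes an array holding one pointer per kernel argument. The names kernel4 and launch_with_arg_array are made up for this sketch, and the kernel reads its parameters only by name, so it says nothing about the params[1] problem itself.

```cuda
#include <cuda_runtime.h>

// Four-argument kernel modeled on the one in the post, without the params[] indexing.
__global__ void kernel4(int *d_data, int nparams, int first_param, int second_param)
{
    d_data[0] = nparams;
    d_data[1] = first_param;
    d_data[2] = second_param;
}

// Hypothetical helper: launches kernel4 with one thread in one block and
// copies the three result ints back. Returns the first CUDA error seen.
cudaError_t launch_with_arg_array(int nparams, int p0, int p1, int out[3])
{
    int *d_data = NULL;
    cudaError_t err = cudaMalloc((void **)&d_data, 3 * sizeof(int));
    if (err != cudaSuccess) return err;

    // One pointer per kernel argument, in declaration order.
    void *args[] = { &d_data, &nparams, &p0, &p1 };
    err = cudaLaunchKernel((const void *)kernel4, dim3(1), dim3(1), args, 0, 0);

    // cudaMemcpy waits for the kernel to finish before copying.
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_data, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err;
}
```

Because the argument array has to match the kernel's declared parameter list, this form does not by itself allow a variable number of parameters.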

Considering alternatives, I would rather not use cudaMalloc() to send over only a few bytes. I was hoping to get away with using the function stack.

Any ideas?

It looks like CUDA may just be unable to take the address of a kernel parameter, which is why the params[1] access above was not working. I ended up pushing the parameters onto the call stack and was able to get them into the extern shared memory.
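
One portable way to stay on the parameter list, sketched here as an assumption rather than as what was done above, is to wrap the count and a fixed-capacity array in a struct and pass that struct by value. ParamPack, MAX_PARAMS, pack_kernel and run_pack are hypothetical names, and the cap of 8 is arbitrary. cudaMalloc() is still used, but only for the output buffer, as in the original test.

```cuda
#include <cuda_runtime.h>

#define MAX_PARAMS 8   /* hypothetical upper bound on the parameter count */

/* Hypothetical wrapper: the count plus a fixed-capacity array, passed by value. */
struct ParamPack {
    int count;
    int values[MAX_PARAMS];
};

__global__ void pack_kernel(int *d_data, ParamPack pack)
{
    d_data[0] = pack.count;
    /* Indexing an array member is well defined, unlike stepping past &first_param. */
    for (int i = 0; i < pack.count; i++)
        d_data[i + 1] = pack.values[i];
}

/* Hypothetical host helper: copies n values into a pack, launches one thread,
   and reads back n + 1 ints (the count followed by the values). */
cudaError_t run_pack(const int *params, int n, int *out)
{
    if (n < 0 || n > MAX_PARAMS) return cudaErrorInvalidValue;

    ParamPack pack = {};
    pack.count = n;
    for (int i = 0; i < n; i++)
        pack.values[i] = params[i];

    int *d_data = NULL;
    size_t sz = (n + 1) * sizeof(int);
    cudaError_t err = cudaMalloc((void **)&d_data, sz);
    if (err != cudaSuccess) return err;

    pack_kernel<<<1, 1>>>(d_data, pack);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_data, sz, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err;
}
```

Calling run_pack with { 10, 11 } and n = 2 should fill out with 2, 10, 11. The trade-off is that the full MAX_PARAMS-sized struct is passed on every launch, even when only a few values are used.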