Complex data structure yields "Advisory" warnings: pre-determined structures in memory

I need each thread in my kernel to traverse a shared, complex data structure known at compile time. Two questions:

  1. What is the best memory or method for this?

  2. If the method I post below is okay, what can I do about the Advisory messages?

I’ve looked through several related posts. People suggest textures for lookup tables, but here the data is heterogeneous and not easily laid out on a texture. People have suggested allocating space on the card and copying host-to-device, but that seems tedious for complicated data structures, and I would like to compile to a cubin. Basically, I would rather write the data structure directly into the kernel.

Currently, I just auto-initialize the data structure right there in the function body. However, the pointer-to-pointer indexing raises an Advisory: the compiler assumes the pointers refer to global memory, but since the data actually lives in local memory (or somewhere else), invalid memory is accessed. Here is a related example with output (error checking omitted to simplify the presentation). On a 2x2 block, each thread reads the element of a pre-defined data structure at its own row and column.

#include <stdio.h>

__global__ void copy_gmemA(int* g_odata)
{
    /* reference data structure */
    int foo_A[] = {1, 2}; /* first row */
    int foo_B[] = {3, 4}; /* second row */
    int *const_vals[] = {foo_A, foo_B};

    int idx = threadIdx.x * blockDim.y + threadIdx.y;
    g_odata[idx] = const_vals[threadIdx.x][threadIdx.y];
}

int main(void)
{
    /* allocate space for kernel output on device and host */
    int len = 4;
    int *d_odata, *h_odata;
    cudaMalloc((void**)&d_odata, sizeof(int) * len);
    h_odata = (int *)malloc(sizeof(int) * len);

    /* launch kernel */
    dim3 grid(1, 1, 1);
    dim3 threads(2, 2, 1);
    copy_gmemA<<<grid, threads>>>(d_odata);

    /* pull memory back onto host and display */
    cudaMemcpy(h_odata, d_odata, sizeof(int) * len, cudaMemcpyDeviceToHost);
    for (int i = 0; i < len; i++)
        printf("%d ", h_odata[i]);
    printf("\n");

    return 0;
}

Which outputs:

bash$ nvcc test.cu && ./a.out
"/tmp/tmpxft_0000b623_00000000-5_test.i", line 11: Advisory: Cannot tell what pointer points to, assuming global memory space
0 0 0 0

Here’s another way of trying to lay it out in the kernel, but this yields both the Advisory and an assembler error when the compiler tries to emit instructions (code and output pasted below). I’m assuming that when it tries to emit ‘foo’, it doesn’t know the addresses of ‘foo_A’ and ‘foo_B’.

I’ve found several other related posts, but nothing involving a small pre-defined data structure linked up via pointers. I would rather avoid cudaMemcpyToSymbol() because the structure I intend to use this on is complicated and tightly linked:

Transfering a list of pointers to the device

Copying structures with pointers

Uniform lookup tables

Using cudaMemcpyToSymbol()

#include <stdio.h>

/* reference data structure */
__constant__ int foo_A[] = {1, 2};
__constant__ int foo_B[] = {3, 4};
__constant__ int *foo[] = {foo_A, foo_B};

__global__ void foo_kernel(int* g_odata)
{
    int idx = threadIdx.x * blockDim.y + threadIdx.y;
    g_odata[idx] = foo[threadIdx.x][threadIdx.y];
}

int main(void)
{
    /* allocate space for kernel output on device and host */
    int len = 4;
    int *d_odata, *h_odata;
    cudaMalloc((void**)&d_odata, sizeof(int) * len);
    h_odata = (int *)malloc(sizeof(int) * len);

    /* launch kernel */
    dim3 grid(1, 1, 1);
    dim3 threads(2, 2, 1);
    foo_kernel<<<grid, threads>>>(d_odata);

    /* pull memory back onto host and display */
    cudaMemcpy(h_odata, d_odata, sizeof(int) * len, cudaMemcpyDeviceToHost);
    for (int i = 0; i < len; i++)
        printf("%d ", h_odata[i]);
    printf("\n");

    return 0;
}

Output:

bash$ nvcc test.cu
"/tmp/tmpxft_0000bca5_00000000-5_test.i", line 11: Advisory: Cannot tell what pointer points to, assuming global memory space
### Assertion failure at line 906 of ../../be/cg/NVISA/cgemit_targ.cxx:
### Compiler Error in file /tmp/tmpxft_0000bca5_00000000-5_test.i during Assembly phase:
### NYI initv kind 1
nvopencc INTERNAL ERROR: /usr/local/cuda/bin/../open64/lib//be returned non-zero status 1

For algorithmic simplicity, I would much rather follow pointers through this data structure, but I can hack the structure up so that everything is laid out serially, using linear indices instead of pointers to stitch it all together.

There are only about three sub-structures in the data structure I intend to use, and instances of these structures could be packed end-to-end. I’m guessing this would lead to better alignment and hence faster fetching.

Question: is the best approach to lay it all out like this in constant memory?

Following the examples above, it might look something like the code below, where ‘foo_data’ contains both rows laid out one after the other and ‘foo_rows’ contains the linear offset where each row starts. This compiles without error and runs correctly.

#include <stdio.h>

/* reference data structure */
__constant__ int foo_data[] = {1, 2, 3, 4};
__constant__ int foo_rows[] = {0, 2}; /* indices into foo_data where each row starts */

__global__ void copy_gmemA(int* g_odata)
{
    int idx = threadIdx.x * blockDim.y + threadIdx.y;
    g_odata[idx] = foo_data[foo_rows[threadIdx.x] + threadIdx.y];
}

int main(void)
{
    /* allocate space for kernel output on device and host */
    int len = 4;
    int *d_odata, *h_odata;
    cudaMalloc((void**)&d_odata, sizeof(int) * len);
    h_odata = (int *)malloc(sizeof(int) * len);

    /* launch kernel */
    dim3 grid(1, 1, 1);
    dim3 threads(2, 2, 1);
    copy_gmemA<<<grid, threads>>>(d_odata);

    /* pull memory back onto host and display */
    cudaMemcpy(h_odata, d_odata, sizeof(int) * len, cudaMemcpyDeviceToHost);
    for (int i = 0; i < len; i++)
        printf("%d ", h_odata[i]);
    printf("\n");

    return 0;
}

I wanted to copy a structure containing a pointer from the host to the device’s global memory, and then from global memory to shared memory. When copying from global memory to shared memory, I get the Advisory warning.

My structure is the following:

typedef struct __align__(8) {
    float* elements;
} Matrix;

Also, I can’t run the program.
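For reference, the usual pattern for a struct like this (a sketch only, with made-up sizes and data, not a drop-in fix for your program) is to allocate the elements buffer on the device first, point the host-side copy of the struct at that device address, and only then copy the struct itself across:

```cuda
#include <stdio.h>

typedef struct __align__(8) {
    float* elements;
} Matrix;

int main(void)
{
    float h_vals[4] = {1.f, 2.f, 3.f, 4.f}; /* hypothetical payload */

    /* 1. allocate the elements array in device global memory and fill it */
    float* d_elements;
    cudaMalloc((void**)&d_elements, sizeof(h_vals));
    cudaMemcpy(d_elements, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);

    /* 2. build a host-side Matrix whose pointer holds the DEVICE address,
       then copy the struct itself over (or pass it by value as a kernel
       argument) */
    Matrix h_m;
    h_m.elements = d_elements;

    Matrix* d_m;
    cudaMalloc((void**)&d_m, sizeof(Matrix));
    cudaMemcpy(d_m, &h_m, sizeof(Matrix), cudaMemcpyHostToDevice);

    /* ... launch a kernel taking Matrix* ... */

    cudaFree(d_m);
    cudaFree(d_elements);
    return 0;
}
```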

I don’t think you’re doing anything wrong; the compiler is just easily confused. I hear CUDA 2.1 will be much better at figuring out convoluted pointers. (The compiler has to do something the C language doesn’t support: keep track of whether a pointer refers to local, global, shared, or constant memory when it is dereferenced. I guess just using a distinct address range for each space was too simple.)

If you play around with the syntax (not necessarily resort to indices) you might get it to work.

Or you can just do:
g_odata[idx] = const_vals[threadIdx.y*stride + threadIdx.x];
which is the most common way to access arrays.