Question Regarding an Array of Pointers on the Device

Hi. I am trying to send a large data structure from the host to the device. It is an array of pointers, each pointing to an array of floating-point data, and I want each thread block to process one array.

However, when I compile, I get the following message (which I think is relevant to this problem):

/tmp/tmpxft_00006618_00000000-7_SpMVBlock.cpp3.i(176): Advisory: Cannot tell what pointer points to, assuming global memory space

When I have each thread load a value from global memory, write it into another structure, and copy that structure back to the host to check the numbers, I get seemingly random values ranging from 0 to -137241480838288389375466143744.0000000000, along with a lot of “nan”.

I was wondering whether what I am trying to do is even possible, and if so, what I might be doing wrong. I would greatly appreciate any advice.

I’ve run the code in emulation mode and it seems to work fine (almost). I have a slight problem because the emulator seems to execute each thread to the end before it executes the next thread, and my code requires the threads to sync at certain points for coalesced memory loads into shared memory.

I’ve included my relevant code below.

Thanks.

P.S. I’m also getting a lot of incoherent loads in emulation mode. Does anybody have any idea why that might be?

============

Host:

    // Allocate memory to store pointers to arrays.
    cudaMalloc( (void**) &Ad_values, (numRows*sizeof(DTYPE*)) );
    cudaMalloc( (void**) &Ad_colidx, (numRows*sizeof(int*)) );

    // Allocate memory for each array.
    for(i=0;i<numRows;i++) {
            // compute how much data to allocate
            tmp1 = 0;
            tmp2 = (i*maxNumBlocks);
            for(j=0;j<A_num_blk_per_row[i];j++) {
                    tmp1 = tmp1 + A_max_rowsize_per_row[tmp2];
                    tmp2++;
            }
            // allocate memory on the device.
            cudaMalloc( (void**) &(Ad_values[i]), (tmp1*BLOCK_SIZE_Y*sizeof(DTYPE)) );
            cudaMalloc( (void**) &(Ad_colidx[i]), (tmp1*BLOCK_SIZE_Y*sizeof(int)) );
    }

    // transfer the data from the host to the device.
    for(i=0;i<numRows;i++) {
            // compute how much data to send.
            tmp1 = 0;
            tmp2 = (i*maxNumBlocks);
            for(j=0;j<A_num_blk_per_row[i];j++) {
                    tmp1 = tmp1 + A_max_rowsize_per_row[tmp2];
                    tmp2++;
            }
            cudaMemcpy( Ad_values[i], A_values_reformat[i], (tmp1*BLOCK_SIZE_Y*sizeof(DTYPE)), cudaMemcpyHostToDevice);
            cudaMemcpy( Ad_colidx[i], A_colidx_reformat[i], (tmp1*BLOCK_SIZE_Y*sizeof(int)), cudaMemcpyHostToDevice);
    }

    // launch the kernel.
    SpMVBlock_kernel<<<dimGrid, dimBlock>>>(Ad_values, Ad_colidx, Xd, Yd, m);

Kernel:

    float* baseAddr1;
    int* baseAddr2;

    bid = blockIdx.x;
    tid = threadIdx.x;

    baseAddr1 = Ad_values[bid];
    baseAddr2 = Ad_colidx[bid];

    Yd[(bid*BLOCK_SIZE_Y)+tid] = baseAddr1[tid];

============

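You cannot dereference a CUDA device memory pointer on the host. That is exactly what happens when you use the array of pointers, which is allocated in GPU global memory, to store the per-row pointers: Ad_values is a device address, so cudaMalloc() is asked to write its result to &(Ad_values[i]) as if it were host memory, and cudaMemcpy() is passed Ad_values[i], which the host has to read from that device address. In emulation mode device memory is ordinary host memory, which is likely why the same code appears to work there.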

I believe you can allocate the array of pointers on the host, store each GPU memory pointer returned by cudaMalloc() in this host array, and then use cudaMemcpy() to copy the array of pointers to GPU global memory.
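
To make the suggestion concrete, here is a minimal sketch of that approach. It is not the original SpMVBlock code: the kernel copy_rows, the helper run_pointer_table_demo, the CHECK macro, and the sizes (4 rows of 32 floats) are all made up for illustration, and every row has the same length to keep the example short. The point it tries to show is that the pointer table is filled in host memory first and only then copied to the device in one cudaMemcpy().

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: block b copies the first blockDim.x values of row b.
__global__ void copy_rows(float* const* rows, float* out)
{
    // Dereferencing the pointer table is legal here because we are on the device.
    const float* row = rows[blockIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = row[threadIdx.x];
}

bool run_pointer_table_demo()
{
    const int numRows = 4;   // assumed sizes, for illustration only
    const int rowLen  = 32;

    // 1. Host-side table that will hold the device pointers.
    std::vector<float*> h_rows(numRows);
    std::vector<float>  h_data(rowLen);

    for (int i = 0; i < numRows; ++i) {
        for (int j = 0; j < rowLen; ++j)
            h_data[j] = 100.0f * i + j;
        // cudaMalloc stores the new device address in HOST memory (h_rows[i]).
        CHECK(cudaMalloc((void**)&h_rows[i], rowLen * sizeof(float)));
        CHECK(cudaMemcpy(h_rows[i], h_data.data(), rowLen * sizeof(float),
                         cudaMemcpyHostToDevice));
    }

    // 2. Copy the filled-in pointer table itself to the device.
    float** d_rows = NULL;
    CHECK(cudaMalloc((void**)&d_rows, numRows * sizeof(float*)));
    CHECK(cudaMemcpy(d_rows, h_rows.data(), numRows * sizeof(float*),
                     cudaMemcpyHostToDevice));

    float* d_out = NULL;
    CHECK(cudaMalloc((void**)&d_out, numRows * rowLen * sizeof(float)));

    // 3. One block per row, one thread per element.
    copy_rows<<<numRows, rowLen>>>(d_rows, d_out);
    CHECK(cudaGetLastError());

    std::vector<float> h_out(numRows * rowLen);
    CHECK(cudaMemcpy(h_out.data(), d_out, numRows * rowLen * sizeof(float),
                     cudaMemcpyDeviceToHost));

    // Verify that every element made the round trip unchanged.
    bool ok = true;
    for (int i = 0; i < numRows; ++i)
        for (int j = 0; j < rowLen; ++j)
            if (h_out[i * rowLen + j] != 100.0f * i + j)
                ok = false;

    // Free the rows through the host copy of the table, never through d_rows[i].
    for (int i = 0; i < numRows; ++i)
        CHECK(cudaFree(h_rows[i]));
    CHECK(cudaFree(d_rows));
    CHECK(cudaFree(d_out));
    return ok;
}

int main()
{
    bool ok = run_pointer_table_demo();
    printf("%s\n", ok ? "round trip matched" : "round trip MISMATCH");
    return ok ? 0 : 1;
}
```

Applied to the code in the question, this would mean keeping host arrays of numRows pointers for the values and the column indices, passing &hostArray[i] to cudaMalloc(), and copying those host arrays into Ad_values and Ad_colidx once all rows are allocated. Checking the return value of each runtime call, as the CHECK macro does here, should also make this kind of mistake show up as an error instead of as garbage results.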