Hi. I am trying to send a large data structure from the host to the device. It is an array of pointers, each pointing to an array of floating-point data, and I want each thread block to process one of those arrays.
However, when I compile, I get the following message (which I think is relevant to this problem):
/tmp/tmpxft_00006618_00000000-7_SpMVBlock.cpp3.i(176): Advisory: Cannot tell what pointer points to, assuming global memory space
When I have each thread load a value from global memory, write it into another structure, and copy that structure back to the host to check the numbers, I get seemingly random values ranging from 0 to -137241480838288389375466143744.0000000000, and a lot of “nan”.
I was wondering whether what I am trying to do is even possible, and if so, what I might be doing wrong. I would greatly appreciate it if anybody knows what I should do.
I’ve run the code in emulation mode and it seems to work fine (almost). I have a slight problem there because the emulator seems to run each thread to completion before starting the next one, whereas my code requires the threads to sync at certain points for coalesced loads from global memory into shared memory.
I’ve included my relevant code below.
Thanks.
P.S. I’m also getting a lot of incoherent loads in emulation mode. Does anybody have any idea why that might be?
============
Host:
// Allocate memory to store pointers to arrays.
cudaMalloc( (void**) &Ad_values, (numRows*sizeof(DTYPE*)) );
cudaMalloc( (void**) &Ad_colidx, (numRows*sizeof(int*)) );

// Allocate memory for each array.
for(i=0;i<numRows;i++) {
    // compute how much data to allocate
    tmp1 = 0;
    tmp2 = (i*maxNumBlocks);
    for(j=0;j<A_num_blk_per_row[i];j++) {
        tmp1 = tmp1 + A_max_rowsize_per_row[tmp2];
        tmp2++;
    }
    // allocate memory on the device.
    cudaMalloc( (void**) &(Ad_values[i]), (tmp1*BLOCK_SIZE_Y*sizeof(DTYPE)) );
    cudaMalloc( (void**) &(Ad_colidx[i]), (tmp1*BLOCK_SIZE_Y*sizeof(int)) );
}

// transfer the data from the host to the device.
for(i=0;i<numRows;i++) {
    // compute how much data to send.
    tmp1 = 0;
    tmp2 = (i*maxNumBlocks);
    for(j=0;j<A_num_blk_per_row[i];j++) {
        tmp1 = tmp1 + A_max_rowsize_per_row[tmp2];
        tmp2++;
    }
    cudaMemcpy( Ad_values[i], A_values_reformat[i], (tmp1*BLOCK_SIZE_Y*sizeof(DTYPE)), cudaMemcpyHostToDevice);
    cudaMemcpy( Ad_colidx[i], A_colidx_reformat[i], (tmp1*BLOCK_SIZE_Y*sizeof(int)), cudaMemcpyHostToDevice);
}

SpMVBlock_kernel<<<dimGrid, dimBlock>>>(Ad_values, Ad_colidx, Xd, Yd, m);
Kernel:
float* baseAddr1;
int* baseAddr2;

bid = blockIdx.x;
tid = threadIdx.x;

baseAddr1 = Ad_values[bid];
baseAddr2 = Ad_colidx[bid];

Yd[(bid*BLOCK_SIZE_Y)+tid] = baseAddr1[tid];
============
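For comparison, here is a minimal sketch of one common way to hand a kernel an array of device pointers. It assumes the likely trouble spot is that Ad_values itself lives in device memory, so host code cannot safely read or write Ad_values[i]. The sketch builds the pointer table in host memory, fills each entry with cudaMalloc, and then copies the whole table to the device. The names ROWS, ROW_LEN, h_rows, d_table, copyRows and run_demo are hypothetical and not part of the code above.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ROWS    4   // hypothetical: number of arrays (one thread block each)
#define ROW_LEN 8   // hypothetical: elements per array (one thread each)

// Bail out of run_demo() on any CUDA error (a sketch: no cleanup on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Each block reads its own array through the device-resident pointer table.
__global__ void copyRows(float* const* table, float* out)
{
    int bid = blockIdx.x;
    int tid = threadIdx.x;
    const float* row = table[bid];
    out[bid * ROW_LEN + tid] = row[tid];
}

// Returns the number of mismatched elements, or -1 on a CUDA error.
int run_demo()
{
    std::vector<float> h_data(ROWS * ROW_LEN);
    for (int i = 0; i < ROWS * ROW_LEN; ++i) h_data[i] = (float)i;

    // Pointer table kept in HOST memory; each entry holds a device address.
    float* h_rows[ROWS];
    for (int i = 0; i < ROWS; ++i) {
        CHECK(cudaMalloc((void**)&h_rows[i], ROW_LEN * sizeof(float)));
        CHECK(cudaMemcpy(h_rows[i], &h_data[i * ROW_LEN],
                         ROW_LEN * sizeof(float), cudaMemcpyHostToDevice));
    }

    // Copy the filled-in table itself to the device.
    float** d_table = NULL;
    CHECK(cudaMalloc((void**)&d_table, ROWS * sizeof(float*)));
    CHECK(cudaMemcpy(d_table, h_rows, ROWS * sizeof(float*),
                     cudaMemcpyHostToDevice));

    float* d_out = NULL;
    CHECK(cudaMalloc((void**)&d_out, ROWS * ROW_LEN * sizeof(float)));

    copyRows<<<ROWS, ROW_LEN>>>(d_table, d_out);
    CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> h_out(ROWS * ROW_LEN);
    CHECK(cudaMemcpy(&h_out[0], d_out, ROWS * ROW_LEN * sizeof(float),
                     cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < ROWS * ROW_LEN; ++i)
        if (h_out[i] != h_data[i]) ++mismatches;

    // Free through the host copy of the table, never through d_table[i].
    for (int i = 0; i < ROWS; ++i) cudaFree(h_rows[i]);
    cudaFree(d_table);
    cudaFree(d_out);
    return mismatches;
}
```

Calling run_demo() from a main function is expected to return 0 when every block has read its own array correctly. Checking the return value of each CUDA call, as the CHECK macro does here, may also help narrow down where the garbage values first appear.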