trouble indexing 1D->3D array in my kernel


I am writing a CUDA kernel to create a 3x3 covariance matrix for each location in the rowscols main matrix. So that 3D matrix is rowscols*9 in size, which i allocated in a single malloc accordingly. I need to access this in a single index value

the 9 values of the 3x3 covariance matrix get their values set according to the appropriate row r and column c from some other 2D arrays.

In other words - I need to calculate the appropriate index to access the 9 elements of the 3x3 covariance matrix, as well as the row and column offset of the 2D matrices that are inputs to the value, as well as the appropriate index for the storage array.

i have tried to simplify it down to the following:

extern “C” cuComplex* calc_cov_wrapper(int rows, int cols){
//set up kernel dimensions
int TILE_WIDTH = 512;
int num_blocks_x = (int)(cols+TILE_WIDTH)/TILE_WIDTH;
int num_blocks_y = rows/3;
dim3 dimBlock(TILE_WIDTH, 1,9);
dim3 dimGrid(num_blocks_x, num_blocks_y);

int memsize = (rows/3)*cols*9*sizeof(cuComplex);
cutilSafeCall( cudaMalloc((void **)&d_cov,memsize));
h_cov = (cuComplex *)malloc(rows*cols*9*sizeof(cuComplex));

calc_covMatrix<<<dimGrid, dimBlock>>>(d_cov, d_buffer1, d_buffer3, d_buffer2, (rows/3), cols, TILE_WIDTH, 0, memsize);
cutilSafeCall(cudaMemcpy(&h_cov[0], d_cov, memsize, cudaMemcpyDeviceToHost));

//calc 2nd third
calc_covMatrix<<<dimGrid, dimBlock>>>(d_cov, d_buffer1, d_buffer3, d_buffer2, (rows/3), cols, TILE_WIDTH, (rows/3), memsize);
cutilSafeCall(cudaMemcpy(&h_cov[rows/3], d_cov, memsize, cudaMemcpyDeviceToHost));

//calc last third
memsize = (rows/3 + rows%3)*cols*9*sizeof(cuComplex);
cutilSafeCall( cudaMalloc((void **)&d_cov,memsize));
num_blocks_y = rows/3 + rows%3;
dim3 dimGrid3(num_blocks_x, num_blocks_y);
calc_covMatrix<<<dimGrid3, dimBlock>>>(d_cov, d_buffer1, d_buffer3, d_buffer2, (rows/3) + (rows%3), cols, TILE_WIDTH, 2*(rows/3), memsize);
cutilSafeCall(cudaMemcpy(&h_cov[2*rows/3], d_cov, memsize, cudaMemcpyDeviceToHost));
return h_cov;


global void calc_covMatrix(cuComplex* d_cov, cuComplex* d_buffer1,cuComplex* d_buffer3,cuComplex* d_buffer2, int rows, int cols, int TILE_WIDTH, int rowoffset, int dcovSize)
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
int r = rowoffset + by + ty;
int c = bxTILE_WIDTH + tx;
int offset = r

int ndx = (tz*cols+c)*9 + r;

if((r < rows) && (c < cols) && (ndx < dcovSize)){
cuComplex hhConj = cuConjf(d_buffer1[offset]);
cuComplex vvConj = cuConjf(d_buffer3[offset]);
const float sqrt2 = 1.414213562f;
cuComplex two_root = make_cuComplex(sqrt2, 0.0);
case 0:
d_cov[ndx] = make_cuComplex(cuCabsf(d_buffer1[offset])*cuCabsf(d_buffer1[offset]), 0.0);
case 1:

        	d_cov[ndx] = cuCmulf(two_root, cuCmulf(d_buffer1[offset], hhConj));
	case 2:
		d_cov[ndx] = cuCmulf(d_buffer1[offset], vvConj);
	case 3:
		d_cov[ndx] = cuCmulf(two_root, cuCmulf(d_buffer2[offset], hhConj));
	case 4:
		d_cov[ndx] = make_cuComplex(cuCabsf(d_buffer2[offset])*cuCabsf(d_buffer2[offset]), 0.0);
	case 5:
		d_cov[ndx] = cuCmulf(two_root, cuCmulf(d_buffer2[offset], vvConj));
	case 6:
		d_cov[ndx] = cuCmulf(d_buffer3[offset], hhConj);
	case 7:
		d_cov[ndx] = cuCmulf(two_root, cuCmulf(d_buffer3[offset], cuConjf(d_buffer2[offset])));

	case 8:
		d_cov[ndx] = make_cuComplex(cuCabsf(d_buffer3[offset])*cuCabsf(d_buffer3[offset]), 0.0);


I have added the full host and kernel code here. As you can see, I am trying to split the problem into 3rds based on the number of rows, because the full amount will not allocate on the GPU.

Any help is very much appreciated! I Basically can’t seem to figure out what my ndx values need to be within the kernel, or what they need to be in the host array returned by the wrapper. Are the GPU and CPU both memory aligned the same way? That is - they are both row major?

I see one error here is that I have indexed wrong the h_cov array for the cudaMemcpy that are happening. I am not sure how to set up the indexing for copying that so that it will work correctly and do it per 1/3 of the problem. Also, the results I am getting from this are giving me a heavy distorted image, but I cant tell if the calculations are incorrect, or if the memory is just not aligned as it should. I am leaning towards the latter. I also wonder - could this possibly be some kind of pitch issue? I am obviously just using the linear malloc to get the data into the GPU, but perhaps that is messing things up since it is an array of structs?

So I have been working on this and I have proved something to myself:

int ndx = blockDim.yblockDim.zby + blockDim.xblockDim.zc;

This line will calculate the correct index in the 3D array, to map to CPU 3D array defined as a[i][j][k] when using this macro to access the 1D version: #define N(x, y, z) ((xcolsdepth) + (depth*y) +z) where cols is the number of columns in the Y direction, and depth is the depth in the Z direction

However, in my kernel, I cannot for the life of me figure out how to make a (row,column) map to 2D array, when I want to do something like

3Darray[ndx] = 2darray[otherindex];

How can I figure out what the index is going to be in the 2D array, when I started my kernel with the parameters like so for my rowsxcols*9 sized 3D array:

int TILE_WIDTH=96;
int num_blocks_x = (int)(cols+TILE_WIDTH-1)/TILE_WIDTH; 
int num_blocks_y = rows;
dim3 dimBlock(TILE_WIDTH, 1, 9);
dim3 dimGrid(num_blocks_y, num_blocks_x);
long int memsize = rows*cols*9*sizeof(cuComplex);
cutilSafeCall( cudaMalloc((void **)&d_cov,memsize));
h_cov = (cuComplex *)malloc(rows*cols*9*sizeof(cuComplex));
cudaMemset(d_cov,0, memsize);
int maxIndex = rows*cols*9;

calc_covMatrix<<<dimGrid, dimBlock>>>(d_cov, rows, cols, TILE_WIDTH);

Can anyone answer how I would do this calculation in my kernel: 3Darray[ndx] = 2darray[otherindex]; where otherindex corresponds to the same row and column that was mapped in the ndx calculation?