Copy the upper triangle of a matrix to the lower triangle

I am trying to implement what I hoped would be a simple routine: copying the upper triangle of a square matrix (excluding the diagonal) into the lower triangle of the same matrix. I have watched the introductory CUDA presentation and have some basic experience getting kernels working. My kernel looks like this:

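__global__ void copy_upper_to_lower_kernel(int dimx, int dimy, int* a)
{
	// Column (ix) and row (iy) handled by this thread.
	int ix = blockIdx.x * blockDim.x + threadIdx.x;
	int iy = blockIdx.y * blockDim.y + threadIdx.y;

	// Stay inside the matrix and touch only elements below the diagonal.
	if (ix < dimx && iy < dimy && iy > ix)
	{
		// Row-major indexing: row * width + column.
		int id_dest = iy * dimx + ix;
		int id_src = ix * dimx + iy;
		a[id_dest] = a[id_src];
	}
}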

What this appears to give me, however, is a matrix of zeros. I’ve tested that the kernel works in a functional sense by replacing the assignment with a[id_dest] = 7, and I get a lower triangular matrix full of 7s.

Any help you can offer is gratefully received.

Here is the output. Note that the “before” matrix is filled with “7” in the lower triangle, to rule out the possibility that I’m copying the lower triangle into the upper triangle.

Before
1234
7234
7734
7774

After
0000
0000
0000
0000

I had the arguments the wrong way around in the call to cudaMemcpy: I was copying the newly allocated, empty device memory into host memory. All fixed.
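For anyone hitting the same symptom, here is a minimal sketch of the full round trip, mainly to show the argument order of cudaMemcpy (destination first, then source). The helper cuda_ok, the wrapper copy_upper_to_lower, the single size parameter n, and the 16x16 block size are my own assumptions for illustration and are not taken from the original program.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): report a failed CUDA call.
static bool cuda_ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Copy the strict upper triangle of an n x n row-major matrix into its lower triangle.
__global__ void copy_upper_to_lower_kernel(int n, int* a)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int iy = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (ix < n && iy < n && iy > ix) {
        a[iy * n + ix] = a[ix * n + iy];
    }
}

// Hypothetical host wrapper: returns the updated matrix, or an empty vector on error.
std::vector<int> copy_upper_to_lower(const std::vector<int>& host_in, int n)
{
    std::vector<int> host_out;
    if (n <= 0) return host_out;
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(int);
    assert(host_in.size() == count);  // caller must pass exactly n * n elements

    int* dev = nullptr;
    if (!cuda_ok(cudaMalloc(&dev, bytes), "cudaMalloc")) return host_out;

    // cudaMemcpy(dst, src, count, kind): the destination comes first.
    // Host -> device: dst is the device pointer, src is the host data.
    bool ok = cuda_ok(cudaMemcpy(dev, host_in.data(), bytes, cudaMemcpyHostToDevice),
                      "cudaMemcpy (host to device)");
    if (ok) {
        dim3 block(16, 16);  // assumed block size; any shape covering the matrix works
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        copy_upper_to_lower_kernel<<<grid, block>>>(n, dev);
        ok = cuda_ok(cudaGetLastError(), "kernel launch");
    }
    if (ok) {
        host_out.resize(count);
        // Device -> host: dst is the host buffer, src is the device pointer.
        ok = cuda_ok(cudaMemcpy(host_out.data(), dev, bytes, cudaMemcpyDeviceToHost),
                     "cudaMemcpy (device to host)");
        if (!ok) host_out.clear();
    }
    cudaFree(dev);
    return host_out;
}
```

Called with the 4x4 "Before" matrix above and n = 4 on a machine with a working CUDA device, this should return the rows 1234, 2234, 3334, 4444. Checking the return value of every runtime call is what would have pointed at the transfer rather than the kernel: swapping dst and src while keeping the kind flag unchanged typically makes cudaMemcpy fail with an invalid-argument error instead of failing silently.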