I am trying to implement what I hoped would be a simple routine: copying the upper triangle of a square matrix (excluding the diagonal) into the lower triangle of the same matrix. I have watched the introductory CUDA presentation and have some basic experience getting kernels working. My kernel looks like this:

```
__global__ void copy_upper_to_lower_kernel(int dimx, int dimy, int* a)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (iy > ix)
    {
        int id_dest = iy * dimy + ix;
        int id_src = ix * dimx + iy;
        a[id_dest] = a[id_src];
    }
}
```

However, what this appears to give me is a matrix of zeros. I have checked that the kernel itself runs by replacing the assignment with `a[id_dest] = 7`, in which case I get a lower-triangular matrix full of 7s.
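For anyone who wants to reproduce this, here is a minimal self-contained sketch of a test harness around the same copy. It is not the host code from my actual program. It assumes a row-major n x n matrix of `int`, a 16 x 16 thread block, and a single dimension `n` in place of `dimx` and `dimy`. The names `copy_upper_to_lower_checked`, `copy_upper_to_lower` and `CUDA_CHECK` are hypothetical helpers introduced only for this example. It also adds a bounds check, which the kernel above does not have.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Same copy as the kernel above, assuming a square n x n row-major matrix,
// plus a bounds check for threads that fall outside the matrix.
__global__ void copy_upper_to_lower_checked(int n, int* a)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // destination column
    int iy = blockIdx.y * blockDim.y + threadIdx.y;  // destination row
    if (ix < n && iy < n && iy > ix)
    {
        a[iy * n + ix] = a[ix * n + iy];  // lower (iy, ix) <- upper (ix, iy)
    }
}

// Hypothetical host wrapper: copy m to the device, run the kernel, copy back.
void copy_upper_to_lower(std::vector<int>& m, int n)
{
    int* d_a = nullptr;
    size_t bytes = static_cast<size_t>(n) * n * sizeof(int);
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMemcpy(d_a, m.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);  // assumed block size
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    copy_upper_to_lower_checked<<<grid, block>>>(n, d_a);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    CUDA_CHECK(cudaMemcpy(m.data(), d_a, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_a));
}

int main()
{
    const int n = 5;
    std::vector<int> m(n * n);
    for (int i = 0; i < n * n; ++i) m[i] = i + 1;  // distinct nonzero values

    copy_upper_to_lower(m, n);

    // Print the result so the mirrored lower triangle can be inspected.
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) std::printf("%3d ", m[r * n + c]);
        std::printf("\n");
    }
    return 0;
}
```

Since the matrix is square, `dimx` and `dimy` are equal, so the indexing here should be equivalent to the kernel above. If this sketch gives the expected mirrored matrix on the same machine, that would suggest the zeros come from the host side (allocation, copies, or the launch configuration) rather than from the kernel body.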

Any help you can offer is gratefully received.