Optimizing 2-D CUDA code

I have a 2-D iterative code that works, but very, very slowly (I timed it, and it seems to be running at several hundred kiloflops, maybe a megaflop). Does anyone see something wrong with it, or is the fault perhaps with the computer running it?

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void FILENAME(float *VAR_device, float *ANS_device, size_t pitch_A, size_t pitch_B, unsigned int stride, int N)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    /* Copy the left boundary column unchanged. */
    for (x = 0; x < 1; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = VAR_device[y*stride + x];
        }
    }
    /* Copy the right boundary column unchanged. */
    for (x = N-1; x < N; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = VAR_device[y*stride + x];
        }
    }
    /* Apply the 1-D smoothing stencil to the interior columns. */
    for (x = 1; x < N-1; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = 0.25*VAR_device[y*stride + x - 1]
                                     + 0.5*VAR_device[y*stride + x]
                                     + 0.25*VAR_device[y*stride + x + 1];
        }
    }
}

int main()
{
    int N = 16;

    float *ANS_device, *VAR_device;
    size_t size = N*sizeof(float);

    float VAR_host[N][N], ANS_host[N][N];

    size_t pitch_A, pitch_B;
    cudaMallocPitch((void **)(&ANS_device), &pitch_A, size, N);
    cudaMallocPitch((void **)(&VAR_device), &pitch_B, size, N);

    unsigned int stride;
    stride = pitch_A/sizeof(float);

    /* Hold the first column at 300, zero the rest of the grid. */
    for (int j = 0; j < 1; j++)
    {
        for (int i = 0; i < N; i++)
        {
            VAR_host[i][j] = 300;
        }
    }
    for (int j = 1; j < N; j++)
    {
        for (int i = 0; i < N; i++)
        {
            VAR_host[i][j] = 0;
        }
    }

    cudaMemcpy2D(ANS_device, pitch_A, VAR_host, size, size, N, cudaMemcpyHostToDevice);

    dim3 dimBlock(N/2, N/2);
    dim3 nBlocks(2, 2);

    int nIterations = 5000;
    for (int k = 0; k < nIterations; k++)
    {
        FILENAME <<< nBlocks, dimBlock >>> (ANS_device, VAR_device, pitch_A, pitch_B, stride, N);
        /* Swap the input and output buffers between iterations. */
        float *temp = ANS_device;
        ANS_device = VAR_device;
        VAR_device = temp;
    }

    cudaMemcpy2D(ANS_host, size, VAR_device, pitch_B, size, N, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            printf("%f ", ANS_host[i][j]);
        }
        printf("\n");
    }

    cudaFree(VAR_device);
    cudaFree(ANS_device);

    return 0;
}

Thanks in advance for any replies.

Two observations:

  1. You are processing a 16x16 matrix and launching 4 blocks of 64 threads, which is not enough work to keep even a very modest CUDA GPU sufficiently occupied. You really can’t expect to reach anything like peak performance with fewer than two blocks per multiprocessor and a thread count approaching the upper limit of threads per block.

  2. Your memory access patterns are totally uncoalesced, which is going to make your kernel very, very slow. (See the sketch after this list for one way to restructure it.)
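
To make that concrete, here is a rough sketch of the kind of restructuring I mean: one thread per output element, so that threads with consecutive x indices touch consecutive addresses and the loads and stores coalesce. The names (smooth, d_in, d_out) are placeholders, not a drop-in replacement for your program:

/* Sketch: one output element per thread. Consecutive threads in a
   warp access consecutive addresses, so global memory accesses coalesce. */
__global__ void smooth(const float *d_in, float *d_out, unsigned int stride, int N)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;

    if (x == 0 || x == N-1)
        d_out[y*stride + x] = d_in[y*stride + x];            /* boundary columns copied */
    else
        d_out[y*stride + x] = 0.25f*d_in[y*stride + x - 1]
                            + 0.5f *d_in[y*stride + x]
                            + 0.25f*d_in[y*stride + x + 1];  /* interior stencil */
}

/* 256-thread blocks, and enough blocks to cover the whole grid: */
dim3 dimBlock(16, 16);
dim3 nBlocks((N + dimBlock.x - 1)/dimBlock.x, (N + dimBlock.y - 1)/dimBlock.y);
smooth <<< nBlocks, dimBlock >>> (d_in, d_out, stride, N);

Even restructured like this, a 16x16 problem only fills a single block, so point 1 still applies: you won’t see the GPU stretch its legs until N is much larger.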

You should probably have a look at Chapter 5 of the programming guide, which discusses guidelines for achieving optimal performance. There was some good material posted here on optimization that you might want to look at as well.
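
As an aside, when you re-measure, time just the kernel loop with CUDA events rather than wall clock; something like this sketch (the 5-flops-per-element count is my estimate from your stencil: 3 multiplies plus 2 adds per interior point):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int k = 0; k < nIterations; k++)
{
    FILENAME <<< nBlocks, dimBlock >>> (ANS_device, VAR_device, pitch_A, pitch_B, stride, N);
    float *temp = ANS_device; ANS_device = VAR_device; VAR_device = temp;
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   /* wait for all the launches to finish */

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double flops = 5.0 * (N - 2) * N * (double)nIterations;
printf("%.1f ms, about %.2f Mflops\n", ms, flops / (ms * 1.0e3));

cudaEventDestroy(start);
cudaEventDestroy(stop);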

Well, that points me in the right direction. As if you couldn’t tell, I’m a novice with CUDA, so it’s a relief to know that my code isn’t completely wrong–just partly. Thanks very much for that quick reply.