Diffusion kernel not working. Please help!

Hello, I’m new to the forum and CUDA as well. I’m trying to solve the 3D diffusion problem in one of my projects. Below is my kernel. I tested it with a very simple 3 x 3 x 3 cube with the [1][1][1] element = 1.0. Everything else is 0.0. After 1 iteration, the output looks like this:

0
0
4.21299e+06
0
0.1
5.55869e+06
0
0
6.25582e+06
0
0.1
6.87205e+06
0.1
0.5
7.46127e+06
0
0.1
8.06317e+06
0
0
1.30519e+07
0
0
1.49227e+07
0
0
2.03995e+07

while the correct output should look like:
0
0
0
0
0.1
0
0
0
0
0
0.1
0
0.1
0.4
0.1
0
0.1
0
0
0
0
0
0.1
0
0
0
0

Also, here is how I launched the kernel: diffusion<<<2, BLOCK_SIZE>>>(d_matrixIn, d_matrixOut, numRows, numCols, layers, ce, cw, cn, cs, ct, cb, cc);

Please help! I’d really appreciate it.

global void diffusion(float *f1, float * f2, int nx, int ny, int nz,
float ce, float cw, float cn, float cs, float ct, float cb, float cc)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
int c = x + y * nx;
int xy = nx * ny;
// int j,jz,je,jn,jb,jw,js,jt;
for (int k = 0; k< nz; ++k)
{
int w = (x == 0) ? c : c - 1;
int e = (x == nx - 1) ? c : c + 1;
int s = (y == 0) ? c : c - nx;
int n = (y == ny - 1) ? c : c + nx;
int b = (k == 0) ? c : c - xy;
int t = (k == nz - 1) ? c : c + xy;

            f2[c] = cc * f1[c] + cw * f1[w] + ce * f1[e] + cs * f1[s] + cn * f1[n] + cb * f1[b] + ct * f1[t];

            c += xy;
    }

}

At first glance it looks like your launch configuration is wrong. You use BLOCK_SIZE but do not show what it represents.

In other words it looks like you are launching a thread block with the grid only in the X-dimension being greater than 1 (1 is default for y and z). So threadIdx.y is only ever 0 and as is blockIdx.y.

There probably are other errors, but it looks like that is one of them.

Use code blocks next time.

Thanks a lot. My BLOCK_SIZE is defined as 16. Can you please explain a little more about the dimensions?
I fixed the launch at <<<2, BLOCK_SIZE>>> only for this particular case (because there are only 27 points).

A super simple version would be something like this:

const int num_points=27;
#define BLOCK_SIZE 32

...in kernel{

int x = blockDim.x * blockIdx.x + threadIdx.x;//0 to 31
int y = blockIdx.y;//0 to 26

if(x<num_points){

}
}

for launch from host:

dim3 grid((num_points+BLOCK_SIZE-1)/BLOCK_SIZE,num_points,1);
diffusion<<<grid,BLOCK_SIZE>>>(..

Or you could treat y the same way as x, but you need to use a 2-D threads dim3 variable rather than a constant scalar. There are a great number of ways to implement this as the number of points gets larger, though this is enough to answer your question.
NOTE: I did not check this to verify, but the general idea should be clear. Google other CUDA code for examples

It works. Thanks a bunch.