Hello, I’m new to the forum and CUDA as well. I’m trying to solve the 3D diffusion problem in one of my projects. Below is my kernel. I tested it with a very simple 3 x 3 x 3 cube with the [1][1][1] element = 1.0. Everything else is 0.0. After 1 iteration, the output looks like this:

0
0
4.21299e+06
0
0.1
5.55869e+06
0
0
6.25582e+06
0
0.1
6.87205e+06
0.1
0.5
7.46127e+06
0
0.1
8.06317e+06
0
0
1.30519e+07
0
0
1.49227e+07
0
0
2.03995e+07

while the correct output should look like:
0
0
0
0
0.1
0
0
0
0
0
0.1
0
0.1
0.4
0.1
0
0.1
0
0
0
0
0
0.1
0
0
0
0

Also, here is how I launched the kernel: diffusion<<<2, BLOCK_SIZE>>>(d_matrixIn, d_matrixOut, numRows, numCols, layers, ce, cw, cn, cs, ct, cb, cc);

global void diffusion(float *f1, float * f2, int nx, int ny, int nz,
float ce, float cw, float cn, float cs, float ct, float cb, float cc)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
int c = x + y * nx;
int xy = nx * ny;
// int j,jz,je,jn,jb,jw,js,jt;
for (int k = 0; k< nz; ++k)
{
int w = (x == 0) ? c : c - 1;
int e = (x == nx - 1) ? c : c + 1;
int s = (y == 0) ? c : c - nx;
int n = (y == ny - 1) ? c : c + nx;
int b = (k == 0) ? c : c - xy;
int t = (k == nz - 1) ? c : c + xy;

``````            f2[c] = cc * f1[c] + cw * f1[w] + ce * f1[e] + cs * f1[s] + cn * f1[n] + cb * f1[b] + ct * f1[t];

c += xy;
}
``````

}

At first glance it looks like your launch configuration is wrong. You use BLOCK_SIZE but do not show what it represents.

In other words it looks like you are launching a thread block with the grid only in the X-dimension being greater than 1 (1 is default for y and z). So threadIdx.y is only ever 0 and as is blockIdx.y.

There probably are other errors, but it looks like that is one of them.

Use code blocks next time.

Thanks a lot. My BLOCK_SIZE is defined as 16. Can you please explain a little more about the dimensions?
I fixed the launch at <<<2, BLOCK_SIZE>>> only for this particular case (because there are only 27 points).

A super simple version would be something like this:

``````const int num_points=27;
#define BLOCK_SIZE 32

...in kernel{

int x = blockDim.x * blockIdx.x + threadIdx.x;//0 to 31
int y = blockIdx.y;//0 to 26

if(x<num_points){

}
}

for launch from host:

dim3 grid((num_points+BLOCK_SIZE-1)/BLOCK_SIZE,num_points,1);
diffusion<<<grid,BLOCK_SIZE>>>(..
``````

Or you could treat y the same way as x, but you need to use a 2-D threads dim3 variable rather than a constant scalar. There are a great number of ways to implement this as the number of points gets larger, though this is enough to answer your question.
NOTE: I did not check this to verify, but the general idea should be clear. Google other CUDA code for examples

It works. Thanks a bunch.