Question about Block and Thread Organization dimBlock.x, dimBlock.y, dimGrid, dimBlock

Hello everyone,

I am new to CUDA C GPGPU programming and found an example in the following PDF file:

http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf (the example is on page 22).

I have also copied the code into this thread.

#include <cstdlib>   // EXIT_SUCCESS

const int N = 1024;
const int blocksize = 16;

// Each thread adds one element of the N x N matrices.
__global__ void add_matrix( float* a, float* b, float* c, int N )
{
     int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
     int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
     int index = i + j*N;
     if ( i < N && j < N )
          c[index] = a[index] + b[index];
}

int main() {
     // Host arrays
     float *a = new float[N*N];
     float *b = new float[N*N];
     float *c = new float[N*N];
     for ( int i = 0; i < N*N; ++i ) {
          a[i] = 1.0f; b[i] = 3.5f;
     }

     // Device arrays
     float *ad, *bd, *cd;
     const int size = N*N*sizeof(float);
     cudaMalloc( (void**)&ad, size );
     cudaMalloc( (void**)&bd, size );
     cudaMalloc( (void**)&cd, size );

     cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
     cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

     // 16 x 16 threads per block, 64 x 64 blocks per grid
     dim3 dimBlock( blocksize, blocksize );
     dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
     add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

     cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

     cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
     delete[] a; delete[] b; delete[] c;
     return EXIT_SUCCESS;
}

I understand everything except the given block and grid parameters.

dim3 dimBlock( blocksize, blocksize ); // Means to me: 16*16 = 256 threads per block. Is that right?

dim3 dimGrid( N/dimBlock.x, N/dimBlock.y ); // Means to me: N/dimBlock.x = 1024/16 = 64 and N/dimBlock.y = 64 → 64*64 = 4096 blocks per grid. Right?

But why such a big 2D grid? I would have 256*4096 = 1,048,576 threads with that grid, but my problem size is only 1024. So why did the author choose such a big grid, or did I misunderstand something in his calculation?

I hope you can shed some light on this.

Your problem size is N*N, not N.
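To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) that reproduces the numbers from the launch configuration. LaunchShape and launch_shape are hypothetical names made up for illustration, not part of the CUDA API, and the sketch assumes N is a multiple of blocksize, as 1024/16 is.

```cpp
#include <cassert>

// Hypothetical helper type: the sizes implied by a square 2D launch.
struct LaunchShape {
    int threads_per_block;    // dimBlock.x * dimBlock.y
    int blocks_per_grid;      // dimGrid.x * dimGrid.y
    long long total_threads;  // threads launched overall
};

// Mirrors dim3 dimBlock(blocksize, blocksize) and
// dim3 dimGrid(n / blocksize, n / blocksize) from the example.
// Assumption: n is a multiple of blocksize.
LaunchShape launch_shape(int n, int blocksize) {
    assert(blocksize > 0 && n % blocksize == 0);
    int grid_side = n / blocksize;  // blocks along one side of the grid
    LaunchShape s;
    s.threads_per_block = blocksize * blocksize;
    s.blocks_per_grid = grid_side * grid_side;
    s.total_threads =
        static_cast<long long>(s.threads_per_block) * s.blocks_per_grid;
    return s;
}
```

Calling launch_shape(1024, 16) gives 256 threads per block and 4096 blocks, so the total thread count equals N*N: one thread per matrix element.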

Oh yes, I lost the second dimension.
OK, then it makes sense why I have so many blocks.

Thanks.

OK, one more question about the CUDA-specific thread/block parameters:

In my example I replaced the following sequential code with the CUDA parallel code above:

void add_matrix (float* a, float* b, float* c, int N) {
    int index;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            index = i + j*N;
            c[index] = a[index] + b[index];
        }
    }
}

OK, there I had two for loops running from 0-1024 each. In the CUDA code they are replaced with these parameters:

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

So how are they replaced at runtime so that I get 0-1024 back?

blockDim.x and blockDim.y should be 16 because of the kernel call, right? Each block is 2D with 16*16 = 256 threads. So threadIdx.x and threadIdx.y would be 0-16, right?

blockIdx.x and blockIdx.y would be 0-64 each, because of the kernel call. Still right?

So for int i: 0-64 * 16 + 0-256

For blockIdx.x = 0, I would get threads 0-256 → 0 * 16 + 0-256

For blockIdx.x = 1, I would get overlapping threads → 1 * 16 + 0-256

Where am I going wrong? :(

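The indices run from 0…63 (blockIdx.x, blockIdx.y) and 0…15 (threadIdx.x, threadIdx.y), respectively. So i = blockIdx.x * 16 + threadIdx.x covers 0…1023 with no overlap.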
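One way to convince yourself that the blocks do not overlap is to replay the index formula on the host. The following is only a sketch in plain C++; count_hits is a hypothetical helper that loops over every (blockIdx.x, threadIdx.x) pair and counts how often each value of i is produced.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of i = blockIdx.x * blockDim.x + threadIdx.x.
// Returns, for each possible i, how many (block, thread) pairs produce it
// when blockIdx.x runs over 0..grid_dim-1 and threadIdx.x over 0..block_dim-1.
std::vector<int> count_hits(int grid_dim, int block_dim) {
    assert(grid_dim > 0 && block_dim > 0);
    std::vector<int> hits(grid_dim * block_dim, 0);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            int i = block_idx * block_dim + thread_idx;  // same formula as the kernel
            ++hits[i];
        }
    }
    return hits;
}
```

With count_hits(64, 16) the result has 1024 entries and each one is hit exactly once: block 0 yields i = 0…15, block 1 yields i = 16…31, and so on. The same reasoning applies to j in the y direction.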

OK, thank you.

One more question, tera:

If I launch a kernel like my example:

    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

I always have to put all my threads in blocks (here I put 256 threads in each block), and I have to put enough blocks in the grid so that all threads can be computed.

So what are the advantages and disadvantages of creating 1D, 2D or 3D blocks and 1D or 2D grids?

Why do I have to organize those parameters? Why can't I just tell the kernel how many threads I will have and let it organize them into appropriate blocks by itself?

I'm asking myself what strategies there are to organize threads in blocks (1D, 2D, 3D?) and blocks in the grid.

Hello,

One obvious reason is the case where you need more than one index, like a matrix with 4 dimensions M(i,j,k,l). With 2D and 3D grids you can get more indices. Second, there is a limit on the number of blocks you can submit per grid dimension. If you have more than 65535 blocks (the limit on devices of compute capability 2.x and earlier), you have to use a 2D grid of blocks.

Check the CUDA page on Wikipedia.
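As a rough illustration of the second point, this host-side C++ sketch folds a large block count into a 2D grid and shows how a flat block number can be recovered. Grid2D, fold_grid, linear_block and kMaxGridDim are hypothetical names, and the 65535 limit per grid dimension is an assumption; on a real device, query maxGridSize through cudaGetDeviceProperties.

```cpp
#include <cassert>

// Assumed per-dimension grid limit (hypothetical constant for this sketch).
const long long kMaxGridDim = 65535;

struct Grid2D { long long x; long long y; };

// Choose a 2D grid holding at least num_blocks blocks,
// with each side within the assumed limit.
Grid2D fold_grid(long long num_blocks) {
    assert(num_blocks > 0 && num_blocks <= kMaxGridDim * kMaxGridDim);
    Grid2D g;
    g.x = num_blocks < kMaxGridDim ? num_blocks : kMaxGridDim;
    g.y = (num_blocks + g.x - 1) / g.x;  // round up so no block is lost
    return g;
}

// What a kernel would compute to get a flat block number back:
// blockIdx.y * gridDim.x + blockIdx.x.
long long linear_block(long long block_x, long long block_y, const Grid2D& g) {
    return block_y * g.x + block_x;
}
```

Because g.y is rounded up, the grid can contain a few more blocks than requested, so a kernel using this scheme still needs a bounds check like the if ( i < N && j < N ) in the example.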

OK, I can get more indices, I understand.
But is there a difference if I need 4096 threads and I use
2D blocks with 512 * 8 threads
or
3D blocks with 16 * 16 * 16?

Is there a difference between the various organizations?

Well, you cannot have 4096 threads per block on any device yet; the maximum is 1024. 1D blocks are better unless you need multiple indices, and you should use 2D grids if you have more than 65535 blocks.
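A quick way to check a proposed block shape is a small host-side test like the sketch below. block_fits is a hypothetical helper, and the limits in it are assumptions: 1024 threads per block as stated above, plus per-dimension maxima of 1024, 1024 and 64. The real values for a given device come from cudaGetDeviceProperties (maxThreadsPerBlock and maxThreadsDim).

```cpp
#include <cassert>

// Returns true if a block of x * y * z threads fits the assumed limits.
bool block_fits(int x, int y, int z) {
    assert(x > 0 && y > 0 && z > 0);
    const int kMaxThreadsPerBlock = 1024;              // assumed total limit
    const int kMaxX = 1024, kMaxY = 1024, kMaxZ = 64;  // assumed per-dimension limits
    if (x > kMaxX || y > kMaxY || z > kMaxZ) return false;
    return static_cast<long long>(x) * y * z <= kMaxThreadsPerBlock;
}
```

Both shapes from the question, block_fits(512, 8, 1) and block_fits(16, 16, 16), come out false because each holds 4096 threads. The 4096 threads would have to be spread over several blocks, for example 16 blocks of 16 * 16 threads.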