Question about Block and Thread Organization dimBlock.x, dimBlock.y, dimGrid, dimBlock

JoDo · March 17, 2012, 5:09pm

Hello everyone,

I am new to CUDA C GPGPU programming and found an example in the following PDF file:

http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf (the example is on page 22).

I have also copied the code into this thread.

#include <cstdlib>   // EXIT_SUCCESS

const int N = 1024;        // the matrices are N x N
const int blocksize = 16;  // threads per block along each dimension

// Each thread adds one element of a and b and stores the result in c.
__global__ void add_matrix( float *a, float *b, float *c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global y index
    int index = i + j*N;
    if ( i < N && j < N )  // skip threads that fall outside the matrix
        c[index] = a[index] + b[index];
}

int main() {
    // Host matrices
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f; b[i] = 3.5f;
    }

    // Device matrices
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );

    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    // 16 x 16 threads per block, N/16 x N/16 blocks per grid
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}

I understand everything except the given block and grid parameters.

dim3 dimBlock( blocksize, blocksize ); // To me this means 16*16 = 256 threads per block. Is that right?

dim3 dimGrid( N/dimBlock.x, N/dimBlock.y ); // To me this means N/dimBlock.x = 1024/16 = 64 and N/dimBlock.y = 64, so 64*64 = 4096 blocks per grid. Right?

But why such a big 2D grid? With that grid I would have 256*4096 = 1,048,576 threads, but my problem size is only 1024. So why did the author choose such a big grid, or did I misunderstand something in his calculation?

I hope you can shed some light on this.

tera · March 17, 2012, 5:14pm

Your problem size is N*N, not N.
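The arithmetic behind this answer can be checked on the host. The following is a minimal sketch in plain C++ (no GPU needed); the helper names blocks_for, LaunchShape and shape_for are made up for illustration and are not part of CUDA. It uses ceiling division instead of the plain N/dimBlock.x from the example, which gives the same result when N is a multiple of the block size and one extra block otherwise.

```cpp
#include <cstddef>

// Hypothetical helper: blocks needed along one dimension to cover n
// elements with `block` threads per block (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t block) {
    return (n + block - 1) / block;
}

// Hypothetical summary of a 2D launch over an n x n matrix.
struct LaunchShape {
    std::size_t threads_per_block;  // blockDim.x * blockDim.y
    std::size_t blocks_per_grid;    // gridDim.x * gridDim.y
    std::size_t total_threads;      // threads_per_block * blocks_per_grid
};

// Shape of a launch with blocksize x blocksize threads per block,
// as in the add_matrix example.
LaunchShape shape_for(std::size_t n, std::size_t blocksize) {
    LaunchShape s;
    s.threads_per_block = blocksize * blocksize;
    s.blocks_per_grid = blocks_for(n, blocksize) * blocks_for(n, blocksize);
    s.total_threads = s.threads_per_block * s.blocks_per_grid;
    return s;
}
```

For N = 1024 and blocksize = 16, shape_for(1024, 16) yields 256 threads per block, 4096 blocks and 1,048,576 threads in total, which is exactly N*N: one thread per matrix element.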

JoDo · March 17, 2012, 5:24pm

Oh yes, I lost the second dimension.
OK, then it makes sense why I have so many blocks …

Thanks.

JoDo · March 29, 2012, 9:00am

OK, one more question about the CUDA-specific thread/block parameters:

In my example I replaced the following sequential code with the CUDA parallel code above:

void add_matrix (float* a, float* b, float* c, int N) {
    int index;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            index = i + j*N;
            c[index] = a[index] + b[index];
        }
}

OK, there I had two for-loops, each running from 0 to 1024. In the CUDA code they are replaced by these parameters:

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

So how are they evaluated at runtime so that I get 0-1024 back?

blockDim.x and blockDim.y should be 16 because of the kernel call, right? One block is 2D with 16*16 = 256 threads. So threadIdx.x and .y would be 0-16, right?

blockIdx.x and .y would be 0-64 each, because of the kernel call. Still right?

So for int i: 0-64 * 16 + 0-256

For blockIdx.x = 0, I would get threads 0-256 → 0 * 16 + 0-256

For blockIdx.x = 1, I would get overlapping threads → 1 * 16 + 0-256

Where am I going wrong? :(

tera · March 29, 2012, 11:00am

Block indices run from 0…63 and thread indices from 0…15, respectively.
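With those ranges, threadIdx.x is a position along x inside one block (0…15), not a number among all 256 threads of the block, so neighbouring blocks do not overlap. The following is a minimal host-side sketch in plain C++ that replays the index formula for one dimension; the function names are made up for illustration, the parameters only mirror CUDA's built-in variables, and nothing is launched on a GPU.

```cpp
#include <vector>

// The formula from the kernel, for one dimension.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Counts how often each global index is produced when block_idx runs
// over 0 .. grid_dim-1 and thread_idx runs over 0 .. block_dim-1.
std::vector<int> count_global_indices(int grid_dim, int block_dim) {
    std::vector<int> hits(grid_dim * block_dim, 0);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx)
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx)
            ++hits[global_index(block_idx, block_dim, thread_idx)];
    return hits;
}

// True if every index 0 .. grid_dim*block_dim-1 is produced exactly once.
bool covers_exactly_once(int grid_dim, int block_dim) {
    for (int h : count_global_indices(grid_dim, block_dim))
        if (h != 1) return false;
    return true;
}
```

For the example launch, covers_exactly_once(64, 16) holds: block 0 produces i = 0…15, block 1 produces i = 16…31, and block 63 ends at i = 1023. The same reasoning applies to j with the .y components.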

JoDo · March 29, 2012, 5:53pm

OK, thank you.

JoDo · April 21, 2012, 1:38pm

One more question, tera:

If I launch a kernel as in my example:

    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

I always have to put all my threads into blocks (here I put 256 threads in each block), and I have to put enough blocks in the grid so that all threads can be computed.

So what are the advantages and disadvantages of creating 1D, 2D or 3D blocks and 1D or 2D grids?

Why do I have to organize those parameters? Why can't I just tell the kernel how many threads I will have and let it organize them into appropriate blocks by itself?

I am asking myself what strategies there are for organizing threads in blocks (1D, 2D, 3D?) and blocks in the grid …

pasoleatis · April 21, 2012, 8:44pm

Hello,

One obvious reason is the case where you need more than one index, like a matrix with 4 dimensions M(i,j,k,l). With 2D and 3D grids and blocks you get more indices. Second, there is a limit on the number of blocks you can submit. If you have more than 65,535 blocks (the limit per grid dimension on current devices), you have to use a 2D grid of blocks.

Check the CUDA page on Wikipedia.
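As a rough illustration of the second point, here is a host-side sketch in plain C++ that spreads a large number of blocks over a 2D grid. It assumes a per-dimension limit of 65535 (the exact value behind the figure quoted above; on a real device it can be read from the maxGridSize field of cudaDeviceProp). The names kMaxGridDim, Grid2D, split_blocks and linear_block_id are made up for illustration.

```cpp
#include <cstddef>

// Assumed per-dimension grid limit (not queried from a device here).
const std::size_t kMaxGridDim = 65535;

// Hypothetical stand-in for the x and y components of a dim3 grid.
struct Grid2D {
    std::size_t x;
    std::size_t y;
};

// Spreads total_blocks over a 2D grid so that x stays within the limit.
// The grid may hold a few surplus blocks (x * y >= total_blocks), which
// a kernel would have to skip with a bounds check. Assumes
// total_blocks <= kMaxGridDim * kMaxGridDim, so that y fits as well.
Grid2D split_blocks(std::size_t total_blocks) {
    Grid2D g;
    if (total_blocks <= kMaxGridDim) {
        g.x = total_blocks;  // a 1D grid is enough
        g.y = 1;
        return g;
    }
    g.y = (total_blocks + kMaxGridDim - 1) / kMaxGridDim;  // rows needed
    g.x = (total_blocks + g.y - 1) / g.y;                  // blocks per row
    return g;
}

// Linear block number, as a kernel could compute it from
// blockIdx.x, blockIdx.y and gridDim.x.
std::size_t linear_block_id(std::size_t block_x, std::size_t block_y,
                            std::size_t grid_x) {
    return block_y * grid_x + block_x;
}
```

For example, split_blocks(100000) gives a 50000 x 2 grid, while the 4096 blocks of the matrix example would still fit in a 1D grid. Inside a kernel, the linear block number replaces blockIdx.x of the 1D case, and blocks whose number is not below the real block count return early.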

JoDo · April 21, 2012, 9:25pm

OK, I can get more indices, I understand.
But is there a difference if I need 4096 threads and I use
2D blocks with 512 * 8 threads
or
3D blocks with 16 * 16 * 16 threads?

Is there a difference between the various organizations?

pasoleatis · April 22, 2012, 8:09am

Well, you cannot have 4096 threads per block on any device yet; the maximum is 1024. 1D blocks are better unless you need the extra indices, and you should use 2D grids if you have more than 65,535 blocks.
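The block-size limit is on the product of the block dimensions, so both shapes from the question fail for the same reason. Below is a small host-side sketch in plain C++; threads_in_block and block_fits are made-up names, and the default of 1024 is simply the figure quoted above (on a real device the value is the maxThreadsPerBlock field of cudaDeviceProp). Real devices also limit each dimension separately, which this sketch ignores.

```cpp
// Number of threads in a block of shape x * y * z
// (unused dimensions are 1, as with dim3).
unsigned threads_in_block(unsigned x, unsigned y = 1, unsigned z = 1) {
    return x * y * z;
}

// True if the block shape stays within the per-block thread limit.
// max_threads defaults to 1024, an assumption taken from the post above.
bool block_fits(unsigned x, unsigned y, unsigned z,
                unsigned max_threads = 1024) {
    return threads_in_block(x, y, z) <= max_threads;
}
```

Here block_fits(512, 8, 1) and block_fits(16, 16, 16) are both false, since each shape has 4096 threads, while the 16 x 16 block of the matrix example (256 threads) fits. To run 4096 threads in total, they have to be spread over several blocks, for instance 16 blocks of 256 threads.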
