Why use three-dimensional threads?

Hello All,

Can anybody help me understand why to use three-dimensional threads?

If I am able to do some job in a single dimension, why should I do three-dimensional tasks?

For example, matrix multiplication has been done in two dimensions, something like this -

__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        int Melement = Md[threadIdx.y * Width + k];
        int Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

with grid and block definitions -

dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

But I can also do the same task by defining a one-dimensional block size -

dim3 dimBlock(Width);
dim3 dimGrid(Width);

and with kernel code -

__global__ void matrixmultiplicationKernel(int *d_a, int *d_b, int M, int N, int P, int *d_f)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    // Perform multiplication
    int sum = 0;

    for (int k = 0; k < N; ++k)
        sum += d_a[blockIdx.x * blockDim.x + k] * d_b[blockDim.x * k + threadIdx.x];

    d_f[j] = sum;
}


This is a purely one-dimensional launch doing the same task.

Please help me understand why dimensions are necessary.

Thanks in Advance,

It just helps with certain types of 3D problems… that's all.

If you don't need it, don't use it…

I never use 2D or 3D blocks.
If you have a lot of device functions or methods in objects, they are either

  • prepared for 1D operation only, or
  • burdened with some compute overhead to check the dimensionality of the kernel.

I wish there was a neat way to flatten the kernel when needed. The only way to do it, that I am aware of, is to simply compute and store the flat threadID in a variable and pass it to every single function. But that is ugly!
Maybe you have some idea or experience in this matter?

The performance of your code should generally be totally independent of your block dimensions. It only depends on the number of threads you have chosen for each block. The SMs don't care…

Having 2D & 3D blocks is just an abstraction that can be helpful to you as a programmer.

But the code of some basic device functions may depend on it. Consider for example the following block-wise copy:

__device__ void memCopy(int *destination, int *source, int size) {
  for (int i = threadIdx.x; i < size; i += blockDim.x)
    destination[i] = source[i];
}

Useful, simple, fast. However, it works with 1D blocks only. If you want to support the 2D and 3D cases, you have to implement:

__device__ void memCopy(int *destination, int *source, int size) {
  for (int i = threadIdx.x + blockDim.x * threadIdx.y + blockDim.x * blockDim.y * threadIdx.z;
       i < size;
       i += blockDim.x * blockDim.y * blockDim.z)
    destination[i] = source[i];
}

The second implementation will work in every case, but if you use it with a 1D block, it will perform unnecessary multiplications and consume more registers than needed.