Max Dimension of GridSize and BlockSize

Hi all,

I have a question concerning the dimensions of block size and grid size. Why am I not able to define

dim3 dimBlock (512,1,1);
dim3 dimGrid (1,1024,1024);

I have the following graphics card:

CUDA Device #0
Major revision number: 2
Minor revision number: 1
Name: GeForce GT 425M
Total global memory: 1008271360
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1120000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: No
Kernel concurrent execution: Yes
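
For reference, a dump like the one above can be produced with cudaGetDeviceProperties; a minimal sketch of the query (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // CUDA Device #0

    printf("Name: %s\n", prop.name);
    printf("Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Maximum dimensions of block: %d, %d, %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Maximum dimensions of grid: %d, %d, %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}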

Many thanks for your reply :)

I never tried so many blocks before.

Can you try

dim3 dimGrid(1024, 1024, 1);

and

dim3 dimGrid(1, 256, 256);

and tell us the result?

I configured my kernel with

dim3 dimGrid(1024, 1024, 1);

and

dim3 dimGrid(1, 256, 256);

and both times I got a first-chance exception:

First-chance exception at 0x757fb727 in Test.exe: Microsoft C++ exception: cudaError at memory location 0x002fcc30…

First-chance exception at 0x757fb727 in Test.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000.

but the configuration

dim3 dimGrid(1,256,1);

works. Really strange :unsure:

I thought only the total number of threads per block was limited, i.e. blockDim.x * blockDim.y * blockDim.z <= 1024.
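
A configuration could be checked against the limits above before launching; a minimal sketch, with a made-up helper name:

#include <cuda_runtime.h>

// hypothetical helper: does this launch configuration fit the device limits?
bool fitsDeviceLimits(dim3 grid, dim3 block, int device)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, device);

    bool blockOk = block.x <= (unsigned)p.maxThreadsDim[0]
                && block.y <= (unsigned)p.maxThreadsDim[1]
                && block.z <= (unsigned)p.maxThreadsDim[2]
                && block.x * block.y * block.z <= (unsigned)p.maxThreadsPerBlock;

    bool gridOk = grid.x <= (unsigned)p.maxGridSize[0]
               && grid.y <= (unsigned)p.maxGridSize[1]
               && grid.z <= (unsigned)p.maxGridSize[2];

    return blockOk && gridOk;
}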

It’s probably not the grid size that’s causing the problem. How do you allocate your host and device memory?

I allocate the host memory with:

CUDA_SAFE_CALL( cudaMallocHost( (void**) &h_A, sizeofA ));

and the device memory with

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_A, sizeofA ));

The memcpy is done with

cudaMemcpyAsync(d_A, h_A, sizeofA , cudaMemcpyHostToDevice);
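
Since h_A comes from cudaMallocHost it is pinned memory, which is what allows cudaMemcpyAsync to overlap with other work at all; a minimal sketch with an explicit stream, just for illustration:

cudaStream_t stream;
cudaStreamCreate(&stream);

// a pinned host buffer (from cudaMallocHost) permits a truly asynchronous copy
cudaMemcpyAsync(d_A, h_A, sizeofA, cudaMemcpyHostToDevice, stream);

// ... kernel launches on the same stream would go here ...

cudaStreamSynchronize(stream);   // wait until the copy (and any kernels) have finished
cudaStreamDestroy(stream);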

Just for testing, the kernel only computes indices:

__global__ void test(double* a, double* b, double* c)
{
	// index arithmetic only; nothing is read from or written to memory in this test
	const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
	const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;

	const unsigned int tid = threadIdx.x * 400 + tidy;
	const unsigned int row = tidx * 400;
	const unsigned int col = tidy * 400;
}

I call the function with

dim3 dimBlock(512,1,1);
dim3 dimGrid(1,1024,1024);

test<<<dimBlock, dimGrid>>>(d_A, d_B, d_C);
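
One way to surface the actual cudaError behind those first-chance exceptions is to check right after the launch; a minimal sketch (assuming <cstdio> is included):

cudaError_t err = cudaGetLastError();     // reports invalid launch configurations
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();            // reports errors during kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));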

The first problem I see is with your index calculation. When you use dimGrid(1,1024,1024), you are using the Y and Z dimensions of the grid, but your tidx and tidy never use the Z index.
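
With dimGrid(1,1024,1024), the index calculation would need something along these lines (just a sketch):

const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;   // second grid dimension
const unsigned int tidz = blockDim.z * blockIdx.z + threadIdx.z;   // third grid dimension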

I also doubt that what you are trying to do with tidx * 400 is going to work; I don't think you have that much memory to work with. How do you calculate sizeofA? And how do you use tidx, tidy, tid, row, and col?

First, thanks for your reply!!

For better understanding, let me explain the main idea:

In theory

  1. I generate a 3D cube, where each element is the product of one element of input matrix a and one element of input matrix b.

  2. After generation, I reduce the cube from 3D to 2D by summing the elements in the z direction into c.
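
(If I read the code below correctly, the two steps together compute c[i][j] = sum over k = 0 .. NMAX-1 of a[i*NMAX + k] * b[j*NMAX + k], i.e. each output element is the dot product of one row of a with one row of b.)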

In practice

According to the CUDA specification I'm not allowed to specify more than 64 threads in the z direction, therefore I use the x dimension to sum up the elements.

Here is my code, where NMAX = 400 (the height of the cube) and height = width = 1024 (the size of the matrix).

Please also see my attachment.

The size of a and b is NMAX * width * sizeof(double), and the size of c is width * width * sizeof(double).

dim3 dimBlock(512,1,1);
dim3 dimGrid(1,width,width);

__global__ void test(double* a, double* b, double* c, int width)
{
	const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
	const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;
	const unsigned int tidz = blockDim.z * blockIdx.z + threadIdx.z;

	const unsigned int tid = tidy * width + tidz;
	const unsigned int row = tidy * NMAX;   // NMAX = 400, a compile-time constant
	const unsigned int col = tidz * NMAX;

	__shared__ double s_a[NMAX];
	__shared__ double s_b[NMAX];
	__shared__ double s_sum[NMAX];

	if(threadIdx.x < NMAX)
	{
		s_a[threadIdx.x] = a[row + threadIdx.x];
		s_b[threadIdx.x] = b[col + threadIdx.x];
		__syncthreads();

		// calculate each element in the cube
		s_sum[threadIdx.x] = s_a[threadIdx.x] * s_b[threadIdx.x];
		__syncthreads();

		if(threadIdx.x == 0)
		{
			double val = 0;

			// calculate sum
#pragma unroll
			for(int i = 0; i < NMAX; i++)
			{
				val += s_sum[i];
			}

			// store the value in the 2D matrix
			c[tid] = val;
		}
	}
}

[Attachment: cube.png]

Your indexing is thoroughly confusing.

There is no need to use the Z dimension of the grid. And if you're just trying to do a simple multiplication of 1024-by-400 matrices, you're doing it in a very wrong manner. Please check the SDK's matrix multiplication example and the whitepaper that comes with it for more information on how to implement matrix multiplication efficiently.
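
The core idea of that example is to tile the multiplication through shared memory. A rough sketch of the standard pattern, not the SDK code itself (square width x width matrices, width assumed to be a multiple of TILE):

#define TILE 16

__global__ void matMulTiled(const double* A, const double* B, double* C, int width)
{
	__shared__ double sA[TILE][TILE];
	__shared__ double sB[TILE][TILE];

	const int row = blockIdx.y * TILE + threadIdx.y;
	const int col = blockIdx.x * TILE + threadIdx.x;

	double acc = 0.0;
	for (int t = 0; t < width / TILE; ++t)
	{
		// each thread loads one element of the current A and B tiles
		sA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
		sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
		__syncthreads();

		for (int k = 0; k < TILE; ++k)
			acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
		__syncthreads();
	}
	C[row * width + col] = acc;
}

// launch with one thread per output element:
// dim3 dimBlock(TILE, TILE);
// dim3 dimGrid(width / TILE, width / TILE);
// matMulTiled<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, width);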

Thanks for your advice, I'll take a closer look at the SDK matrix multiplication example.