A question about 3D grids

Hi, everyone,

I found that I cannot use the third dimension of a grid on my Tesla C2050 GPU, which has Compute Capability 2.0. According to the specification, the maximum size of each dimension of a grid should be 65535 x 65535 x 65535.

I just ran a simple test to see if I can use the 3rd dimension of a grid, like this:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

#include <cutil_inline.h>

using namespace std;

// Exactly one thread of one block writes 1 to d_t[0].
__global__ void test(float* d_t)
{
    if (threadIdx.x == 0 && blockIdx.x==0 && blockIdx.y==0 && blockIdx.z==0)
        d_t[0] = 1.0f;
}

int main( int argc, char** argv)
{
    // Use the device named on the command line, otherwise the fastest one.
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

    float* d_t = NULL;
    cutilSafeCall( cudaMalloc( (void**) &d_t, sizeof(float)) );

    // Start from 0 so an untouched value shows that no thread matched.
    float h_t[1] = { 0.0f };
    cutilSafeCall( cudaMemcpy(d_t, h_t, sizeof(float), cudaMemcpyHostToDevice) );

    dim3 test_blocks(2,2,2);
    dim3 test_threads(64);

    test<<< test_blocks, test_threads >>>(d_t);
    cutilCheckMsg("Kernel execution failed");

    cutilSafeCall( cudaMemcpy( h_t, d_t, sizeof(float), cudaMemcpyDeviceToHost) );

    return 0;
}



In the kernel, if I use the condition

if (threadIdx.x == 0 && blockIdx.x==0 && blockIdx.y==0 && blockIdx.z==0)

then I get h_t=1. However, if I change it to

if (threadIdx.x == 0 && blockIdx.x==0 && blockIdx.y==0 && blockIdx.z==1)

where the only difference is blockIdx.z==1, I get h_t=0.

It seems that the third dimension of the grid has to be 1 (that is, blockIdx.z is always 0), which contradicts the specification for compute capability 2.0, where the maximum size of the z dimension of a grid is 65535.

Does anyone have any ideas about this? Thanks in advance!

More info:

I ran the code on CentOS 5.5, and the compile command I used is:
nvcc test3Dgrid.cu -o t3g_gpu -I /home/huangchao/NVIDIA_GPU_Computing_SDK/C/common/inc -lcutil_x86_64 -L /home/huangchao/NVIDIA_GPU_Computing_SDK/C/lib


Which CUDA version are you using?

Hi Tera, thanks for your reply. Here is the output of deviceQuery:

Device 0: "Tesla C2050"
CUDA Driver Version: 4.0
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817720320 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 3.20, NumDevs = 1, Device = Tesla C2050

Just to be sure, add -arch=sm_20 when you compile. By default, nvcc compiles for compute capability 1.0, which does not support 3D grids of blocks.
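
If you want to confirm which architecture the device code was actually built for, a small check along these lines may help. This is only a sketch: the kernel name reportArch and the variable names are made up for illustration, and it relies on the __CUDA_ARCH__ macro that nvcc defines while compiling device code (100 for compute capability 1.0, 200 for 2.0, and so on).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: stores the architecture value the device code was
// compiled for (100 for compute capability 1.0, 200 for 2.0, ...).
__global__ void reportArch(int* d_arch)
{
#ifdef __CUDA_ARCH__
    *d_arch = __CUDA_ARCH__;
#endif
}

int main()
{
    int* d_arch = NULL;
    int h_arch = 0;

    cudaError_t err = cudaMalloc((void**)&d_arch, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    reportArch<<<1, 1>>>(d_arch);

    // Catch launch failures, e.g. no device code usable on this GPU.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_arch);
        return 1;
    }

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(&h_arch, d_arch, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_arch);
        return 1;
    }

    printf("Device code was compiled for __CUDA_ARCH__ = %d\n", h_arch);

    cudaFree(d_arch);
    return 0;
}
```

Building this once with and once without the -arch option should show whether the flag is reaching the compiler.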

You need to install CUDA runtime 4.0. Your deviceQuery output shows driver 4.0 but runtime 3.20.
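
For anyone who wants to check this from their own program rather than from deviceQuery, here is a rough sketch using cudaDriverGetVersion, cudaRuntimeGetVersion and cudaGetDeviceProperties. It assumes device 0 is the card of interest. Note that the reported maximum grid size by itself does not prove 3D grids will work: the deviceQuery output above already listed 65535 for z with the 3.20 runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0;
    int runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);

    // Versions are encoded as 1000 * major + 10 * minor (4.0 is 4000).
    printf("Driver version:  %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Runtime version: %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);

    // Assumption: device 0 is the GPU being tested.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("%s: compute capability %d.%d, max grid size %d x %d x %d\n",
           prop.name, prop.major, prop.minor,
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Following the advice in this thread: a runtime older than 4.0 is
    // a reason to upgrade before relying on blockIdx.z.
    if (runtimeVersion < 4000)
        printf("Runtime is older than 4.0; upgrade before using a 3D grid.\n");

    return 0;
}
```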

After I upgraded my CUDA driver, toolkit and SDK to the latest version and added -arch sm_20 to the compile line, it works.
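
In case it is useful to others, a slightly more thorough check than the single-thread test is to have every block record that it ran and then count them on the host. The sketch below uses the same 2 x 2 x 2 grid and 64 threads per block as the original test; the kernel name markBlocks and the other names are made up for illustration, and it uses plain runtime calls instead of the cutil helpers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread 0 of every block marks its own slot, indexed by the block's
// flattened (x, y, z) position in the grid.
__global__ void markBlocks(int* d_seen)
{
    if (threadIdx.x == 0) {
        int id = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z);
        d_seen[id] = 1;
    }
}

int main()
{
    const int bx = 2, by = 2, bz = 2;   // same grid shape as the original test
    const dim3 blocks(bx, by, bz);
    const dim3 threads(64);
    const int n = bx * by * bz;

    int* d_seen = NULL;
    cudaError_t err = cudaMalloc((void**)&d_seen, n * sizeof(int));
    if (err == cudaSuccess)
        err = cudaMemset(d_seen, 0, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "Setup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    markBlocks<<<blocks, threads>>>(d_seen);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_seen);
        return 1;
    }

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<int> h_seen(n, 0);
    err = cudaMemcpy(&h_seen[0], d_seen, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_seen);
        return 1;
    }

    // Report every block that never wrote its slot.
    int missing = 0;
    for (int i = 0; i < n; ++i) {
        if (!h_seen[i]) {
            printf("Block (%d, %d, %d) did not run\n",
                   i % bx, (i / bx) % by, i / (bx * by));
            ++missing;
        }
    }
    printf("%d of %d blocks ran\n", n - missing, n);

    cudaFree(d_seen);
    return missing == 0 ? 0 : 1;
}
```

If the z dimension is being ignored, the blocks with z = 1 should show up as missing, or the launch itself should report an error.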

Thank you guys for your help! I really appreciate it!