ThreadIdx vs BlockIdx Performance

I am generating a 4D array using CUDA. When my kernel is

__global__ void getIndex(float* A, int w, int h, int d, int t)
{
  int i = blockIdx.x;
  int j = blockIdx.y;
  int k = blockIdx.z;
  A[t*h*w*d + h*w*i + w*j + k] = h*w*i+ w*j+ k;
  printf("kernel value = %f\n",A[t*h*w*d + h*w*i + w*j + k]);
}

It compiles and works, but after the following changes it no longer works and throws a pile of errors:

__global__ void getIndex(float* A, int w, int h, int d, int t)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int j = threadIdx.y + blockIdx.y * blockDim.y;
  int k = threadIdx.z + blockIdx.z * blockDim.z;
  A[t*h*w*d + h*w*i + w*j + k] = h*w*i+ w*j+ k;
  printf("kernel value = %f\n",A[t*h*w*d + h*w*i + w*j + k]);
}

My main program is as follows:

int main()
{
    unsigned w = 5, h = 6, d = 5, nt = 7;

    float *d_A;
    cudaMalloc((void **)&d_A, nt * w * h * d * sizeof(float));

    for (int t = 0; t < nt; t++)
    {
        dim3 blockDim(8, 8, 1);
        dim3 gridSize(((w + blockDim.x) / blockDim.x), ((h + blockDim.y) / blockDim.y), ((d + blockDim.z) / blockDim.z));
        getIndex<<<gridSize, blockDim>>>(d_A, w, h, d, t);
    }

    float *h_data = (float *)malloc(nt * w * h * d * sizeof(float));
    cudaMemcpy(h_data, d_A, nt * w * h * d * sizeof(float), cudaMemcpyDeviceToHost);

    return 0;
}

I also want to use multiple GPUs across multiple nodes for this task, but I am unable to make the code generic.
I know how to use multiple GPUs for a 1D array, but I do not see how to do it for a multi-dimensional array when using dim3.
https://github.com/Newbie-Coder-1105/NoviceChannel/tree/master/Cuda-Practice/multi_GPU_practice

Is there any way to perform such a task for a 3D array while using

dim3
cudaMalloc3DArray
texture memory for the 3d array

Why am I getting errors while using

threadIdx.x + blockIdx.x * blockDim.x

in the __global__ kernel?

How could I make it multi-node multi-GPU, without knowing the number of GPUs in the cluster system?

Your grid sizing is launching more threads than you need:

dim3 blockDim(8, 8, 1);
dim3 gridSize(((w + blockDim.x) / blockDim.x), ((h + blockDim.y) / blockDim.y), ((d + blockDim.z) / blockDim.z));

With w = 5, h = 6, d = 5 and an (8, 8, 1) block, that launch produces i and j values up to 7 and k values up to 5. Therefore, without any safeguards, your kernel will access the d_A array out-of-bounds. You need a thread “check” in the kernel like this:

int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;
int k = threadIdx.z + blockIdx.z * blockDim.z;
if ((i < w) && (j < h) && (k < d)) {
    A[t*h*w*d + h*w*i + w*j + k] = h*w*i + w*j + k;
    printf("kernel value = %f\n", A[t*h*w*d + h*w*i + w*j + k]);
}
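
As an aside, the usual idiom for sizing the grid is a ceiling division such as (w + blockDim.x - 1) / blockDim.x, which launches just enough blocks to cover each dimension; the leftover threads that still exist when the sizes are not multiples of the block shape are exactly what the check above handles. Here is a minimal host-side sketch of the corrected launch, reusing the guarded getIndex kernel above and your sizes, with error checking omitted:

#include <cstdio>
#include <cstdlib>

int main()
{
    unsigned w = 5, h = 6, d = 5, nt = 7;

    float *d_A;
    cudaMalloc((void **)&d_A, nt * w * h * d * sizeof(float));

    dim3 blockDim(8, 8, 1);
    // ceiling division: just enough blocks to cover w, h and d
    dim3 gridSize((w + blockDim.x - 1) / blockDim.x,
                  (h + blockDim.y - 1) / blockDim.y,
                  (d + blockDim.z - 1) / blockDim.z);

    for (int t = 0; t < nt; t++)
        getIndex<<<gridSize, blockDim>>>(d_A, w, h, d, t);

    float *h_data = (float *)malloc(nt * w * h * d * sizeof(float));
    cudaMemcpy(h_data, d_A, nt * w * h * d * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    free(h_data);
    return 0;
}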

If you want to split the work of this type of 3D kernel across multiple kernel launches (e.g. one per GPU), just divide your array logically along the t direction and assign some slices to each kernel launch, with an appropriate offset for each launch.
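
For example, here is a rough single-node multi-GPU sketch of that idea. It is only an illustration under a few assumptions: it reuses the guarded getIndex kernel from above, gives each GPU its own buffer holding only its slice range, splits the nt slices evenly across however many devices cudaGetDeviceCount reports, and omits error checking and the copy back to the host:

#include <vector>

int main()
{
    unsigned w = 5, h = 6, d = 5, nt = 7;

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);   // no need to know the GPU count in advance

    dim3 blockDim(8, 8, 1);
    dim3 gridSize((w + blockDim.x - 1) / blockDim.x,
                  (h + blockDim.y - 1) / blockDim.y,
                  (d + blockDim.z - 1) / blockDim.z);

    std::vector<float*> d_A(ngpus);
    for (int dev = 0; dev < ngpus; dev++)
    {
        cudaSetDevice(dev);
        // this GPU owns the contiguous slice range [t_begin, t_end)
        int t_begin = (nt * dev) / ngpus;
        int t_end   = (nt * (dev + 1)) / ngpus;

        cudaMalloc((void **)&d_A[dev], (t_end - t_begin) * w * h * d * sizeof(float));

        // shift t by t_begin so each GPU indexes its own buffer starting at slice 0
        for (int t = t_begin; t < t_end; t++)
            getIndex<<<gridSize, blockDim>>>(d_A[dev], w, h, d, t - t_begin);
    }

    for (int dev = 0; dev < ngpus; dev++)
    {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();  // then cudaMemcpy each slice range back to the host as needed
        cudaFree(d_A[dev]);
    }
    return 0;
}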

Thanks a lot.

If you have any example of multi-node, multi-GPU code, it would be very helpful.

Multi-GPU CUDA sample codes:

conjugateGradientMultiDeviceCG
MonteCarloMultiGPU
topologyQuery
simpleCUFFT_MGPU
simpleCUFFT_2d_MGPU
cudaOpenMP
simpleMultiGPU
simpleP2P
p2pBandwidthLatencyTest
nbody

For multi-node, just use a distribution/communication strategy you are familiar with, e.g. MPI.
Conceptually, dividing the work between 2 GPUs in the same node is not much different from dividing the work between 2 GPUs in separate nodes. The difference is in the node-to-node communication, and that can be pretty much orthogonal to CUDA, especially for beginners.
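
As a rough illustration of that split across nodes, here is a hedged sketch assuming one MPI rank per GPU (launched with mpirun), the guarded getIndex kernel from above, at least one GPU visible on every node, and with result exchange and error checking left out:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // bind this rank to one of the GPUs visible on its node (assumes >= 1 GPU per node)
    int local_gpus = 0;
    cudaGetDeviceCount(&local_gpus);
    cudaSetDevice(rank % local_gpus);

    unsigned w = 5, h = 6, d = 5, nt = 7;

    // each rank owns a contiguous range of t slices, exactly as in the single-node case
    int t_begin = (nt * rank) / nranks;
    int t_end   = (nt * (rank + 1)) / nranks;

    float *d_A;
    cudaMalloc((void **)&d_A, (t_end - t_begin) * w * h * d * sizeof(float));

    dim3 blockDim(8, 8, 1);
    dim3 gridSize((w + blockDim.x - 1) / blockDim.x,
                  (h + blockDim.y - 1) / blockDim.y,
                  (d + blockDim.z - 1) / blockDim.z);

    for (int t = t_begin; t < t_end; t++)
        getIndex<<<gridSize, blockDim>>>(d_A, w, h, d, t - t_begin);

    cudaDeviceSynchronize();
    // copy d_A back to the host here, then gather or exchange results with MPI as needed
    cudaFree(d_A);

    MPI_Finalize();
    return 0;
}

Such code is typically built by compiling the .cu file with nvcc and linking against the MPI library (or by pointing the MPI compiler wrapper at the CUDA headers and libraries); the exact commands depend on the cluster.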

There is a CUDA MPI sample code (simpleMPI) as well.