CUDA Program Issue

I have a question about a CUDA program.
I wrote a program that adds two arrays together on the GPU.
Part of the kernel function in my program is as follows:

program A)
const int size = 1000*2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i >= size) return;
result[i] = array1[i] + array2[i];

return;

program B)
const int nx = 1000;
const int ny = 2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = nx * j + i;

if (i < nx && j < ny)
result[k] = array1[k] + array2[k];
return;

Result)
Program A adds the arrays without any problems, but program B leaves every element of the result at 0.

Is there anything wrong with it?
I would like to perform efficient GPU calculations using "int i" and "int j".

Please format your code with the </> button. How do you call/launch your kernels? Are the grid and block dimensions set accordingly?

Your kernel design pretty much expects a 2D grid of 2D threadblocks, but you are launching a 1D grid of 1D threadblocks. There are many resources online that can help you formulate a 2D grid/launch; here is one example.

The product of the blockShape x and y dimensions should not exceed 1024 (the exact number depending on your GPU architecture).
The grid dimensions can be large.

You could do

dim3 blockShape = dim3(32, 32);
dim3 gridShape = dim3(1048576, 1048576);

or

dim3 blockShape = dim3(1024, 1);
dim3 gridShape = dim3(1048576, 1048576);

Any time you're having difficulty with a CUDA code, I recommend proper CUDA error checking. It will be useful for others who are trying to help you, and eventually it will help you sort out problems on your own.
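
For example, a minimal version of the pattern I mean (the macro name CHECK_CUDA is just my choice):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc(&d_ptr, bytes));
//   kernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaGetLastError());       // catches launch-configuration errors
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches errors during kernel execution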

I changed it to the following settings and it worked correctly.

dim3 blockShape = dim3(1000, 1);
dim3 gridShape = dim3(3, 3000);

If possible, please explain the mechanism by which this works correctly.

The

dim3 gridShape = dim3(1048576, 1048576);

was just an example. It would run around 10^12 blocks (both numbers multiplied), or about 10^15 threads. That would lead either to a timeout or to a memory out-of-bounds error in your kernel.

Check the return values of all CUDA calls and of cudaDeviceSynchronize().

By the way, dim3(1048576, 1048576) is an illegal grid shape. The first dimension can be up to 2^31-1. The second and third dimensions are limited to 65535.
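
To spell out why the configuration you found both satisfies these limits and covers the 1000 x 2000 problem in program B:

dim3 blockShape = dim3(1000, 1); // 1000 * 1 = 1000 threads per block (<= 1024: legal)
dim3 gridShape  = dim3(3, 3000); // grid.x = 3 (<= 2^31-1), grid.y = 3000 (<= 65535): legal

// Coverage of the nx = 1000, ny = 2000 domain:
//   x: 3 blocks    * 1000 threads = 3000 >= nx
//   y: 3000 blocks * 1 thread     = 3000 >= ny
// The bounds check if (i < nx && j < ny) in the kernel discards the excess threads.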

I encourage you to use proper CUDA error checking.


Thank you for your cooperation.
I want to master 2D grids, so could you please share some sample code?

#include <cassert>
#include <cuda_runtime.h>

template <typename T>
__global__ void AddPixel_Kernel(const T *array1, const T *array2, T *result,
                                const int nx, const int ny) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x; // width index
    size_t j = blockIdx.y * blockDim.y + threadIdx.y; // height index
    size_t k = nx * j + i;                            // linear index into the arrays

    if (i < (size_t)nx && j < (size_t)ny) {
        result[k] = array1[k] + array2[k];
    }
}

template <typename T>
void AddPixel(const T *array1, const T *array2, T *result, const int nx, const int ny) {
    T *d_array1, *d_array2, *d_result;
    size_t size = ((size_t)nx) * ny;

    cudaMalloc((void **)&d_array1, size * sizeof(T));
    cudaMalloc((void **)&d_array2, size * sizeof(T));
    cudaMalloc((void **)&d_result, size * sizeof(T));
    cudaMemcpy(d_array1, array1, size * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(d_array2, array2, size * sizeof(T), cudaMemcpyHostToDevice);

    dim3 block(32, 32);                      // 32 * 32 = 1024 threads per block
    dim3 grid((nx + block.x - 1) / block.x,  // round up so the grid covers nx
              (ny + block.y - 1) / block.y); // round up so the grid covers ny
    assert(grid.x <= 2147483647);            // grid.x limit: 2^31 - 1
    assert(grid.y <= 65535);                 // grid.y limit: 65535

    AddPixel_Kernel<<<grid, block>>>(d_array1, d_array2, d_result, nx, ny);

    cudaMemcpy(result, d_result, size * sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_array1);
    cudaFree(d_array2);
    cudaFree(d_result);
}

That will work for a y-dimension up to about 2 million (32 x 65535 = 2,097,120) and an x-dimension up to about 2 billion. And as you are working with code and learning, I strongly encourage the use of proper CUDA error checking, and running your codes with compute-sanitizer.
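
For reference, a minimal host-side sketch of how the AddPixel wrapper above might be called (sizes and values chosen arbitrarily):

#include <vector>

int main() {
    const int nx = 1000, ny = 2000;
    std::vector<float> a(size_t(nx) * ny, 1.0f);
    std::vector<float> b(size_t(nx) * ny, 2.0f);
    std::vector<float> r(size_t(nx) * ny);

    AddPixel<float>(a.data(), b.data(), r.data(), nx, ny);
    // every element of r should now be 3.0f
    return 0;
}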

You can also find canonical 2D sample codes in the CUDA samples repository. Here is one example.

Thank you.
By the way, is it necessary to use a 2-dimensional grid to speed up a double for-loop?
Or can a double for-loop be made faster even with a 1-dimensional grid?

Multi-dimensional blocks and grids are syntactic sugar. You can always convert a linear index into a multi-dimensional index and vice-versa.
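
As an illustration, here is a sketch (untested, kernel name is my own) of the same addition written for a 1D grid, recovering i and j from the linear index:

template <typename T>
__global__ void AddPixel_Kernel1D(const T *array1, const T *array2, T *result,
                                  const int nx, const int ny)
{
    size_t k = (size_t)blockIdx.x * blockDim.x + threadIdx.x; // linear thread index
    if (k >= (size_t)nx * ny) return;

    int i = (int)(k % nx); // recovered column index, 0 .. nx-1
    int j = (int)(k / nx); // recovered row index,    0 .. ny-1

    // k == (size_t)nx * j + i, so both forms address the same element:
    result[k] = array1[(size_t)nx * j + i] + array2[(size_t)nx * j + i];
}

A 1D launch such as AddPixel_Kernel1D<<<((size_t)nx * ny + 255) / 256, 256>>>(...) then covers the whole array.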

Thank you, I have an additional question.
Regarding the following code, are there any restrictions other than these three?

1. The product of BLOCK_X and BLOCK_Y must not exceed 1024.
2. The first grid dimension can be up to 2^31-1.
3. The second and third grid dimensions are limited to 65535.

dim3 blockShape = dim3(BLOCK_X, BLOCK_Y);
dim3 gridShape  = dim3(GRID_X, GRID_Y);

All restrictions are listed in the programming guide, Table 21.
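
If you would rather query the limits at runtime than look them up, cudaGetDeviceProperties reports them for the installed GPU:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}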

I understand the maximum values, but is it possible to set GRID_X or GRID_Y to a value that is not a power of 2, such as 3000?

Any value that does not exceed the limit is fine. What suggested that programmers are limited to powers of two?

If that were the case, the documentation would certainly mention such an important restriction.

I would like to speed up the following CPU calculation using CUDA’s two-dimensional grid.

s = 100;
t = 300;

for(j = 0; j < s; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}


for(j = s; j < (s + 100); j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}


for(j = (s + 100); j < t; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

What code is appropriate?

int idx = blockIdx.x * blockDim.x + threadIdx.x;
int jdx = blockIdx.y * blockDim.y + threadIdx.y;

// fooBar: the per-thread work using idx and jdx goes here

OK, thank you.

I would like to speed up the following CPU program using CUDA's 2D/3D grid.
Could you please share sample code?

s = 100;
t = 300;

for(j = 0; j < s; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j]= b[100];
	}
}
for(j = s; j < (s + 100); j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}
for(j = (s + 100); j < t; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

This loop is unnecessarily writing to various locations in the array a multiple times. Even for pure C++ code, that is wasteful and unnecessary. The first step in the process that I would follow would be to refactor that loop so that each location in a is written only once, corresponding to its final value at the completion of the outer for-loop.

Reducing that to a single loop will make the creation of a CUDA kernel to do the same thing a straightforward process. And reducing/refactoring to a single loop is purely a task for your C++ skills, not requiring any CUDA knowledge whatsoever.
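
To illustrate the idea (a sketch I have not tested against your full program): because j is the outer loop, the last write to a given a[k] is the one with the largest j satisfying i + j == k, i.e. with i = max(0, k - (t-1)). With s = 100 and t = 300, all three loop nests collapse to:

// Sketch only; assumes s = 100, t = 300 as in the posted code.
// a[k] is defined for k = 0 .. 2*(t-1); its final value is the one
// written with i = max(0, k - (t-1)).
for (int k = 0; k <= 2 * (t - 1); k++) {
    int i = (k > t - 1) ? (k - (t - 1)) : 0; // i of the last write to a[k]
    if (i < s)
        a[k] = b[0];
    else if (i < s + 100)
        a[k] = b[i];
    else
        a[k] = b[100];
}

In this form every a[k] is written exactly once and each iteration is independent, so it maps directly onto one thread per k in a 1D CUDA grid.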