CUDA Program Issue

I have a question about a CUDA program.
I wrote a program that adds two arrays together on the GPU.
Part of the kernel function in my program is as follows:

program A)
const int size = 1000*2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i >= size) return;
result[i] = array1[i] + array2[i];

return;

program B)
const int nx = 1000;
const int ny = 2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = nx * j + i;

if (i < nx && j < ny)
result[k] = array1[k] + array2[k];
return;

Result)
Program A adds the arrays without any problems, but program B leaves every element of the result at 0.

Is there anything wrong with it?
I would like to perform efficient GPU calculations using "int i" and "int j".

Please format your code with the </> button. How do you call/launch your kernels? Are the grid and block dimensions set accordingly?

Your kernel design pretty much expects a 2D grid of 2D threadblocks, but you are launching a 1D grid of 1D threadblocks. There are many resources online that can help you formulate a 2D grid/launch; here is one example.

The product of the blockShape x and y dimensions should not exceed 1024 (the exact number depending on your GPU architecture).
The grid dimensions can be large.

You could do

dim3 blockShape = dim3(32, 32);
dim3 gridShape = dim3(1048576, 1048576);

or

dim3 blockShape = dim3(1024, 1);
dim3 gridShape = dim3(1048576, 1048576);

Any time you're having difficulty with a CUDA code, I recommend proper CUDA error checking. It will be useful for others who are trying to help you, and eventually it will help you sort out problems on your own.
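
For example, a minimal version of the pattern I mean (the macro name CHECK_CUDA is just my choice):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc(&d_ptr, bytes));
//   kernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaGetLastError());       // catches launch-configuration errors
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches errors during kernel execution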

I changed it to the following settings and it worked correctly.

dim3 blockShape = dim3(1000, 1);
dim3 gridShape = dim3(3, 3000);

If possible, please explain the mechanism by which this works correctly.

The

dim3 gridShape = dim3(1048576, 1048576);

was just an example. It would run around 10^12 blocks (both numbers multiplied), or about 10^15 threads. That would lead either to a timeout or to a memory out-of-bounds error in your kernel.

Check the return values of all CUDA calls and of cudaDeviceSynchronize().

By the way, dim3(1048576, 1048576) is an illegal grid shape. The first dimension can be up to 2^31-1. The second and third dimensions are limited to 65535.
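
To spell out why the configuration you found both satisfies these limits and covers the 1000 x 2000 problem in program B:

dim3 blockShape = dim3(1000, 1); // 1000 * 1 = 1000 threads per block (<= 1024: legal)
dim3 gridShape  = dim3(3, 3000); // grid.x = 3 (<= 2^31-1), grid.y = 3000 (<= 65535): legal

// Coverage of the nx = 1000, ny = 2000 domain:
//   x: 3 blocks    * 1000 threads = 3000 >= nx
//   y: 3000 blocks * 1 thread     = 3000 >= ny
// The bounds check if (i < nx && j < ny) in the kernel discards the excess threads.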

I encourage you to use proper CUDA error checking.


Thank you for your cooperation.
I want to master 2D grids, so could you please share some sample code?

#include <cassert>
#include <cuda_runtime.h>

template <typename T>
__global__ void AddPixel_Kernel(const T *array1, const T *array2, T *result,
                                const int nx, const int ny) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x; // width index
    size_t j = blockIdx.y * blockDim.y + threadIdx.y; // height index
    size_t k = nx * j + i;                            // linear index into the arrays

    if (i < (size_t)nx && j < (size_t)ny) {
        result[k] = array1[k] + array2[k];
    }
}

template <typename T>
void AddPixel(const T *array1, const T *array2, T *result, const int nx, const int ny) {
    T *d_array1, *d_array2, *d_result;
    size_t size = ((size_t)nx) * ny;

    cudaMalloc((void **)&d_array1, size * sizeof(T));
    cudaMalloc((void **)&d_array2, size * sizeof(T));
    cudaMalloc((void **)&d_result, size * sizeof(T));
    cudaMemcpy(d_array1, array1, size * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(d_array2, array2, size * sizeof(T), cudaMemcpyHostToDevice);

    dim3 block(32, 32);                      // 32 * 32 = 1024 threads per block
    dim3 grid((nx + block.x - 1) / block.x,  // round up so the grid covers nx
              (ny + block.y - 1) / block.y); // round up so the grid covers ny
    assert(grid.x <= 2147483647);            // grid.x limit: 2^31 - 1
    assert(grid.y <= 65535);                 // grid.y limit: 65535

    AddPixel_Kernel<<<grid, block>>>(d_array1, d_array2, d_result, nx, ny);

    cudaMemcpy(result, d_result, size * sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_array1);
    cudaFree(d_array2);
    cudaFree(d_result);
}

That will work for a y-dimension up to about 2 million (32 x 65535 = 2,097,120) and an x-dimension up to about 2 billion. And as you are working with code and learning, I strongly encourage the use of proper CUDA error checking, and running your codes with compute-sanitizer.
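
For reference, a minimal host-side sketch of how the AddPixel wrapper above might be called (sizes and values chosen arbitrarily):

#include <vector>

int main() {
    const int nx = 1000, ny = 2000;
    std::vector<float> a(size_t(nx) * ny, 1.0f);
    std::vector<float> b(size_t(nx) * ny, 2.0f);
    std::vector<float> r(size_t(nx) * ny);

    AddPixel<float>(a.data(), b.data(), r.data(), nx, ny);
    // every element of r should now be 3.0f
    return 0;
}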

You can also find canonical 2D sample codes in the CUDA samples repository. Here is one example.

Thank you.
By the way, is it necessary to use a 2-dimensional grid to speed up a double for-loop?
Or can a double for-loop be made faster even with a 1-dimensional grid?

Multi-dimensional blocks and grids are syntactic sugar. You can always convert a linear index into a multi-dimensional index and vice-versa.
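
As an illustration, here is a sketch (untested, kernel name is my own) of the same addition written for a 1D grid, recovering i and j from the linear index:

template <typename T>
__global__ void AddPixel_Kernel1D(const T *array1, const T *array2, T *result,
                                  const int nx, const int ny)
{
    size_t k = (size_t)blockIdx.x * blockDim.x + threadIdx.x; // linear thread index
    if (k >= (size_t)nx * ny) return;

    int i = (int)(k % nx); // recovered column index, 0 .. nx-1
    int j = (int)(k / nx); // recovered row index,    0 .. ny-1

    // k == (size_t)nx * j + i, so both forms address the same element:
    result[k] = array1[(size_t)nx * j + i] + array2[(size_t)nx * j + i];
}

A 1D launch such as AddPixel_Kernel1D<<<((size_t)nx * ny + 255) / 256, 256>>>(...) then covers the whole array.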

Thank you, I have an additional question.
Regarding the following code, are there any restrictions other than these three?

1. The product of BLOCK_X and BLOCK_Y must not exceed 1024.
2. The first grid dimension can be up to 2^31-1.
3. The second and third grid dimensions are limited to 65535.

dim3 blockShape = dim3(BLOCK_X, BLOCK_Y);
dim3 gridShape  = dim3(GRID_X, GRID_Y);

All restrictions are listed in the programming guide, Table 21.
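
If you would rather query the limits at runtime than look them up, cudaGetDeviceProperties reports them for the installed GPU:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}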

I understand the maximum values, but is it possible to set GRID_X or GRID_Y to a value that is not a power of 2, such as 3000?

Any value that does not exceed the limit is fine. What suggested that programmers are limited to powers of two?

If that were the case, the documentation would certainly mention such an important restriction.

I would like to speed up the following CPU calculation using CUDA’s two-dimensional grid.

s = 100;
t = 300;

for(j = 0; j < s; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}


for(j = s; j < (s + 100); j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}


for(j = (s + 100); j < t; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

What code is appropriate?

int idx = blockIdx.x * blockDim.x + threadIdx.x;
int jdx = blockIdx.y * blockDim.y + threadIdx.y;

// fooBar: the per-thread work using idx and jdx goes here

OK, thank you.

I would like to speed up the following CPU program using CUDA's 2D/3D grid.
Could you please share sample code?

s = 100;
t = 300;

for(j = 0; j < s; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j]= b[100];
	}
}
for(j = s; j < (s + 100); j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}
for(j = (s + 100); j < t; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

This loop is unnecessarily writing to various locations in the array a multiple times. Even for pure C++ code, that is wasteful and unnecessary. The first step in the process that I would follow would be to refactor that loop so that each location in a is written only once, corresponding to its final value at the completion of the outer for-loop.

Reducing that to a single loop will make the creation of a CUDA kernel to do the same thing a straightforward process. And reducing/refactoring to a single loop is purely a task for your C++ skills, not requiring any CUDA knowledge whatsoever.
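
To illustrate the idea (a sketch I have not tested against your full program): because j is the outer loop, the last write to a given a[k] is the one with the largest j satisfying i + j == k, i.e. with i = max(0, k - (t-1)). With s = 100 and t = 300, all three loop nests collapse to:

// Sketch only; assumes s = 100, t = 300 as in the posted code.
// a[k] is defined for k = 0 .. 2*(t-1); its final value is the one
// written with i = max(0, k - (t-1)).
for (int k = 0; k <= 2 * (t - 1); k++) {
    int i = (k > t - 1) ? (k - (t - 1)) : 0; // i of the last write to a[k]
    if (i < s)
        a[k] = b[0];
    else if (i < s + 100)
        a[k] = b[i];
    else
        a[k] = b[100];
}

In this form every a[k] is written exactly once and each iteration is independent, so it maps directly onto one thread per k in a 1D CUDA grid.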