I have a question about a CUDA program.
I created a program that adds arrays together using the GPU.
Part of the kernel function in our program is as follows:

program A)
const int size = 1000*2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i >= size) return;
result[i] = array1[i] + array2[i];

return;

program B)
const int nx = 1000;
const int ny = 2000;
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = nx * j + i;

if (i < nx && j < ny)
result[k] = array1[k] + array2[k];
return;

Result)
The array addition in “program A” works without any problems, but the result of the array addition in “program B” is 0.

Is there anything inappropriate in my code?
I would like to perform efficient GPU calculations using “int i” and “int j”.

Your kernel design pretty much expects a 2D grid of 2D threadblocks, but you are launching a 1D grid of 1D threadblocks. There are many resources online that can help you formulate a 2D grid/launch; here is one example.

The product of the block shape's x and y dimensions should not exceed 1024 (the exact limit depends on your GPU architecture).
The grid dimensions can be large.

Any time you're having difficulty with CUDA code, I recommend proper CUDA error checking. It will be useful for others who are trying to help you, and eventually it will help you sort out problems on your own.

That was just an example. It would run around 10^12 blocks (the two numbers multiplied), i.e. around 10^15 threads.
That would lead either to a timeout or to an out-of-bounds memory access in your kernel.

Check the return values of all CUDA calls and of cudaDeviceSynchronize().
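A minimal sketch of what that checking can look like (the `CHECK` macro name and the exit-on-error policy are just illustrative choices, not a standard API):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Usage: wrap every runtime call, and check kernel launches in two steps:
//   CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
//   my_kernel<<<grid, block>>>(...);
//   CHECK(cudaGetLastError());      // invalid launch configuration, etc.
//   CHECK(cudaDeviceSynchronize()); // errors raised during kernel execution
```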

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void AddPixel_Kernel(const T *array1, const T *array2, T *result,
                                const int nx, const int ny) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x; // width
    size_t j = blockIdx.y * blockDim.y + threadIdx.y; // height
    size_t k = nx * j + i;                            // flattened index
    if (i < nx && j < ny) {
        result[k] = array1[k] + array2[k];
    }
}

template <typename T>
void AddPixel(const T *array1, const T *array2, T *result, const int nx, const int ny) {
    T *d_array1, *d_array2, *d_result;
    size_t size = ((size_t)nx) * ny;
    cudaMalloc((void**)&d_array1, size * sizeof(T));
    cudaMalloc((void**)&d_array2, size * sizeof(T));
    cudaMalloc((void**)&d_result, size * sizeof(T));
    cudaMemcpy(d_array1, array1, size * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(d_array2, array2, size * sizeof(T), cudaMemcpyHostToDevice);
    dim3 block(32, 32); // 32*32 = 1024 threads per block (the maximum)
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    assert(grid.x <= 2147483647); // x grid dimension limit: 2^31 - 1
    assert(grid.y <= 65535);      // y grid dimension limit: 65535
    AddPixel_Kernel<<<grid, block>>>(d_array1, d_array2, d_result, nx, ny);
    cudaError_t err = cudaGetLastError(); // check the launch itself
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch error: %s\n", cudaGetErrorString(err));
    cudaMemcpy(result, d_result, size * sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_array1);
    cudaFree(d_array2);
    cudaFree(d_result);
}

That will work for a y-dimension up to about 2 million (32 × 65535 = 2,097,120) and an x-dimension up to about 2 billion. And as you are working with code and learning, I strongly encourage the use of proper CUDA error checking, and running your codes with compute-sanitizer.

You can also find canonical 2D sample codes in the CUDA samples repository. Here is one example.

Thank you.
By the way, is it necessary to use a 2-dimensional grid to speed up a double for loop?
Or can a double for loop be made faster even with a 1-dimensional grid?

Thank you, I have an additional question.
Regarding the code above, are there any other restrictions besides the following?

1. The product of BLOCK_X and BLOCK_Y must not exceed 1024.
2. The first grid dimension can be up to 2^31 - 1.
3. The second and third grid dimensions are limited to 65535.