How to manipulate a 1D array of 128 million integers


I need to build a small library of kernels that will process 1D arrays of 128 million integers. All of the examples I have seen
deal with 2D problems, so I was hoping I could get a sanity check from the experts to see if my approach is correct for a 1D
problem. Here is my main and mykernel.

dynamically Alloc A1
dynamically Alloc B1
dynamically Alloc C1

N = 128,000,000
NSQRT = int(sqrt(N))+1

dim3 dimBlock(16,16)
dim3 dimGrid( (NSQRT+dimBlock.x-1)/dimBlock.x, (NSQRT+dimBlock.y-1)/dimBlock.y)
mykernel<<<dimGrid, dimBlock>>> (A1, B1, C1, N)

mykernel (int *A, int *B, int *C, int N)
    int Q = gridDim.x * blockIdx.y + blockIdx.x
    int R = Q * blockDim.x * blockDim.y
    int IX = R + blockDim.x * threadIdx.y + threadIdx.x

    if (IX < N)
        A[IX] = B[IX] + C[IX]

Since I have 128 million (roughly 2**27) integers to manipulate, I thought I would launch a kernel with 128 million threads.
I decided each thread block would have 256 threads.
I (arbitrarily) decided to make the grid square, so I computed the square root of N in my dimGrid initializations.
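
As a quick check of that launch configuration, the grid arithmetic can be worked through on the host. This is only a sketch of the numbers in the post (N = 128,000,000, 16x16 blocks); the variable names `bx`, `by`, `gx`, `gy` are made up for illustration.

```cuda
#include <cmath>
#include <cstdio>

int main()
{
    const int N = 128000000;                                // element count from the post
    const int NSQRT = static_cast<int>(std::sqrt(static_cast<double>(N))) + 1;
    const int bx = 16, by = 16;                             // dimBlock(16,16): 256 threads per block
    const int gx = (NSQRT + bx - 1) / bx;                   // ceiling division, grid width in blocks
    const int gy = (NSQRT + by - 1) / by;                   // grid height in blocks
    const long long threads = 1LL * gx * gy * bx * by;      // total threads launched

    std::printf("NSQRT = %d\n", NSQRT);
    std::printf("grid  = %d x %d blocks\n", gx, gy);
    std::printf("threads launched = %lld, surplus = %lld\n", threads, threads - N);
    return 0;
}
```

With these inputs NSQRT is 11314, the grid is 708 x 708 blocks, and 128,323,584 threads are launched, so 323,584 surplus threads have to be masked out by the `IX < N` test in the kernel.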

Inside of my kernel I now have a 2D grid that is composed of 2D thread blocks.
I simply go through some calculations to create IX, the 1D index into the arrays, from all of the 2D grid and block information.

  1. Is my overall approach reasonable?
  2. Are there “cleaner” or “better” ways?




In my eyes your indexing is a bit unintuitive. This is my way:

__global__ void mykernel(int *A, int *B, int *C, int N)
{
    int Q  = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate
    int R  = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate
    int IX = gridDim.x * blockDim.x * R + Q;          // row-major 1D index

    if (IX < N)
        A[IX] = B[IX] + C[IX];
}
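
For anyone who wants to convince themselves that this row-major mapping touches every element exactly once, here is a minimal self-contained sketch. The kernel name `add2D`, the small test size, and the host-side check are assumptions added for illustration; only the index formula comes from the post above.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same index formula as in the post: treat the whole launch as one wide 2D image.
__global__ void add2D(int *A, const int *B, const int *C, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate
    int ix  = gridDim.x * blockDim.x * row + col;      // row-major 1D index
    if (ix < N)
        A[ix] = B[ix] + C[ix];
}

int main()
{
    const int N = 1000003;                 // hypothetical small size, deliberately not a perfect square
    const size_t bytes = N * sizeof(int);
    std::vector<int> hA(N), hB(N), hC(N);
    for (int i = 0; i < N; ++i) { hB[i] = i; hC[i] = 2 * i; }

    int *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    // Roughly square grid, sized as in the original question.
    const int side = static_cast<int>(std::sqrt(static_cast<double>(N))) + 1;
    dim3 block(16, 16);
    dim3 grid((side + block.x - 1) / block.x, (side + block.y - 1) / block.y);
    add2D<<<grid, block>>>(dA, dB, dC, N);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);

    // Every element should equal B[i] + C[i] = 3 * i if the mapping is complete.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (hA[i] != 3 * i) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return mismatches == 0 ? 0 : 1;
}
```

The mapping works because `col` is always smaller than the row width `gridDim.x * blockDim.x`, so each (row, col) pair lands on a distinct index.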




Why use 2D blocks when you’re dealing with 1D arrays?

Vector addition can be made as simple as

__global__ void myKernel(int *A, int *B, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        out[tid] = A[tid] + B[tid];
}

dynamically Alloc A1  // on device
dynamically Alloc B1  // on device
dynamically Alloc C1  // on device

N = 128000000;

dim3 dimBlock(256, 1, 1);
dim3 dimGrid(N/256, 1, 1);

myKernel<<<dimGrid, dimBlock>>>(A1, B1, C1, N);


@BigMac: Your solution will fail to launch. There is a limit of 65,535 blocks for each dimension of dimGrid, and N/256 is 500,000.
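
Rather than hard-coding the limit, it can be read from the device at run time. The sketch below assumes device 0 and the thread's numbers (N = 128,000,000, 256 threads per block); `cudaGetDeviceProperties` and the `maxGridSize` field are part of the CUDA runtime API. The reported values depend on the GPU generation, so the 65,535 figure applies to the hardware discussed here and may be larger in the x dimension on newer devices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 is an assumption
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int N = 128000000;                          // element count from the thread
    const int threads = 256;                          // threads per block
    const int blocks = (N + threads - 1) / threads;   // 500,000 blocks for a plain 1D launch

    std::printf("max grid size: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("%d blocks %s in gridDim.x on this device\n",
                blocks, blocks <= prop.maxGridSize[0] ? "fit" : "do not fit");
    return 0;
}
```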

@bzigon: Use a fixed 1D grid and have each thread work on multiple elements. Several SDK examples use this arrangement (I think MonteCarlo is one of them).
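
A minimal sketch of that arrangement, often called a grid-stride loop, might look like the following. The kernel name `addStrided`, the grid size of 1024 blocks, and the reduced test size are assumptions chosen for illustration, not values taken from the SDK samples. The full 128,000,000-element case would need about 1.5 GB of device memory for the three arrays.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its global index and hops forward by the total
// number of threads in the grid until it runs past the end of the array.
__global__ void addStrided(int *out, const int *a, const int *b, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 22;                 // reduced size for a quick test
    const size_t bytes = n * sizeof(int);
    std::vector<int> hA(n), hB(n), hOut(n);
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2 * i; }

    int *dA, *dB, *dOut;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // Fixed 1D grid, independent of n: 1024 blocks of 256 threads,
    // so each thread handles 16 elements at this test size.
    const int threads = 256;
    const int blocks = 1024;
    addStrided<<<blocks, threads>>>(dOut, dA, dB, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    // Host-side check: every element should be a[i] + b[i] = 3 * i.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 3 * i) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    return mismatches == 0 ? 0 : 1;
}
```

Because the grid size no longer depends on n, the same launch works for any array length and stays well inside the per-dimension grid limit.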

Oh crap, totally forgot about that :)