Vector limits in CUDA

Hi,

I want to use the GPU for iterative solvers, so I need, for example, the addition of two vectors.

This is my short program (still without templates and expression templates).

#include <iostream>
#include <cstdlib>
#include <assert.h>

#define BLOCK_SIZE 512

// d_out[id] += d_in[id] for every id < dim, one element per thread
__global__ void vector_plus(int *d_out, int *d_in, int dim)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	if (id < dim)
		d_out[id] += d_in[id];
}

////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
	// pointer for host memory and size
	int *h_a;
	int dimA = 33*1000*1000, grid = dimA/BLOCK_SIZE+1;
	int *d_b, *d_a;
	size_t memSize = dimA * sizeof(int);

	h_a = (int *) malloc(memSize);
	std::cout << "grid\n" << grid << "\n";

	cudaMalloc( (void **) &d_a, memSize );
	cudaMalloc( (void **) &d_b, memSize );

	// fill d_a with 2 and d_b with 1
	for (int i = 0; i < dimA; i++)
		h_a[i] = 2;
	cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
	for (int i = 0; i < dimA; i++)
		h_a[i] = 1;
	cudaMemcpy( d_b, h_a, memSize, cudaMemcpyHostToDevice );

	// one-dimensional grid: grid blocks of BLOCK_SIZE threads each
	dim3 dimGrid(grid), dimBlock(BLOCK_SIZE);
	vector_plus<<< dimGrid, dimBlock >>>( d_b, d_a , dimA);
	cudaThreadSynchronize();

	// d_b += d_a, so every h_a[i] should now be 3
	cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

	cudaFree(d_a);
	cudaFree(d_b);
	free(h_a);
	return 0;
}

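The program works and gives correct results up to 33 million elements, but for 34 million I get wrong results.

The problem is that the variable grid becomes bigger than 65535: for 33 million elements it is 33000000/512+1 = 64454, but for 34 million it is 34000000/512+1 = 66407.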

Limits on the C1060:

Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
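To make the limit concrete, here is a small sketch of how the block count could be computed and how a rejected launch could be detected. As far as I know, a launch whose grid dimension exceeds the device limit fails with an error instead of running, which would explain the unchanged (wrong) results. The helper names blocks_needed and launch_ok are made up for illustration; cudaGetLastError and cudaGetErrorString are CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

// Hypothetical helper: number of blocks needed to cover n elements (rounded up).
static int blocks_needed(int n)
{
	return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

// Hypothetical helper: check whether the most recent kernel launch was accepted.
// Prints the runtime's error string and returns false if it was not.
static bool launch_ok(const char *what)
{
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess) {
		std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
		return false;
	}
	return true;
}
```

Calling launch_ok("vector_plus") right after the kernel launch should report an error for 34 million elements on a device with the limits listed above, since blocks_needed(34000000) is 66407.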

Is there a way to add more than 33 million elements? If so, can someone explain the approach to me?

Or is this the limit in CUDA?

stonator

There are a number of ways you can extend the range of your kernel. One way would be to use a two-dimensional grid. That way you can have up to 65535*65535 blocks (a bit more than 4e9). You could also have each thread perform more than one addition, striding through the vector in steps of the total number of threads in the grid. For example,

__global__ void vector_plus(int *d_out, int *d_in, int dim)
{
	unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int gridsize = blockDim.x * gridDim.x;
	for (unsigned int i = id; i < dim; i += gridsize)
		d_out[i] += d_in[i];
}

would probably work.
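For the two-dimensional grid, something along these lines might work. This is an untested sketch: vector_plus_2d, make_grid and MAX_GRID_X are names made up here, and dim is assumed to stay well below 2^32 so that the index arithmetic does not overflow.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 512
#define MAX_GRID_X 65535  // per-dimension grid limit quoted in the question

// Variant of vector_plus that flattens a 2D grid into one linear block index.
__global__ void vector_plus_2d(int *d_out, const int *d_in, unsigned int dim)
{
	unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
	unsigned int id = block * blockDim.x + threadIdx.x;
	if (id < dim)  // the grid may hold a few more threads than elements
		d_out[id] += d_in[id];
}

// Hypothetical helper: pick a 2D grid with at least enough blocks for dim
// elements while keeping each grid dimension within MAX_GRID_X.
static dim3 make_grid(unsigned int dim)
{
	unsigned int blocks = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE;
	unsigned int gy = (blocks + MAX_GRID_X - 1) / MAX_GRID_X;  // rows needed
	unsigned int gx = (blocks + gy - 1) / gy;                  // blocks per row
	return dim3(gx, gy);
}
```

The launch would then look like vector_plus_2d<<< make_grid(dimA), BLOCK_SIZE >>>( d_b, d_a, dimA ). For 34 million elements this gives a 33204 x 2 grid. With the loop version above, the one-dimensional grid can simply be capped at 65535 blocks instead, because each thread handles several elements.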