Hi,
I want to use the GPU for iterative solvers, so I need, for example, the addition of two vectors.
This is my short program (for now without templates or expression templates).
#include <iostream>
#include <cstdlib>   // malloc, free
#include <assert.h>

#define BLOCK_SIZE 512

// Element-wise addition d_out[i] += d_in[i], one thread per element.
__global__ void vector_plus(int *d_out, int *d_in, int dim)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < dim)
        d_out[id] += d_in[id];
}

////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 33*1000*1000, grid = dimA/BLOCK_SIZE + 1;
    int *d_b, *d_a;
    size_t memSize = dimA * sizeof(int);

    h_a = (int *) malloc(memSize);
    std::cout << "grid\n" << grid << "\n";

    cudaMalloc((void **) &d_a, memSize);
    cudaMalloc((void **) &d_b, memSize);

    // d_a: every element is 2
    for (int i = 0; i < dimA; i++)
        h_a[i] = 2;
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    // d_b: every element is 1
    for (int i = 0; i < dimA; i++)
        h_a[i] = 1;
    cudaMemcpy(d_b, h_a, memSize, cudaMemcpyHostToDevice);

    // d_b += d_a, using a 1D grid of `grid` blocks
    dim3 dimGrid(grid), dimBlock(BLOCK_SIZE);
    vector_plus<<< dimGrid, dimBlock >>>(d_b, d_a, dimA);
    cudaThreadSynchronize();

    // copy the result back (every element should be 3)
    cudaMemcpy(h_a, d_b, memSize, cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    return 0;
}
The program works and gives correct results for up to 33 million elements, but for 34 million I get wrong results.
The problem is that the variable grid then becomes larger than 65535.
Maximums on C1060
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
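To make the arithmetic concrete, here is a small host-side sketch (plain C++, no CUDA calls) of the grid size my program computes, checked against the grid limit listed above. The names kBlockSize, kMaxGridDimX, blocks_needed and fits_in_1d_grid are only for illustration and are not part of my program.

```cpp
// Limits as reported for the C1060 above.
const int kBlockSize   = 512;    // threads per block (BLOCK_SIZE in my program)
const int kMaxGridDimX = 65535;  // maximum grid size in the x dimension

// Same formula as in my program: grid = dimA / BLOCK_SIZE + 1.
int blocks_needed(int dim) { return dim / kBlockSize + 1; }

// True if a 1D grid with that many blocks stays within the x-dimension limit.
bool fits_in_1d_grid(int dim) { return blocks_needed(dim) <= kMaxGridDimX; }
```

For dimA = 33*1000*1000 this gives 64454 blocks, which still fits. For dimA = 34*1000*1000 it gives 66407 blocks, which is more than 65535.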
Is there a way to add more than 33 million elements? If so, can someone explain the approach to me?
Or is this the limit in CUDA?
stonator