Array limit

Hello!

I tried to allocate an array of 16*64*340*340 = 118,374,400 doubles via

cudaMalloc((void **) &arr, sizeof(double)*64*16*340*340);

or via

cudaMallocPitch((void **) &arr, &pitch_arr, sizeof(double)*64*16, 340*340);
and in both cases I got an "unspecified launch failure" from a test kernel that just writes 1.0 to array elements:

__global__ void kernel_fill(double *arr) {
    int x = blockIdx.x;
    int y = blockIdx.y;
    for (int i = 0; i < 64*16; i++) {
        arr[(y*340 + x)*64*16 + i] = 1.0;
    }
}

kernel_fill<<<dim3(340,340),dim3(1,1)>>>(arr);

If I decrease the array length, e.g. by a factor of 16, it works correctly.
Is there some limit on 1D (cudaMalloc) or 2D (cudaMallocPitch) array size in CUDA for CC 2.0?

Thanks in advance.

Assuming you have a card with enough memory (the array alone is about 1 GB), it should work.
Your block/grid configuration is terrible by the way.
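
To rule out a plain out-of-memory situation you can also query the free device memory before allocating; a minimal sketch (just the standard runtime query, nothing tied to your code, variable names are illustrative):

#include <stdio.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;

    // Ask the runtime how much device memory is currently free on the active device
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The requested array: 16*64*340*340 doubles, a bit under 1 GB
    size_t requested = sizeof(double) * 16 * 64 * 340 * 340;

    printf("free: %zu MB  total: %zu MB  requested: %zu MB\n",
           free_bytes >> 20, total_bytes >> 20, requested >> 20);
    return 0;
}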

Yes, it’s a Tesla with 6 GB (ECC off). The block/grid config was kept simple just for readability. Other grid configs give the same result: failure for the huge array, OK for a smaller one.

It works just fine.

#include <stdio.h>
#include <assert.h>

#define MIN(x,y) ((x)<(y)?(x):(y))

__global__ void kernel_fill(double *arr, int N)
{
    // Thread index
    int i, tid = threadIdx.x + blockDim.x*blockIdx.x;
    // Total number of threads in grid execution
    int numThreads = blockDim.x*gridDim.x;

    for (i = tid; i < N; i += numThreads) arr[i] = 1.0;
}

int main()
{
    double *cpu_array, *gpu_array;
    int i, N = 16*64*340*340;
    size_t arraySize = sizeof(double)*N;
    int blockSize = 512, gridSize;

    printf("Number of elements: %d \n", N);

    cpu_array = (double *) malloc(arraySize);
    cudaMalloc((void **)&gpu_array, arraySize);

    gridSize = MIN((N+blockSize-1)/blockSize, 65535);
    printf("Number of blocks: %d \n", gridSize);

    kernel_fill<<<gridSize,blockSize>>>(gpu_array, N);

    cudaMemcpy(cpu_array, gpu_array, arraySize, cudaMemcpyDeviceToHost);

    for (i = 0; i < N; i++) assert(cpu_array[i] == 1.0);

    free(cpu_array);
    cudaFree(gpu_array);
}

Sorry, it was my fault: an integer overflow in a variable used in the array-size calculation, so cudaMalloc actually allocated a smaller buffer than intended.
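
For anyone hitting the same thing, the pattern looks roughly like this (a hypothetical reconstruction; the numbers are illustrative, not the exact ones from my code):

#include <stdio.h>

int main()
{
    // With a somewhat larger array the byte count no longer fits in a
    // 32-bit int and wraps around, so cudaMalloc quietly gets a much
    // smaller size than intended and the kernel later writes out of bounds.
    int n = 16 * 64 * 512 * 512;                   // 268,435,456 elements
    int bytes_int = n * 8;                         // overflows a signed int (wraps in practice)
    size_t bytes_ok = (size_t)n * sizeof(double);  // 2,147,483,648 bytes, computed correctly

    printf("int-computed size:    %d\n", bytes_int);
    printf("size_t-computed size: %zu\n", bytes_ok);

    double *gpu_array = NULL;
    // Checking the return value helps, but a wrapped size can still "succeed",
    // so compute byte counts in size_t in the first place.
    cudaError_t err = cudaMalloc((void **)&gpu_array, bytes_ok);
    if (err != cudaSuccess)
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    cudaFree(gpu_array);
    return 0;
}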

Thanks for your response!