Maximum memory allocation size

Hi. I’m getting an unspecified launch failure in my code. After reading all of the various posts on this issue, it seems that most people with this problem had memory access mistakes in their kernels or allocation mistakes in their host code. For my code, however, I have good reason to think that there are no such access or allocation mistakes. Furthermore, the behavior of the code suggests that I’m dealing with some memory limit unknown to me.

My code operates on a contiguous block of memory (allocated by cudaMalloc) of size 12 * X * Y * sizeof(type), where X, Y and type are specified by the user. For type = double, my code works perfectly for sufficiently small X and Y, but fails somewhere between X * Y = 700 * 350 and 750 * 375. That’s not very big (around 25 MB), so it’s clearly not a device memory limit (the card has 4 GB). If I change my type to float, however, the code once again functions perfectly. If I keep type = float and double the system size, the failure occurs again.

In one of the many posts I’ve read, someone suggested a limit to the amount of contiguous memory that could be allocated with cudaMalloc, but didn’t elaborate. Is there any merit to this idea? Can someone suggest some techniques for investigating this problem? Thanks.

You might not be able to allocate the whole device memory in one cudaMalloc() call because of address space fragmentation (there might not be a contiguous free part of the address space of that size left). I’d be surprised, however, if this already prevented allocation of just 25 MB.
For the definitive answer whether or not your cudaMalloc() call is failing, check its return code.
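A minimal standalone sketch of such a check might look like this (not your code, just an illustration of never ignoring the return value):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch: attempt a single ~25 MB cudaMalloc and report its
// result, so an allocation failure is never silently ignored.
int main(void)
{
    size_t bytes = 25UL * 1024 * 1024;   // roughly the size in question
    void *p = NULL;

    cudaError_t err = cudaMalloc(&p, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("allocated %zu bytes at %p\n", bytes, p);
    cudaFree(p);
    return EXIT_SUCCESS;
}
```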

To see whether any stray memory accesses cause the problem, run your program under cuda-memcheck.

Hello. You first need to establish exactly where your code is crashing. I have allocated 1D arrays several GB in size without problems.

Thanks for the responses. Yeah, I’m pretty sure it’s not fragmentation, since the 25 MB array is by far the largest thing I allocate when I run the code.

You’re right. My initial error trapping was lazy. I’ve since improved it, and I now know that I’m getting “unspecified launch failure” in my kernel. I then wrote a very simple test code to try to reproduce the problem. It too fails for an array size that should be easily manageable; however, it fails with a different error message, “invalid configuration argument”. Coincidence? I hope not, or else I have two problems, since this test code REALLY should work. Here is the test code:

Main code:

#include <stdio.h>
#include <stdlib.h>

typedef double real;

#include <kernel.c>

void checkCUDAError(const char* msg);

int main(int argc, char *argv[])
{
    real *array_h;
    int i;

    int BLOCK = atoi(argv[2]);
    int SIZE = BLOCK * atoi(argv[1]);  // argv[1] = 65535 works

    printf("\n ARRAY SIZE = %d, BLOCK SIZE = %d \n\n", SIZE, BLOCK);

    array_h = (real*) malloc(SIZE * sizeof(real));
    for (i = 0; i < SIZE; i++)
    {
        array_h[i] = 0.;
    }

    real *array;
    cudaMalloc((void**)&array, SIZE * sizeof(real));
    checkCUDAError("cudaMalloc");

    cudaMemcpy(array, array_h, SIZE * sizeof(real), cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy host to device");

    dim3 block(BLOCK);
    dim3 grid(SIZE / BLOCK);

    kernel<<<grid, block>>>(array);
    checkCUDAError("kernel launch");

    cudaMemcpy(array_h, array, SIZE * sizeof(real), cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy device to host");

    return 0;
}

// From Dr Dobbs "CUDA: Supercomputing for the masses, Part 3"
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

… and the very simple kernel:

__global__ void kernel(real* array)
{
    int n = threadIdx.x + blockDim.x * blockIdx.x;
    array[n] = 42.;
}

This code works for SIZE = 16 * 65535 but fails for SIZE = 16 * 65536.

In case you’re wondering, I have a machine with 8 Tesla C1060s running CUDA Driver version 3.20.

I just realized how easy it is to use cuda-memcheck, so I did. My test code reports no errors, but my regular code reports one out-of-bounds error. I guess that means I’ll go through my regular kernel again looking for bad addresses. That still doesn’t explain why the test code fails.

The maximum size for each dimension in a grid of blocks is 65535.

You have a few options:

  1. increase the number of threads per block (16 is a bad choice anyway)
  2. process more than one element per thread
  3. use 2D grids

BTW you can see most of the limits running deviceQuery or reading the manual:
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
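Option 2, for example, can be sketched with a grid-stride loop (this reuses your real typedef and the numbers from your test; it is just a sketch, not a drop-in replacement):

```cuda
#include <stdio.h>

typedef double real;

// Sketch of option 2: each thread handles several elements via a
// grid-stride loop, so the grid never needs more than 65535 blocks
// no matter how large the array is.
__global__ void kernel(real* array, int size)
{
    int stride = blockDim.x * gridDim.x;
    for (int n = threadIdx.x + blockDim.x * blockIdx.x; n < size; n += stride)
        array[n] = 42.;
}

int main(void)
{
    const int SIZE = 16 * 65536;        // the size that previously failed
    const int BLOCK = 256;              // more threads per block as well
    real *array;

    cudaMalloc((void**)&array, SIZE * sizeof(real));

    int blocks = (SIZE + BLOCK - 1) / BLOCK;
    if (blocks > 65535) blocks = 65535; // cap at the per-dimension limit

    kernel<<<blocks, BLOCK>>>(array, SIZE);
    cudaDeviceSynchronize();
    cudaFree(array);
    return 0;
}
```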

Duh! Of course! I’m a fool!

Thanks mfatica :smile:

To follow up: Problem solved. The problem in my original code was a simple coding mistake after all. I’m not really sure why the code ran for small enough arrays but not for large ones, but now that it’s solved it doesn’t really matter. Thanks to everyone for their help.