Device malloc causes an error

I am using BigInteger math in my program. When I increase the number of elements to process, I run into the following error:
========= Invalid global write of size 4 bytes
========= at CudaBI_FindProbablePrimes(unsigned char *, int, int, unsigned char *)+0x138 in /home/om/projects/cuda-rsa-generator/cuda/cuda_operations.cu:743
========= by thread (319,0,0) in block (24,0,0)
========= Address 0x200000000 is out of bounds
========= and is 113 246 208 bytes before the nearest allocation at 0x206c00000 of size 2 000 027 904 bytes

The code that causes it (the line result.value[0] = input;):
__device__ CudaBI CudaBI_FromInt(int sign, uint32_t input) {
    CudaBI result;
    result.sign = sign;
    result.length = 1;
    result.value = (uint32_t*) malloc(sizeof(uint32_t) * 1);
    result.value[0] = input;
    return result;
}

CudaBI_FromInt is called in a loop. The error only appears when I allocate 10000 or more elements.

Here is how my structure is defined:
struct CudaBI
{
    int sign;
    int length;
    uint32_t* value;
};

I do call free(number.value); afterwards.
I think that malloc returns an invalid address for some of the iterations.
What else could it be?

Can you confirm by testing the result of malloc against 0x200000000?

You can also try cudaMalloc on the device, which returns an error code.
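A minimal sketch of such a check, assuming the CudaBI struct from the question and assuming that a failed device malloc yields a NULL pointer. Marking a failed allocation with length == 0 is a convention chosen here for illustration, and CudaBI_FromIntChecked, from_int_kernel and count_failed_allocations are hypothetical names, not part of the original code.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

struct CudaBI {
    int sign;
    int length;
    uint32_t* value;
};

// Guarded variant of CudaBI_FromInt: it never writes through a failed allocation.
__device__ CudaBI CudaBI_FromIntChecked(int sign, uint32_t input) {
    CudaBI result;
    result.sign = sign;
    result.length = 0;  // assumption: length 0 marks a failed allocation
    result.value = (uint32_t*) malloc(sizeof(uint32_t));
    if (result.value != NULL) {
        result.value[0] = input;
        result.length = 1;
    }
    return result;
}

// Hypothetical test kernel: each thread builds one number, counts a failure
// if the allocation did not succeed, and frees the number otherwise.
__global__ void from_int_kernel(int n, unsigned int* failures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    CudaBI number = CudaBI_FromIntChecked(1, (uint32_t) i);
    if (number.length == 0) {
        atomicAdd(failures, 1u);
        return;
    }
    free(number.value);
}

// Host helper: returns the number of failed device allocations,
// or -1 if a CUDA runtime call reports an error.
int count_failed_allocations(int n) {
    unsigned int* d_failures = NULL;
    unsigned int h_failures = 0;
    if (cudaMalloc((void**)&d_failures, sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMemset(d_failures, 0, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_failures);
        return -1;
    }
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    from_int_kernel<<<blocks, threads>>>(n, d_failures);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(&h_failures, d_failures, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_failures);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return (int) h_failures;
}
```

Calling count_failed_allocations from main with the element count that triggers the error should show whether any allocation fails. A non-zero result would point at the device heap running out rather than at a bug in the BigInteger code itself.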

I tried using cudaMalloc + cudaFree, so my code now looks like this:
__device__ CudaBI CudaBI_FromInt(int sign, uint32_t input) {
    CudaBI result;
    result.sign = sign;
    result.length = 1;
    cudaMalloc((void**)&result.value, sizeof(uint32_t) * 1);
    result.value[0] = input;
    return result;
}

cudaFree(number.value);

And I got the following error:
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x4466f5]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cudaMemcpy [0x701d9]
========= in /home/om/projects/cuda-rsa-generator/cuda/libcudarsagen
========= Host Frame:main [0xbc42]
========= in /home/om/projects/cuda-rsa-generator/cuda/libcudarsagen
========= Host Frame:__libc_start_call_main in …/sysdeps/nptl/libc_start_call_main.h:58 [0x29d8f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in …/csu/libc-start.c:392 [0x29e3f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0xab24]
========= in /home/om/projects/cuda-rsa-generator/cuda/libcudarsagen

That is certainly possible. Returning a NULL pointer is the only way that device malloc has to signal an error, such as running out of memory.

There are numerous questions on the web about the use of device malloc.

  • As a diagnostic, check that the returned pointer value is not zero (a NULL pointer). That is how device malloc signals an error.
  • Device malloc allocates from a limited space (the “device heap”), which by default is not the same size as your device memory. The default size is 8 MB.
  • You can adjust the available space with a runtime API call in host code: cudaDeviceSetLimit(cudaLimitMallocHeapSize, size);
  • Repeated malloc/free on the device heap can result in fragmentation.
  • There is an allocation granularity, so repeated small allocations may use substantially more memory than you would predict.
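
The points above can be tried out with a sketch like the following. It raises the device heap limit from host code and then counts how many small allocations fail while all of them are alive at the same time. The kernel and function names (hold_alloc_kernel, release_kernel, failures_with_heap) are hypothetical, and any heap size passed in is an arbitrary choice, not a recommendation. As far as the CUDA programming guide describes it, the limit has to be set before the first kernel that uses device malloc or free is launched, so the helper is meant to be called once, early in the program.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Each thread keeps one 4-byte allocation alive, so all n allocations
// occupy the device heap at the same time.
__global__ void hold_alloc_kernel(uint32_t** ptrs, int n, unsigned int* failures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    ptrs[i] = (uint32_t*) malloc(sizeof(uint32_t));
    if (ptrs[i] == NULL) atomicAdd(failures, 1u);  // NULL means the heap is exhausted
}

// Releases the allocations made by hold_alloc_kernel; free(NULL) is ignored.
__global__ void release_kernel(uint32_t** ptrs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    free(ptrs[i]);
}

// Sets the device heap size, then reports how many of n simultaneous
// small allocations failed. Returns -1 if a CUDA runtime call fails.
int failures_with_heap(size_t heap_bytes, int n) {
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes) != cudaSuccess) return -1;
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize) != cudaSuccess) return -1;
    printf("device heap size: %zu bytes\n", actual);

    uint32_t** d_ptrs = NULL;
    unsigned int* d_failures = NULL;
    unsigned int h_failures = 0;
    if (cudaMalloc((void**)&d_ptrs, (size_t) n * sizeof(uint32_t*)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_failures, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_ptrs);
        return -1;
    }
    cudaMemset(d_failures, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    hold_alloc_kernel<<<blocks, threads>>>(d_ptrs, n, d_failures);
    release_kernel<<<blocks, threads>>>(d_ptrs, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(&h_failures, d_failures, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_ptrs);
    cudaFree(d_failures);
    return err == cudaSuccess ? (int) h_failures : -1;
}
```

Running the helper in separate program runs with different heap sizes for the same n gives a rough feel for how much heap the small allocations really consume once the allocation granularity is taken into account.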