We allocate two large blocks of memory, use a kernel to process the data, and then free the memory. This cycle is repeated continuously for hours at a time, and after a varying number of iterations we get an ‘unable to allocate memory’ error. After much testing we have reduced the code to the attached memAlloc.cpp, which just allocates and then frees blocks of memory with no GPU kernel processing.
The pseudo code for the attached file is as follows:
Set block size to 140 Mbytes
// allocate a large block of memory to ensure increasing the block size is possible
Allocate a single 420 Mbyte block of gpu memory
Free the 420 Mbyte block of gpu memory
loop 1000 times or stop on error
    Allocate a block of memory - block1
    Allocate a second block of memory - block2
    Free block1
    Free block2
    increase the size of the block by 320 bytes
The memory allocation fails somewhere between 500 and 800 iterations.
The same program without changing the size of the allocated blocks runs for many minutes (forever?). A single block, whether fixed or changing in size, also runs for many minutes (forever?).
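For anyone who wants to try the loop without the attachment, here is a minimal sketch of the pseudo code above. It is not the attached memAlloc.cpp. The names runAllocLoop, AllocFn, FreeFn, hostAlloc and hostFree are made up for illustration, and the allocator is passed in as a function pointer so the loop can be checked on the host with malloc/free first. On the GPU you would pass thin wrappers around cudaMalloc/cudaFree, as outlined in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical signatures. On the GPU these would wrap the CUDA runtime, e.g.
//   void* gpuAlloc(std::size_t n) {
//       void* p = nullptr;
//       return cudaMalloc(&p, n) == cudaSuccess ? p : nullptr;
//   }
//   void gpuFree(void* p) { cudaFree(p); }
using AllocFn = void* (*)(std::size_t bytes);  // must return nullptr on failure
using FreeFn  = void (*)(void* ptr);

// Runs the allocate/allocate/free/free cycle from the pseudo code.
// Returns the number of iterations completed before the first failed
// allocation, or maxIters if nothing failed.
int runAllocLoop(AllocFn alloc, FreeFn release,
                 std::size_t blockBytes,   // starting size of each block
                 std::size_t growBytes,    // added to the size per iteration
                 std::size_t warmupBytes,  // single large block, freed at once
                 int maxIters)
{
    assert(alloc != nullptr && release != nullptr);

    // One large allocation up front, freed immediately.
    void* warm = alloc(warmupBytes);
    if (warm == nullptr) return 0;
    release(warm);

    for (int i = 0; i < maxIters; ++i) {
        void* block1 = alloc(blockBytes);
        if (block1 == nullptr) return i;

        void* block2 = alloc(blockBytes);
        if (block2 == nullptr) {
            release(block1);
            return i;
        }

        release(block1);
        release(block2);
        blockBytes += growBytes;
    }
    return maxIters;
}

// Host stand-ins, only for checking the loop logic without a GPU.
static void* hostAlloc(std::size_t bytes) { return std::malloc(bytes); }
static void  hostFree(void* ptr)          { std::free(ptr); }
```

With GPU wrappers, the case described above would correspond to something like runAllocLoop(gpuAlloc, gpuFree, 140000000, 320, 420000000, 1000). Passing growBytes = 0 gives the fixed-size variant.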
GPUs tested: 8800GTX, 9800GTX, GTX260
Driver: 177.35
OS: Windows XP Professional Service Pack 2 (32-bit)
Hosts: Intel quad-core PC, AMD dual-core PCs
Attachment: memAlloc.cpp (2.2 KB)
gdb ../../bin/linux/release/seCudaMemSmoke
GNU gdb 6.6-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run
Starting program: /afs/kip.uni-heidelberg.de/user/mbach2/gpu-dev00/NVIDIA/cudaSDK/bin/linux/release/seCudaMemSmoke
[Thread debugging using libthread_db enabled]
[New Thread 47798758420832 (LWP 28537)]
Successful allocate 410156 KBytes memory
Successful free memory
Start loop allocating 2 memory blocks of 136718 KBytes and then free the blocks
The size of the blocks is increased by 320 for each iteration
loop count 100 size 136750 KBytes
loop count 200 size 136781 KBytes
loop count 300 size 136812 KBytes
loop count 400 size 136843 KBytes
loop count 500 size 136875 KBytes
loop count 600 size 136906 KBytes
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47798758420832 (LWP 28537)]
0x00002b790407d483 in ?? () from /usr/lib/libcuda.so
(gdb) backtrace
#0 0x00002b790407d483 in ?? () from /usr/lib/libcuda.so
#1 0x00002b7904070a23 in ?? () from /usr/lib/libcuda.so
#2 0x00002b79040650d9 in ?? () from /usr/lib/libcuda.so
#3 0x00002b7902ce370b in cudaMalloc () from /usr/local/cuda/lib/libcudart.so.2
#4 0x0000000000400c91 in main ()
(gdb)
IMHO, when you have frequent allocations/deallocations, it’s better to write your own allocator instead of using cudaMalloc/cudaFree. I wrote my own allocator to avoid fragmentation, and it also speeds up allocations/deallocations by a large factor.
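To make the idea concrete, here is a rough sketch of one very simple kind of custom allocator: a bump ("arena") sub-allocator that hands out pieces of a single buffer obtained once up front. This is not the allocator mentioned above, whose design isn't shown in this thread. The Arena class and its method names are hypothetical. It assumes the base pointer is already suitably aligned and that all blocks of one processing cycle can be released together, which fits the allocate, process, free pattern in the first post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over one pre-allocated buffer. The buffer
// would come from a single cudaMalloc call made once at startup; the arena
// only does address arithmetic and never dereferences the memory.
class Arena {
public:
    // alignment must be a power of two; 256 is an assumed default.
    Arena(void* base, std::size_t capacity, std::size_t alignment = 256)
        : base_(static_cast<std::uint8_t*>(base)),
          capacity_(capacity),
          alignment_(alignment),
          used_(0)
    {
        assert(alignment_ != 0 && (alignment_ & (alignment_ - 1)) == 0);
    }

    // Returns a sub-block of the buffer, or nullptr if it does not fit.
    void* allocate(std::size_t bytes) {
        // Round the current offset up to the next multiple of alignment_.
        std::size_t start = (used_ + alignment_ - 1) & ~(alignment_ - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        used_ = start + bytes;
        return base_ + start;
    }

    // Releases every sub-block at once; the backing buffer is kept.
    void reset() { used_ = 0; }

    // Offset just past the most recent sub-block.
    std::size_t used() const { return used_; }

private:
    std::uint8_t* base_;
    std::size_t   capacity_;
    std::size_t   alignment_;
    std::size_t   used_;
};
```

In the loop from the first post, block1 and block2 would each come from allocate(), and the two frees would become a single reset() per iteration, so the driver allocator is only touched once at startup and once at shutdown. The trade-off is that the backing buffer has to be sized for the largest cycle in advance.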