Hi
I usually allocate my host memory with the cudaHostAlloc function in combination with the cudaHostAllocPortable flag because I have many big 3d volumina which were often transferred. The total number of those voluminas is pending, sometimes only a few, sometimes many. My host has 64GB RAM an in worst (usual) case i am allocating blocks with a totla memory of about 32 GB. And here i get sometimes a total crash in the cudaHostAlloc function (no error code, no exception, total crash) because it seems that there is not enough memory free to make a further pinned memory alloc on host…
Currently i let the user decide if he is using pinned or normal host memory:
if ( m_bUsePinnedMem )
{
m_nError = cudaHostAlloc( &oBlockMemory.poHostPointer, m_nMemoryExtentSizeInBytes, cudaHostAllocPortable ); //this crashes sometimes, arg, no try catch works
}
else
{
m_nError = cudaMallocHost( &oBlockMemory.poHostPointer, m_nMemoryExtentSizeInBytes );
}
So my question is: Is there a way to dynamically check if the cudaHostAlloc function will work?
small update/appendix:
there is a print in console “out of memory”
and i retested it without using of pinned memory, it seem also to crash somethimes without the cudaHostAllocPortable flag… but on taskmanager my executable has only about 20GB whereby my RAM is 64GB…
so question remains: is there a way of apriory check for the cudaMallocHost function or a kind of try catch solution for an alternative handling…
i am guessing: my intuition is that pinned memory allocation relative success may equally depend on factors other than mere memory size (64GB)
for one, the amount of new memory requested, and whether such an amount can be successfully pinned - in other words, how (much) you pin (at a time)
you may be using 20GB, and thus have (64 - 20) GB left, but only an additional 5GB can be pinned, given the memory already pinned
even if ‘the total number of those voluminas is pending’, you should be able to determine the maximum pinned memory required at any time, given the maximum memory ‘in flight’ at any given point, and the maximum pinned memory required, when also accounting for mechanisms to hide latency
i doubt whether the device’s ‘in flight’ memory requirements would span such a vast amount (32GB)
I have 2 K10 Teslas on a windows server 2012 Standard x64 machine, Intel Xeon CPU E2670 @ 2.6GHz and 64GB RAM… Software compiled with VS 2013 x64 and cuda 6.5.14 (i know, quite old, but project is also very old :-) ). I am also using AVX if this is important…
The strange thing is that my testdataset works if i make my “Volumina Block Size” bigger, which means:
128x128x128 Elements (1Element = 8byte) → crash after 90 allocs
256x256x256 Elements → crash after 73 allocs
512x512x512 Elements → full dataset with 24 Blocks worked without crash
on my 128x128x128 test i printed the GlobalMemoryStatusEx Windows function, here are my last 2 printings (printing done before cudaMallocHost call) before crash:
Host memory before block alloc:
->There is 9 percent of memory in use.
->There are 67080008 total KB of physical memory.
->There are 60823408 free KB of physical memory.
->There are 75992904 total KB of paging file.
->There are 68933196 free KB of paging file.
->There are 8589934464 total KB of virtual memory
->There are 8474535524 free KB of virtual memory.
->There are 0 free KB of extended memory.
(next print before crash)
Host memory before block alloc:
->There is 9 percent of memory in use.
->There are 67080008 total KB of physical memory.
->There are 60801692 free KB of physical memory.
->There are 75992904 total KB of paging file.
->There are 68921096 free KB of paging file.
->There are 8589934464 total KB of virtual memory
->There are 8474537608 free KB of virtual memory.
->There are 0 free KB of extended memory.
CRASH
the software uses prarllel Data streaming on all 4 GPUs
so from the theory, 128128128*8 = 16384kB shouldn’t be a problem… and it defenitely does not depend on the cudaHostAllocPortable flag (my first assumption was wrong, sorry)
ok, small addon:
if i say block size of 128 then it is 128+2*2 because my blocks have an overlap of 2 voxels (performance reason on border handling), should not be relevant
ok, i now made a small test app:
void TEST( int argc, char** argv )
{
int nBlockSize, nAllocs;
sscanf( argv[1], "%d", &nBlockSize );
sscanf( argv[2], "%d", &nAllocs );
cudaExtent oE = make_cudaExtent( 8 * nBlockSize, nBlockSize, nBlockSize );
int nMemoryExtentSizeInBytes = oE.width * oE.height * oE.depth;
std::vector<void*> oMem( nAllocs );
for ( int i = 0; i < nAllocs; i++ )
{
std::cout << i << std::endl;
std::cout << "-->" << cudaMallocHost( &oMem[i], nMemoryExtentSizeInBytes ) << std::endl;
}
}
With 128 and 4000 it worked perfectly, so i don’t know why it does not work on my app (fragmentation?)… I think i will prellocate 1000 blocks apriory, it is not elegant but will solve it… i don’t see an other solution…
i would interpret the results as supporting, or at least correlating with, the point raised before that pinned memory allocation success is dependent on how you allocate - previous allocations, and the new allocation request
i would think that ordinary memory allocation is much more lenient than pinned memory allocation, as it is virtual - floating - and page-able
pinned memory is exactly the opposite
memory allocation management should be easier for ordinary memory then
i think pinned memory might even fragment, for the reasons above
if this is so, fewer allocations (one allocation) self-managed by the application should have better results
@little_jimmy: You mean something like writing your own allocator for pinned memory, which at start allocates a huge block of pinned memory and then manages this memory pool (manages allocation / deallocation requests) ?
I suppose that could be better than the operating system provided one, as one can adapt the allocator to the allocation/de-allocation characteristics of a certain application.
yes, more or less
i had this in mind:
if pinned memory allocation can fragment when multiple such allocations are done subsequently and serially, it may be better to forward allocate a single block - allocation - of pinned memory, more or less according to the application’s total requirement, after considering and accounting for possible pinned memory reuse - the maximum unique pinned memory requirement by the application at any point in time
the application would then simply allocate pinned memory to functions, by passing an appropriate pointer as offset into the block to functions
the application can keep track of pinned memory already in use, and would simply offset pinned memory within the block, such that functions do not step on each others’ toes
the application itself may also be better informed of future requests/ frees, and may adjust offsets accordingly
this is exactly the reason why pinned memory allocation/ management may be troublesome:
the host - or os - may be unaware of future allocations/ deallocations: allocation/ deallocation may occur on the fly - new ones instigated, with old ones destroyed
to remove fragmentation, the host - os - must physically rewrite and adjust the pinned memory addresses, which may be troublesome, to say the least
a single allocation also has the benefit of reduced overhead associated with the actual allocation of pinned memory
continuously allocating/ freeing pinned memory may be expensive; passing pointers is cheap
ok, i will try that “preallocation” but it is a little bit awful since it is not exactly clear how many i would need
i also have to say that i (except small (help)variables) reuse memory and do not free/reallocate it… so fragementation should be very optimized…maybe some variables in external 3rd party libs…
reuse can mean many things
“it is a little bit awful since it is not exactly clear how many i would need”
you should be able to estimate the value with reasonable certainty, for the simple reason that you should verify a priori that you have sufficient physical memory in your host, beforehand
you seem to be working with blocks of data
what is the variance of the block sizes - the difference between the smallest possible block, and largest possible block, and how does the variance relate to average size of blocks?
Reuse means:
- Number of blocks do not decrease during computation (so there is only a free after all is finished, not during computation)
- Big Arrays are not reallocated on each subroutine (maybe in some 3rd party functions/helper variables)
The blocksize is constant during computation and can be set at startup. It is only relevant for performance reasons since 1 block is computed at each cuda call/GPU. If block is smaller you need more blocks to process same ammount of data… Only border voxels are multiple stored… On my performance tests 256 seems best.
The increasing “unknown” is the number of blocks which refers to
- my data ressolution
- the input data set
In fact it is a 3d Object scanning software so the blocks refer to voluminas, and you can scan arbitrary objects (small of 5cm) to scale of buildings, ressolution depends on used camera accurancy…
Yes for sure i could determine memory and allocate 70-80% of it but i do not really like that solution but seems to be the only one…
“It is only relevant for performance reasons since 1 block is computed at each cuda call/GPU. If block is smaller you need more blocks to process same ammount of data”
i presume you process the 3d objects, and do not only render them
"The increasing “unknown” is the number of blocks which refers to
- my data ressolution
- the input data set"
the 1st item seems to merely imply a multiplier factor of memory requirement per block processed
the 2nd item seems to imply a multiplier factor of the number of blocks to process
i would again point to the measure of the number of blocks the device(s) is (are) to process, relative to the total number of such blocks a device can process simultaneously at any given time
from your data requirements, i presume the former far greater than the latter
if this is so, it is pointless to attempt to feed the device more than it can consume at any point, if there is no good reason to wish to do so
if you work with the amount of blocks the device simultaneously consumes, rather than the total number of blocks to process, your memory requirement uncertainty now scales with a far smaller variable, and is a lot more manageable, and a natural uncertainty ceiling is implied as well
and this does not necessarily imply greater latency/ delays
hence, i would question and investigate the minimum amount of the input data set that must be present on the device at any given time