Hi,
I have a code snippet that repeatedly tries to allocate a larger block on each attempt and returns the maximum amount of pinnable memory. This is relatively easy to do with the CUDA API (by querying the free memory), but I need to implement it without the CUDA API so it also works for other backends such as OpenMP.
We are using the Nvidia NVHPC compiler suite.
The code looks as follows:
size_t allocableSize(size_t base, int deviceId) {
    bool allocSuccess = true;
    // base is the allocation granularity (alignment size); any modern GPU has ~1 GB of memory.
    // We keep trying to allocate more each iteration, growing by `base` every try.
    // The increment is base doubled until it is at least ~250 MB, so it stays a multiple of base.
    size_t min = static_cast<size_t>(2.5e8);
    while (base < min) {
        base += base;
    }
    size_t sizeToAlloc = base;
    do {
        char* allocatedMemory = nullptr;
        std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc) * 1e-9 << " GB" << std::endl;
        try {
            // Assign to the outer pointer; re-declaring it here would shadow it
            // and leave the catch block freeing a nullptr.
            allocatedMemory = new char[sizeToAlloc];
            void* u = omp_target_alloc(sizeToAlloc, deviceId);
            if (u == NULL) {
                throw std::runtime_error("No memory (omp_target_alloc)");
            } else {
                omp_target_free(u, deviceId);
                u = NULL;
            }
            #pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
            #pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)
            // Release the mapping again; without this every iteration leaks its
            // device copy (visible in the present table dump in the output below).
            #pragma omp target exit data map(release: allocatedMemory[0:sizeToAlloc]) device(deviceId)
            sizeToAlloc += base;
            delete[] allocatedMemory; // allocated with new[], so delete[] rather than free()
        } catch (const std::exception& e) { // new throws std::bad_alloc, not an int
            delete[] allocatedMemory;
            allocatedMemory = nullptr;
            allocSuccess = false;
        }
    } while (allocSuccess);
    std::cout << static_cast<double>(sizeToAlloc - base) * 1e-9 << " GB can be allocated" << std::endl;
    return sizeToAlloc - base;
}
I would expect ~3.6 GB for the GPU I am currently testing with. If I only use omp_target_alloc, the do-while loop never breaks (the call never returns NULL), and if I use the omp target enter data clause I get a runtime error that is impossible to catch. An example output:
Try to alloc: 0.268435 GB
Try to alloc: 0.536871 GB
Try to alloc: 0.805306 GB
Try to alloc: 1.07374 GB
Try to alloc: 1.34218 GB
Out of memory allocating 1342177280 bytes of device memory
Failing in Thread:1
total/free CUDA memory: 4091478016/1325793280
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.6, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7fb5f9fff010 device:0x7fb5b8000000 size:1073741824 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb66bfff010 device:0x7fb63a000000 size:805306368 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb6bdfff010 device:0x7fb69c000000 size:536870912 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb71e7ff010 device:0x7fb6e2000000 size:268435456 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
allocated block device:0x7fb5b8000000 size:1073741824 thread:1
allocated block device:0x7fb63a000000 size:805306368 thread:1
allocated block device:0x7fb69c000000 size:536870912 thread:1
allocated block device:0x7fb6e2000000 size:268435456 thread:1
call to cuMemAlloc returned error 2: Out of memory
The OpenMP standard says that omp_target_alloc should return NULL if the memory cannot be allocated, so I would expect my check to throw and terminate the loop before the target enter data map clause ever gets the chance to fail.
Could it be the case that omp_target_alloc uses unified memory? Is there a way to prevent it from doing so? And is there a way to handle a target enter data map clause that fails, i.e. to catch that runtime error instead of letting the program terminate?