omp_target_alloc never returns NULL

Hi,

I have a code snippet that tries to allocate more memory on each attempt and returns the maximum amount of pinnable memory. This is relatively easy to do with the CUDA API (by asking for the free memory), but I need to implement it without the CUDA API for other backends such as OpenMP.
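
For reference, the CUDA-side query alluded to above is cudaMemGetInfo, which returns the free and total device memory; a minimal sketch of that approach:

#include <cuda_runtime.h>
#include <iostream>

int main(){
    size_t freeBytes = 0, totalBytes = 0;
    // cudaMemGetInfo reports the free and total device memory in bytes
    cudaError_t code = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (code != cudaSuccess){
        std::cerr << cudaGetErrorString(code) << std::endl;
        return 1;
    }
    std::cout << "free: " << freeBytes * 1e-9 << " GB, total: "
              << totalBytes * 1e-9 << " GB" << std::endl;
    return 0;
}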

We are using the Nvidia NVHPC compiler suite.
The code looks as follows:

size_t allocableSize(size_t base, int deviceId){
bool allocSuccess = true;

// base is the alignment size; we double it until it reaches the ~250 MB
// minimum, so the increment is a multiple of base, and each attempt
// then grows the allocation by one more increment
size_t min = static_cast<size_t>(2.5 * 1e8);
while (base < min){
    base += base;
}
size_t sizeToAlloc = base;

do {
    char* allocatedMemory = nullptr;
    std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
    try{
        allocatedMemory = new char[sizeToAlloc];
        void* u = omp_target_alloc(sizeToAlloc, deviceId);
        if (u == NULL){
            throw std::runtime_error("No memory (omp_target_alloc)");
        } else {
            omp_target_free(u, deviceId);
            u = NULL;
        }
        #pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
        #pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)
        sizeToAlloc += base;
        delete[] allocatedMemory;
    } catch (const std::exception&) {
        delete[] allocatedMemory;
        allocatedMemory = nullptr;
        allocSuccess = false;
    }
} while (allocSuccess);

std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
return sizeToAlloc - base;

}

I would expect ~3.6 GB for the GPU I am currently testing with. If I only use omp_target_alloc, the do-while loop never breaks; and if I use the target enter data clause, I get a runtime error that is impossible to catch. Example output:

Try to alloc: 0.268435 GB
Try to alloc: 0.536871 GB
Try to alloc: 0.805306 GB
Try to alloc: 1.07374 GB
Try to alloc: 1.34218 GB
Out of memory allocating 1342177280 bytes of device memory
Failing in Thread:1
total/free CUDA memory: 4091478016/1325793280
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.6, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7fb5f9fff010 device:0x7fb5b8000000 size:1073741824 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb66bfff010 device:0x7fb63a000000 size:805306368 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb6bdfff010 device:0x7fb69c000000 size:536870912 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb71e7ff010 device:0x7fb6e2000000 size:268435456 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
allocated block device:0x7fb5b8000000 size:1073741824 thread:1
allocated block device:0x7fb63a000000 size:805306368 thread:1
allocated block device:0x7fb69c000000 size:536870912 thread:1
allocated block device:0x7fb6e2000000 size:268435456 thread:1
call to cuMemAlloc returned error 2: Out of memory

The OpenMP standard says that omp_target_alloc should return NULL if the memory cannot be allocated; therefore I believe the loop should terminate (via the thrown exception) before the target enter data map clause ever fails.

Could it be the case that omp_target_alloc uses unified memory? Is there a way to prevent omp_target_alloc from using unified memory? And is there a way to handle a target enter data map clause that would otherwise fail, so that the failure can be caught as a runtime error instead of aborting?
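
For what it's worth, a spec-level fallback I am considering is doing the transfer with the API routine omp_target_memcpy, which reports failure through a nonzero return value instead of aborting; a minimal sketch (untested with NVHPC):

#include <omp.h>
#include <cstddef>

// Probe a single allocation of sizeToAlloc bytes using only API routines,
// so every failure should come back as a return value rather than an abort.
bool probeOnce(size_t sizeToAlloc, int deviceId){
    char* hostBuf = new char[sizeToAlloc];
    void* devBuf = omp_target_alloc(sizeToAlloc, deviceId);
    bool ok = (devBuf != NULL);
    if (ok){
        // omp_target_memcpy returns zero on success, nonzero on failure
        ok = (omp_target_memcpy(devBuf, hostBuf, sizeToAlloc, 0, 0,
                                deviceId, omp_get_initial_device()) == 0);
        omp_target_free(devBuf, deviceId);
    }
    delete[] hostBuf;
    return ok;
}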

No, we’re not using UM unless you add the “-gpu=managed” flag. Most likely our runtime is catching the out-of-memory error and aborting rather than passing it through as a NULL pointer. I’ve filed an issue report, TPR #33798, and sent it to engineering for review.

-Mat

I passed 1024 as the base argument, and the code snippet is a function that returns the number of bytes we can allocate. (The whole program is just a main method that calls the function with the argument 1024; doubling 1024 up to the ~250 MB threshold gives 2^28 = 268435456 bytes, which matches the 0.268435 GB of the first attempt in the output above.)
Let me just paste the whole file here again:

#include <omp.h>
#include <iostream>
#include <stdexcept>

size_t allocableSize(size_t base, int deviceId){
    bool allocSuccess = true;

    // base is the alignment size; we double it until it reaches the ~250 MB
    // minimum, so the increment is a multiple of base, and each attempt
    // then grows the allocation by one more increment
    size_t min = static_cast<size_t>(2.5 * 1e8);
    while (base < min){
        base += base;
    }
    size_t sizeToAlloc = base;

    do {
        char* allocatedMemory = nullptr;
        std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
        try{
            allocatedMemory = new char[sizeToAlloc];
            void* u = omp_target_alloc(sizeToAlloc, deviceId);
            if (u == NULL){
                throw std::runtime_error("No memory (omp_target_alloc)");
            } else {
                omp_target_free(u, deviceId);
                u = NULL;
            }
            #pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
            #pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)

            sizeToAlloc += base;
            delete[] allocatedMemory;
        } catch (const std::exception&) {
            delete[] allocatedMemory;
            allocatedMemory = nullptr;
            allocSuccess = false;
        }
    } while (allocSuccess);
    
    std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
    return sizeToAlloc - base;
}


int main(){
    allocableSize(1024, 0);
}

Furthermore, I compiled it with the following command:
nvc++ -mp=gpu -std=c++20 alloc_omp.cpp -o allocomp

I have implemented the same approach with CUDA, and it works as expected. (I know that asking for the free memory via the CUDA API would be easier, but I wanted to mimic the behavior of the OpenMP version.)

#include <cuda_runtime.h>
#include <iostream>
#include <stdio.h>

size_t allocableSize(size_t base, int deviceId){
    bool allocSuccess = true;

    // base is the alignment size; we double it until it reaches the ~250 MB
    // minimum, so the increment is a multiple of base, and each attempt
    // then grows the allocation by one more increment
    size_t min = static_cast<size_t>(2.5 * 1e8);
    while (base < min){
        base += base;
    }
    size_t sizeToAlloc = base;

    do {
        void* allocatedMemory = nullptr;
        //std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
        try{
            cudaError_t code = cudaMalloc(&allocatedMemory, sizeToAlloc);
            if (code != cudaSuccess) {
                allocSuccess = false;
            } else {
                sizeToAlloc += base;
            }
            cudaFree(allocatedMemory);
        } catch (...) {
            cudaFree(allocatedMemory);
            allocSuccess = false;
        }
    } while (allocSuccess);

    //std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
    return sizeToAlloc - base;
}

int main(){
    allocableSize(1024, 0);
}

The command used to compile:
nvcc alloc_cuda.cu -o alloccuda

You’re missing a “target exit data” directive, so the device copy of “allocatedMemory” is never freed.

Add the following after the update directive:

#pragma omp target exit data map(delete: allocatedMemory) device(deviceId)
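
Applied to your listing, the end of the try block would then look something like this:

#pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
#pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)
// release the device copy so each attempt starts from a clean slate
#pragma omp target exit data map(delete: allocatedMemory) device(deviceId)
sizeToAlloc += base;
delete[] allocatedMemory;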

Hi budanaz.yakup,

The engineer has let me know that TPR #33798 should be fixed in our 23.7 release.

-Mat