omp_target_alloc never returns NULL

Hi,

I have a code snippet that tries to allocate more memory on each attempt and returns the maximum amount of pinnable memory. This is relatively easy to do with the CUDA API (by asking for the free memory), but I need to implement it without the CUDA API for other backends such as OpenMP.
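
For reference, the CUDA-side query alluded to above is cudaMemGetInfo, which returns the free and total device memory; a minimal sketch of that approach:

#include <cuda_runtime.h>
#include <iostream>

int main(){
    size_t freeBytes = 0, totalBytes = 0;
    // cudaMemGetInfo reports the free and total device memory in bytes
    cudaError_t code = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (code != cudaSuccess){
        std::cerr << cudaGetErrorString(code) << std::endl;
        return 1;
    }
    std::cout << "free: " << freeBytes * 1e-9 << " GB, total: "
              << totalBytes * 1e-9 << " GB" << std::endl;
    return 0;
}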

We are using the Nvidia NVHPC compiler suite.
The code looks as follows:

size_t allocableSize(size_t base, int deviceId){
bool allocSuccess = true;

// base is the alignment size; we double it until it reaches the ~250 MB
// minimum, so the increment is a multiple of base, and each attempt
// then grows the allocation by one more increment
size_t min = static_cast<size_t>(2.5 * 1e8);
while (base < min){
    base += base;
}
size_t sizeToAlloc = base;

do {
    char* allocatedMemory = nullptr;
    std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
    try{
        allocatedMemory = new char[sizeToAlloc];
        void* u = omp_target_alloc(sizeToAlloc, deviceId);
        if (u == NULL){
            throw std::runtime_error("No memory (omp_target_alloc)");
        } else {
            omp_target_free(u, deviceId);
            u = NULL;
        }
        #pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
        #pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)
        sizeToAlloc += base;
        delete[] allocatedMemory;
    } catch (const std::exception&) {
        delete[] allocatedMemory;
        allocatedMemory = nullptr;
        allocSuccess = false;
    }
} while (allocSuccess);

std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
return sizeToAlloc - base;

}

I would expect ~3.6 GB for the GPU I am currently testing with. If I only use omp_target_alloc, the do-while loop never breaks; and if I use the target enter data clause, I get a runtime error that is impossible to catch. Example output:

Try to alloc: 0.268435 GB
Try to alloc: 0.536871 GB
Try to alloc: 0.805306 GB
Try to alloc: 1.07374 GB
Try to alloc: 1.34218 GB
Out of memory allocating 1342177280 bytes of device memory
Failing in Thread:1
total/free CUDA memory: 4091478016/1325793280
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.6, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7fb5f9fff010 device:0x7fb5b8000000 size:1073741824 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb66bfff010 device:0x7fb63a000000 size:805306368 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb6bdfff010 device:0x7fb69c000000 size:536870912 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
host:0x7fb71e7ff010 device:0x7fb6e2000000 size:268435456 presentcount:0+1 line:29 name:allocatedMemory[:sizeToAlloc]
allocated block device:0x7fb5b8000000 size:1073741824 thread:1
allocated block device:0x7fb63a000000 size:805306368 thread:1
allocated block device:0x7fb69c000000 size:536870912 thread:1
allocated block device:0x7fb6e2000000 size:268435456 thread:1
call to cuMemAlloc returned error 2: Out of memory

The OpenMP standard says that omp_target_alloc should return NULL if the memory cannot be allocated; therefore I believe the loop should terminate (via the thrown exception) before the target enter data map clause ever fails.

Could it be the case that omp_target_alloc uses unified memory? Is there a way to prevent omp_target_alloc from using unified memory? And is there a way to handle a target enter data map clause that would otherwise fail, so that the failure can be caught as a runtime error instead of aborting?
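
For what it's worth, a spec-level fallback I am considering is doing the transfer with the API routine omp_target_memcpy, which reports failure through a nonzero return value instead of aborting; a minimal sketch (untested with NVHPC):

#include <omp.h>
#include <cstddef>

// Probe a single allocation of sizeToAlloc bytes using only API routines,
// so every failure should come back as a return value rather than an abort.
bool probeOnce(size_t sizeToAlloc, int deviceId){
    char* hostBuf = new char[sizeToAlloc];
    void* devBuf = omp_target_alloc(sizeToAlloc, deviceId);
    bool ok = (devBuf != NULL);
    if (ok){
        // omp_target_memcpy returns zero on success, nonzero on failure
        ok = (omp_target_memcpy(devBuf, hostBuf, sizeToAlloc, 0, 0,
                                deviceId, omp_get_initial_device()) == 0);
        omp_target_free(devBuf, deviceId);
    }
    delete[] hostBuf;
    return ok;
}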

No, we’re not using UM unless you add the “-gpu=managed” flag. Most likely our runtime is catching the out-of-memory error and aborting rather than passing it through as a NULL pointer. I’ve filed an issue report, TPR #33798, and sent it to engineering for review.

-Mat

I passed 1024 as the base argument, and the code snippet is a function that returns the number of bytes we can allocate. (The whole program is just a main method that calls the function with the argument 1024; doubling 1024 up to the ~250 MB threshold gives 2^28 = 268435456 bytes, which matches the 0.268435 GB of the first attempt in the output above.)
Let me just paste the whole file here again:

#include <omp.h>
#include <iostream>
#include <stdexcept>

size_t allocableSize(size_t base, int deviceId){
    bool allocSuccess = true;

    // base is the alignment size; we double it until it reaches the ~250 MB
    // minimum, so the increment is a multiple of base, and each attempt
    // then grows the allocation by one more increment
    size_t min = static_cast<size_t>(2.5 * 1e8);
    while (base < min){
        base += base;
    }
    size_t sizeToAlloc = base;

    do {
        char* allocatedMemory = nullptr;
        std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
        try{
            allocatedMemory = new char[sizeToAlloc];
            void* u = omp_target_alloc(sizeToAlloc, deviceId);
            if (u == NULL){
                throw std::runtime_error("No memory (omp_target_alloc)");
            } else {
                omp_target_free(u, deviceId);
                u = NULL;
            }
            #pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
            #pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)

            sizeToAlloc += base;
            delete[] allocatedMemory;
        } catch (const std::exception&) {
            delete[] allocatedMemory;
            allocatedMemory = nullptr;
            allocSuccess = false;
        }
    } while (allocSuccess);
    
    std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
    return sizeToAlloc - base;
}


int main(){
    allocableSize(1024, 0);
}

Furthermore, I compiled it with the following command:
nvc++ -mp=gpu -std=c++20 alloc_omp.cpp -o allocomp

I have implemented the same approach with CUDA, and it works as expected. (I know that asking for the free memory via the CUDA API would be easier, but I wanted to mimic the behavior of the OpenMP version.)

#include <cuda_runtime.h>
#include <iostream>
#include <stdio.h>

size_t allocableSize(size_t base, int deviceId){
    bool allocSuccess = true;

    // base is the alignment size; we double it until it reaches the ~250 MB
    // minimum, so the increment is a multiple of base, and each attempt
    // then grows the allocation by one more increment
    size_t min = static_cast<size_t>(2.5 * 1e8);
    while (base < min){
        base += base;
    }
    size_t sizeToAlloc = base;

    do {
        void* allocatedMemory = nullptr;
        //std::cout << "Try to alloc: " << static_cast<double>(sizeToAlloc * 1e-9) << " GB" << std::endl;
        try{
            cudaError_t code = cudaMalloc(&allocatedMemory, sizeToAlloc);
            if (code != cudaSuccess) {
                allocSuccess = false;
            } else {
                sizeToAlloc += base;
            }
            cudaFree(allocatedMemory);
        } catch (...) {
            cudaFree(allocatedMemory);
            allocSuccess = false;
        }
    } while (allocSuccess);

    //std::cout << static_cast<double>((sizeToAlloc - base) * 1e-9) << " GB can be allocated" << std::endl;
    return sizeToAlloc - base;
}

int main(){
    allocableSize(1024, 0);
}

The command used to compile:
nvcc alloc_cuda.cu -o alloccuda

You’re missing a “target exit data” directive, so the device copy of “allocatedMemory” is never freed.

Add the following after the update directive:

#pragma omp target exit data map(delete: allocatedMemory) device(deviceId)
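
Applied to your listing, the end of the try block would then look something like this:

#pragma omp target enter data map(to: allocatedMemory[0:sizeToAlloc]) device(deviceId)
#pragma omp target update from(allocatedMemory[0:sizeToAlloc]) device(deviceId)
// release the device copy so each attempt starts from a clean slate
#pragma omp target exit data map(delete: allocatedMemory) device(deviceId)
sizeToAlloc += base;
delete[] allocatedMemory;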

Hi budanaz.yakup,

The engineer has let me know that TPR #33798 should be fixed in our 23.7 release.

-Mat