Program stays stuck on cudaErrorMemoryAllocation after a failed cudaMallocHost

Hi, in my application we call cudaMallocHost; the amount of memory to allocate depends on the user’s choices, so it can happen that an allocation fails because the system runs out of memory. When that happens the call returns cudaErrorMemoryAllocation, and in that case we call cudaFreeHost, which returns cudaErrorMemoryAllocation as well (because, as documented, it may also return error codes from previous, asynchronous launches).

My problem is that from that point until the end of the program every cudaMallocHost fails, even the smallest one. Did the cudaFreeHost calls actually fail? How can I recover from this situation?

For example: I try to allocate 10 buffers of 1 GB each. The allocation fails at the 8th with cudaErrorMemoryAllocation, so I call cudaFreeHost on the first 7, and all of those calls return cudaErrorMemoryAllocation as well (?). Then I try to allocate 512 KB, and that keeps failing too.

Here is some sample code:

#include <cuda_runtime.h>
#include <iostream>

#define N_ALLOCATION 10
#define SIZE ((size_t)2 * (size_t)1024 * (size_t)1024 * (size_t)1024)  // 2 GiB per allocation

// Returns true on success, otherwise prints the numeric error code.
bool MyCheck(const cudaError_t &err)
{
    if(err == cudaSuccess) return true;
    std::cout << "Error " << err << std::endl;
    return false;
}

int main()
{
    void* ptr[N_ALLOCATION] = {nullptr};

    // Allocate pinned host buffers until one of the allocations fails.
    for(int i = 0; i < N_ALLOCATION; ++i)
    {
        std::cout << "allocation " << i << std::endl;
        if(!MyCheck(cudaMallocHost(&(ptr[i]), SIZE)))
            break;
    }

    // Free everything that was successfully allocated.
    for(int i = 0; i < N_ALLOCATION; ++i)
    {
        if(ptr[i])
        {
            std::cout << "free " << i << std::endl;
            if(MyCheck(cudaFreeHost(ptr[i])))
            {
                ptr[i] = nullptr;
            }
        }
    }

    // A fresh allocation after the frees; the expectation is that it succeeds.
    std::cout << "allocation after free" << std::endl;
    MyCheck(cudaMallocHost(&(ptr[0]), SIZE));
    return 0;
}

On my computer its output is:

allocation 0
allocation 1
allocation 2
allocation 3
allocation 4
allocation 5
allocation 6
allocation 7
Error 2
free 0
free 1
Error 2
free 2
Error 2
free 3
Error 2
free 4
Error 2
free 5
Error 2
free 6
Error 2
allocation after free
Error 2

My expectation is that the “allocation after free” and the free calls do not return errors.

Thanks,
Perry

PS. I copied my Stack Overflow question here because I am still stuck on this.

This is pretty much the same report as here, correct?

If so, the behavior I observe on Linux is that the error is not sticky. I think you are on Windows. I would still expect the non-sticky behavior there, but I haven’t tested on Windows. Knowing that bugs are always possible, I would suggest that you update to the latest driver and CUDA versions and retest.
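
As a side note, one way to verify on your setup whether the error state is actually sticky is to clear it with cudaGetLastError() and then retry a small request. A minimal sketch along those lines (the oversized request is just a way to force cudaErrorMemoryAllocation):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    void *p = nullptr;
    // deliberately oversized request to force cudaErrorMemoryAllocation
    cudaError_t err = cudaMallocHost(&p, (size_t)1 << 46);
    std::cout << "big alloc:   " << cudaGetErrorString(err) << std::endl;

    // cudaGetLastError returns the last error and resets it to cudaSuccess
    std::cout << "last error:  " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    // if the error is not sticky, a small allocation should now succeed
    err = cudaMallocHost(&p, (size_t)512 * 1024);
    std::cout << "small alloc: " << cudaGetErrorString(err) << std::endl;
    if (err == cudaSuccess) cudaFreeHost(p);
    return 0;
}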

I am unable to reproduce this issue on my Windows 10 Pro platform with driver version 552.23:

>hostmalloc_issue
allocation 0
allocation 1
allocation 2
allocation 3
allocation 4
allocation 5
allocation 6
allocation 7
Error 2
free 0
free 1
free 2
free 3
free 4
free 5
free 6
allocation after free

Correct, it is the same report.
It is on Windows; I tested with CUDA 12.2 and CUDA 12.4, and with the latest drivers.
I don’t have access to my work PC right now, so I can’t check the driver version.

It seems likely to me that the issue is unique or specific to your setup, not a general problem with Windows usage of CUDA, based for example on the report from njuffa.

Therefore it would probably be necessary to modify setup components via trial and error to find the culprit. I don’t think this is expected/typical behavior in CUDA, and it does not seem to be readily reproducible in other setups.

I had this issue on several systems where we installed the same driver. Probably it is something that was solved at some point. I will report in detail tomorrow.

I am not trying to be facetious here: Try rebooting the system before trying anything else. Yes, this is like “percussive maintenance” of hardware components, but for otherwise inexplicable issues on Windows it is at times effective.

There is certainly a possibility that the issue observed is due to a particular combination of software components.

Many thanks for your effort. If you look at the Stack Overflow topic, it is from the start of the month. I remember that I installed the new driver and tried another CUDA SDK. As I said, I will report the details tomorrow. I am pretty sure that I downloaded the latest driver around the 10th of August, and I tried the 12.2 and 12.4 versions of the CUDA SDK.

The NVIDIA driver package I have installed on my Windows system is certainly not the latest. It reports that it supports CUDA 12.4, while the latest CUDA version is 12.6. I am not a friend of updating installed drivers unless I have a specific need to do so. I have also not installed the latest Windows patches yet; I usually hold off installing the regularly-scheduled Microsoft patches for two to three weeks.

In addition to the versions of NVIDIA software components, the version of Windows may be significant. Within the past two years I have had at least one CUDA-accelerated application that stopped working after a regular Windows update, and I could not get it to work again. It seemed to fail silently deep within the Windows software stack. It is working again now; I tried it again a few weeks ago and it worked.

cudaMallocHost is a fairly thin wrapper around operating system API calls, so it certainly seems possible that something could go wrong at the Windows level that contributes to the issue you are observing.
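
As a quick experiment (just a sketch, not a recommendation for production code), you could take the CUDA allocator partly out of the picture by letting the OS provide the memory and page-locking it afterwards with cudaHostRegister. If that path misbehaves in the same way once the machine is in the “bad state”, the problem is more likely below the CUDA runtime:

#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

int main()
{
    const size_t size = (size_t)512 * 1024 * 1024;   // 512 MiB, adjust as needed
    // let the C runtime / OS provide the memory ...
    void *p = std::malloc(size);
    if (!p) { std::cout << "malloc failed" << std::endl; return 1; }

    // ... then page-lock it so it can be used like cudaMallocHost memory
    cudaError_t err = cudaHostRegister(p, size, cudaHostRegisterDefault);
    std::cout << "cudaHostRegister: " << cudaGetErrorString(err) << std::endl;

    if (err == cudaSuccess)
        cudaHostUnregister(p);
    std::free(p);
    return 0;
}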

Hello,
it looks like the issue does not happen if the system has just been started.
I tried adding a piece of code at the start of the application:

int main()
{
    uint32_t deviceCount;
    int32_t devices[8];
    //MyCheck(cudaGLGetDevices(&deviceCount, devices, 8, cudaGLDeviceListAll));

And the issue happens. As you can see, the CUDA call is commented out; that is because it gives an error (I think because there is no OpenGL context or window). The strange thing is that even after I commented it out, the problem is still present!

The first time I saw this issue I had driver 537.13; now I have 552.86. My laptop has an “NVIDIA RTX A3000 12GB Laptop GPU”.

Regards

Sorry, your descriptions are unclear to me now.

  1. If you first restart your system, then run exactly the code you showed at the beginning of this post (same as the code you showed on your SO post), are you saying in that situation the problem does not occur?

  2. Then could you explain, with complete code samples, what worked and what didn’t after that?

Sorry, I just tried again.
After the system restart, it works as expected… after I have been using the PC for a while, the issue happens. I don’t know what causes it.

IT IS NOT cudaGLGetDevices

The complete code sample is present in the first post.

It is difficult to track down the source of this kind of error. Basically, at some point the system gets into a “bad state” and no longer works as expected after that.

Can we get a more precise definition of “after a while”? If you reboot the system and then execute your test app over and over again with a script, after how many iterations do you observe failure?
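
For example, a small driver program along these lines could run the test repeatedly. The executable name hostmalloc_issue.exe is taken from the output shown earlier in this thread, and the sketch assumes the test program is modified to return a non-zero exit code when MyCheck reports an error:

#include <cstdlib>
#include <iostream>

int main()
{
    for (int i = 1; i <= 1000; ++i)
    {
        std::cout << "=== run " << i << " ===" << std::endl;
        // assumes the test program exits with a non-zero code when MyCheck fails
        int rc = std::system("hostmalloc_issue.exe");
        if (rc != 0)
        {
            std::cout << "failure reported on run " << i << std::endl;
            break;
        }
    }
    return 0;
}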

Since this is a laptop, I note that a common way in which such systems get into a “bad state” with respect to drivers is when they go through sleep / wake cycles. I have been bitten by this enough times on Windows systems that I habitually turn off all power saving features. This is obviously a crude workaround rather than a real solution, and it won’t be feasible if the machine is used in truly mobile fashion where it needs to run off the battery.