Program stays stuck on cudaErrorMemoryAllocation after a failed cudaMallocHost

Hi, in my application we call cudaMallocHost; the amount of memory to allocate depends on the user’s choices, so it can happen that an allocation fails because the system runs out of memory. When that happens the call returns cudaErrorMemoryAllocation, and in that case we call cudaFreeHost, which returns cudaErrorMemoryAllocation as well (because, as documented, it may also return error codes from previous, asynchronous launches).

My problem is that from that point until the end of the program every cudaMallocHost fails, even the smallest one. Did the cudaFreeHost calls actually fail? How can I recover from this situation?

For example: I try to allocate 10 buffers of 1 GB each. The allocation fails at the 8th with cudaErrorMemoryAllocation, so I call cudaFreeHost on the first 7, and all of those calls return cudaErrorMemoryAllocation as well (?). Then I try to allocate 512 KB, and that keeps failing too.

Here is some sample code:

#include <cuda_runtime.h>
#include <iostream>

#define N_ALLOCATION 10
#define SIZE ((size_t)2 * (size_t)1024 * (size_t)1024 * (size_t)1024)  // 2 GiB per allocation

// Returns true on success, otherwise prints the numeric error code.
bool MyCheck(const cudaError_t &err)
{
    if(err == cudaSuccess) return true;
    std::cout << "Error " << err << std::endl;
    return false;
}

int main()
{
    void* ptr[N_ALLOCATION] = {nullptr};

    // Allocate pinned host buffers until one of the allocations fails.
    for(int i = 0; i < N_ALLOCATION; ++i)
    {
        std::cout << "allocation " << i << std::endl;
        if(!MyCheck(cudaMallocHost(&(ptr[i]), SIZE)))
            break;
    }

    // Free everything that was successfully allocated.
    for(int i = 0; i < N_ALLOCATION; ++i)
    {
        if(ptr[i])
        {
            std::cout << "free " << i << std::endl;
            if(MyCheck(cudaFreeHost(ptr[i])))
            {
                ptr[i] = nullptr;
            }
        }
    }

    // A fresh allocation after the frees; the expectation is that it succeeds.
    std::cout << "allocation after free" << std::endl;
    MyCheck(cudaMallocHost(&(ptr[0]), SIZE));
    return 0;
}

On my computer its output is:

allocation 0
allocation 1
allocation 2
allocation 3
allocation 4
allocation 5
allocation 6
allocation 7
Error 2
free 0
free 1
Error 2
free 2
Error 2
free 3
Error 2
free 4
Error 2
free 5
Error 2
free 6
Error 2
allocation after free
Error 2

My expectation is that the “allocation after free” and the free calls do not return errors.

Thanks,
Perry

PS. I copied my Stack Overflow question here because I am still stuck on this.

This is pretty much the same report as here, correct?

If so, the behavior I observe on Linux is that the error is not sticky. I think you are on Windows. I would still expect the non-sticky behavior there, but I haven’t tested on Windows. Knowing that bugs are always possible, I would suggest that you update to the latest driver and CUDA versions and retest.
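
As a side note, one way to verify on your setup whether the error state is actually sticky is to clear it with cudaGetLastError() and then retry a small request. A minimal sketch along those lines (the oversized request is just a way to force cudaErrorMemoryAllocation):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    void *p = nullptr;
    // deliberately oversized request to force cudaErrorMemoryAllocation
    cudaError_t err = cudaMallocHost(&p, (size_t)1 << 46);
    std::cout << "big alloc:   " << cudaGetErrorString(err) << std::endl;

    // cudaGetLastError returns the last error and resets it to cudaSuccess
    std::cout << "last error:  " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    // if the error is not sticky, a small allocation should now succeed
    err = cudaMallocHost(&p, (size_t)512 * 1024);
    std::cout << "small alloc: " << cudaGetErrorString(err) << std::endl;
    if (err == cudaSuccess) cudaFreeHost(p);
    return 0;
}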

I am unable to reproduce this issue on my Windows 10 Pro platform with driver version 552.23:

>hostmalloc_issue
allocation 0
allocation 1
allocation 2
allocation 3
allocation 4
allocation 5
allocation 6
allocation 7
Error 2
free 0
free 1
free 2
free 3
free 4
free 5
free 6
allocation after free

Correct, it is the same report.
It is on Windows; I tested with CUDA 12.2 and CUDA 12.4, and with the latest drivers.
I don’t have access to my work PC right now, so I can’t check the driver version.

It seems likely to me that the issue is unique or specific to your setup, not a general problem with Windows usage of CUDA, based for example on the report from njuffa.

Therefore it would probably be necessary to modify setup components via trial and error to find the culprit. I don’t think this is expected/typical behavior in CUDA, and it does not seem to be readily reproducible in other setups.

I had this issue on several systems where we installed the same driver. Probably it is something that was solved at some point. I will report in detail tomorrow.

I am not trying to be facetious here: Try rebooting the system before trying anything else. Yes, this is like “percussive maintenance” of hardware components, but for otherwise inexplicable issues on Windows it is at times effective.

There is certainly a possibility that the issue observed is due to a particular combination of software components.

Many thanks for your effort. If you look at the Stack Overflow topic, it is from the start of the month. I remember that I installed the new driver and tried another CUDA SDK. As I said, I will report the details tomorrow. I am pretty sure that I downloaded the latest driver around the 10th of August, and I tried the 12.2 and 12.4 versions of the CUDA SDK.

The NVIDIA driver package I have installed on my Windows system is certainly not the latest. It reports that it supports CUDA 12.4, while the latest CUDA version is 12.6. I am not a friend of updating installed drivers unless I have a specific need to do so. I have also not installed the latest Windows patches yet; I usually hold off installing the regularly-scheduled Microsoft patches for two to three weeks.

In addition to the versions of NVIDIA software components, the version of Windows may be significant. Within the past two years I have had at least one CUDA-accelerated application that stopped working after a regular Windows update, and I could not get it to work again. It seemed to fail silently deep within the Windows software stack. It is working again now; I tried it again a few weeks ago and it worked.

cudaMallocHost is a fairly thin wrapper around operating system API calls, so it certainly seems possible that something could go wrong at the Windows level that contributes to the issue you are observing.
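
As a quick experiment (just a sketch, not a recommendation for production code), you could take the CUDA allocator partly out of the picture by letting the OS provide the memory and page-locking it afterwards with cudaHostRegister. If that path misbehaves in the same way once the machine is in the “bad state”, the problem is more likely below the CUDA runtime:

#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

int main()
{
    const size_t size = (size_t)512 * 1024 * 1024;   // 512 MiB, adjust as needed
    // let the C runtime / OS provide the memory ...
    void *p = std::malloc(size);
    if (!p) { std::cout << "malloc failed" << std::endl; return 1; }

    // ... then page-lock it so it can be used like cudaMallocHost memory
    cudaError_t err = cudaHostRegister(p, size, cudaHostRegisterDefault);
    std::cout << "cudaHostRegister: " << cudaGetErrorString(err) << std::endl;

    if (err == cudaSuccess)
        cudaHostUnregister(p);
    std::free(p);
    return 0;
}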

Hello,
it looks like the issue does not happen if the system has just been started.
I tried adding a piece of code at the start of the application:

int main()
{
    uint32_t deviceCount;
    int32_t devices[8];
    //MyCheck(cudaGLGetDevices(&deviceCount, devices, 8, cudaGLDeviceListAll));

And the issue happens. As you can see, the CUDA call is commented out; that is because it gives an error (I think because there is no OpenGL context or window). The strange thing is that even after I commented it out, the problem is still present!

The first time I saw this issue I had driver 537.13; now I have 552.86. My laptop has an “NVIDIA RTX A3000 12GB Laptop GPU”.

Regards

Sorry, your descriptions are unclear to me now.

  1. If you first restart your system, then run exactly the code you showed at the beginning of this post (same as the code you showed on your SO post), are you saying in that situation the problem does not occur?

  2. Then could you explain, with complete code samples, what worked and what didn’t after that?

Sorry, I just tried again.
After the system restart, it works as expected… after I have been using the PC for a while, the issue happens. I don’t know what causes it.

IT IS NOT cudaGLGetDevices

The complete code sample is present in the first post.

It is difficult to track down the source of this kind of error. Basically, at some point the system gets into a “bad state” and no longer works as expected after that.

Can we get a more precise definition of “after a while”? If you reboot the system and then execute your test app over and over again with a script, after how many iterations do you observe failure?
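
For example, a small driver program along these lines could run the test repeatedly. The executable name hostmalloc_issue.exe is taken from the output shown earlier in this thread, and the sketch assumes the test program is modified to return a non-zero exit code when MyCheck reports an error:

#include <cstdlib>
#include <iostream>

int main()
{
    for (int i = 1; i <= 1000; ++i)
    {
        std::cout << "=== run " << i << " ===" << std::endl;
        // assumes the test program exits with a non-zero code when MyCheck fails
        int rc = std::system("hostmalloc_issue.exe");
        if (rc != 0)
        {
            std::cout << "failure reported on run " << i << std::endl;
            break;
        }
    }
    return 0;
}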

Since this is a laptop, I note that a common way in which such systems get into a “bad state” with respect to drivers is when they go through sleep / wake cycles. I have been bitten by this enough times on Windows systems that I habitually turn off all power saving features. This is obviously a crude workaround rather than a real solution, and it won’t be feasible if the machine is used in truly mobile fashion where it needs to run off the battery.