cudaMalloc attempting to allocate more memory than it should

This is a function of mine that sorts a large FEM matrix in COO format. It works for the most part, but fails with very large datasets (currently around 10 GB). I am working on an A30. As you can see, in step 3 I print how much memory it wants for the sort buffer.

For the dataset in question it wants to allocate just over 10 GB, while the A30 still has over 13 GB of memory free. It crashes with an out-of-memory error as soon as it tries to allocate in step 4. Before it does this, however, it completely fills my 128 GB of system RAM. Why does it report that it only wants 10 GB, but then consume more than double that across VRAM and system RAM? Is there something obviously wrong with my code? Keep in mind that the crash is in step 4, in the cudaMalloc call, where I tell it exactly how much to allocate: the 10 GB.

void sortCOOMatrix(cusparseHandle_t cusparseHandle, int rows, int cols, int nnz, int* d_cooRow, int* d_cooCol, real* d_cooVal) {
    // Step 1: Create arrays for sorted values and permutation, permutation is used for sorting value array after row-column sort
    int* d_permutation;
    real* d_values_sorted;
    void* d_buffer;
    size_t bufferSize;

    // Allocate memory for sorted values and permutation array
    checkCudaError(cudaMalloc((void**)&d_values_sorted, nnz * sizeof(real)), "Failed to allocate memory for sorted values");
    checkCudaError(cudaMalloc((void**)&d_permutation, nnz * sizeof(int)), "Failed to allocate memory for permutation array");

    // Step 2: Create descriptors for sparse and dense vectors
    cusparseSpVecDescr_t vec_permutation;
    cusparseDnVecDescr_t vec_values;

    // Create a sparse vector descriptor for permutation and a dense vector for the original values
    checkCusparseError(cusparseCreateSpVec(&vec_permutation, nnz, nnz, d_permutation, d_values_sorted, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_PRECISION), "Failed to create sparse vector descriptor for permutation");
    checkCusparseError(cusparseCreateDnVec(&vec_values, nnz, d_cooVal, CUDA_PRECISION), "Failed to create dense vector descriptor for values");

    // Step 3: Determine buffer size required for sorting
    checkCusparseError(cusparseXcoosort_bufferSizeExt(cusparseHandle, rows, cols, nnz, d_cooRow, d_cooCol, &bufferSize), "Failed to determine buffer size for COO sorting");
    printf("\nSort buffer requires %zu MB of memory\n", bufferSize / (1024*1024));

    // Step 4: Allocate the buffer memory needed for sorting
    checkCudaError(cudaMalloc(&d_buffer, bufferSize), "Failed to allocate buffer for sorting");

    // Step 5: Create an identity permutation array to retain initial order
    checkCusparseError(cusparseCreateIdentityPermutation(cusparseHandle, nnz, d_permutation), "Failed to create identity permutation");

    // Step 6: Sort the COO matrix by rows (and by columns within rows)
    checkCusparseError(cusparseXcoosortByRow(cusparseHandle, rows, cols, nnz, d_cooRow, d_cooCol, d_permutation, d_buffer), "Failed to sort COO matrix by rows");

    // Step 7: Apply the sorted permutation to the values array to reorder them according to sorted row and column indices
    checkCusparseError(cusparseGather(cusparseHandle, vec_values, vec_permutation), "Failed to gather sorted values");

    // Step 8: Copy sorted values back to the original values array
    checkCudaError(cudaMemcpy(d_cooVal, d_values_sorted, nnz * sizeof(real), cudaMemcpyDeviceToDevice), "Failed to copy sorted values back to the original array");

    // Step 9: Clean up allocated resources
    checkCusparseError(cusparseDestroyDnVec(vec_values), "Failed to destroy dense vector descriptor");
    checkCusparseError(cusparseDestroySpVec(vec_permutation), "Failed to destroy sparse vector descriptor");
    checkCudaError(cudaFree(d_values_sorted), "Failed to free sorted values memory");
    checkCudaError(cudaFree(d_buffer), "Failed to free buffer memory");
    checkCudaError(cudaFree(d_permutation), "Failed to free permutation array memory");
}

What operating system are you using?

Can you install more system RAM?

Windows 11.

I’m not sure if that’s a joke, but if it isn’t, it’s not really something I want to do. I am telling it to allocate 10 GB, and it tries to allocate far more than that. That’s my real issue.

Sorry, I also had a typo. The A30 server has 128 GB of RAM.

The GPU is operating with the TCC driver, correct?

Yes, that is correct. However, I just tested this at home on my 3070 Ti. While the dataset exceeds the memory of my personal GPU, the program still fills my system RAM, which is far more than the 10 GB it is supposed to allocate. So the issue occurs with both TCC and WDDM drivers.

I am confused. To the best of my knowledge, only the WDDM driver is supported on the 3070 Ti, at which point CUDA is at the mercy of the operating system’s allocator. The TCC driver should have its own allocator, but I have no specific knowledge of its internal operation.

Can you show the actual output of your program when running on the A30? From what I understand, it prints the expected amount in step 3 and then fails in step 4? Is your application the only one using the A30, or could other applications also be using it, reducing the amount of memory available?

Sorry, let me try to be clear. The code produces the same behavior both on the 3070 Ti running the WDDM driver and on the A30 with the TCC driver.

I currently don’t have access to the A30, as I am now at home, but I can get an image of the issue tomorrow.

Your interpretation of the behavior is correct. In step 3 it reports that it wants to allocate just over 10 GB, a size stored in the bufferSize variable. This variable is passed to cudaMalloc in step 4 to allocate that size at the address stored in d_buffer. This then fills up the entirety of the VRAM, begins to fill system RAM, and eventually crashes with the “out of memory” error. The specific error it prints is:

“Error: Failed to allocate buffer for sorting: out of memory” (the first part of the string comes from my own CUDA error-check function).

Okay; otherwise 24 GB of VRAM vs. 32 GB of Windows system RAM could have been on the low side in comparison. 128 GB should definitely be fine.

Based solely on my experience:

The behavior described is somewhat consistent with the operating-system allocator used by the WDDM driver, which (1) does not let users grab more than a certain percentage of total GPU memory in a single allocation and (2) allocates backing store in system memory so that it can swap pages between GPU memory and system memory at its discretion.

The behavior described is not consistent at all with the behavior of the memory allocator of the TCC driver.

I don’t have encyclopedic knowledge of driver behavior under Windows. Bugs are always possible, both on the Windows side and in the CUDA software stack. If all the latest updates are installed on both sides and the issue is still reproducible with a minimal reproducer using the TCC driver, consider filing a bug.

Can you allocate 5 GB twice at step 4?

How do you measure free memory?

Is this the beginning of the program, i.e. the memory is not fragmented (not sure, whether that can be a problem)?

Is this the only GPU in the system?
See also

It could be that the pinned buffer needed for unpinned memory copying also depends on the maximum allocation size?

Is it possible to work with smaller allocations? The CUDA function in step 6 requires d_buffer as a single buffer. It’s a function from NVIDIA, so I can’t change how it works internally.

I measure memory with this function: cudaMemGetInfo(&freeMemStart, &totalMem)

It is very near the beginning. Program begin → read matrix to host → save matrix to device → sort matrix (this is where it crashes)

It is the only GPU in the system.

Maybe I am misunderstanding the post you linked, but it seems to be talking about host memory. I don’t want to use host memory if possible; it only uses host memory because it is allocating more than it is supposed to. That said, can you explain the concept of pinned and unpinned memory to me? I’m unfamiliar with those terms.

When using the 3070 Ti, which has 8 GB of memory, I have no issue working with a dataset of 4.5 GB, which is obviously more than half the available VRAM. Isn’t it then unlikely that the issue has to do with an allocation limit (I read something about the allocation limit being 4 GB or half the VRAM capacity)? If that is the issue, how are data scientists able to work with large datasets? Or is that not a problem because TCC drivers have no such limitation?

Can you tell me what a minimal reproducer would be in this case? This is pretty close to the beginning of my program, which has the structure: program begin → read matrix to host → copy matrix to device → sort matrix (this is where it crashes).

I can’t really think of anything more minimal than reading the large dataset, copying it to the GPU, and immediately trying to sort.

You wrote that the memory allocation in step 4 failed. To debug that issue, it would be interesting to see whether two allocations of 5 GB each would work at that step. Of course, that would not be a workaround for the following steps.

But the actual error does happen in step 4?

Pinned memory

To copy system memory to the GPU, the buffer on the system side has to have a fixed memory address and must not be swapped out by the operating system.

  • One way is to allocate pinned host memory from Windows (the operating system then guarantees that the memory is not swapped out but stays physically resident).
  • The other way is that for each cudaMemcpy from host to device, the system memory is first copied to an internal (driver-owned) pinned buffer and from there to the GPU.

That is the reason why only memory that was pinned from the start can be copied asynchronously; otherwise different streams would conflict in their use of the internal buffer.

Possible reasons for the error

You observed that allocating memory with CUDA filled the system’s main memory.

One possibility is that if you allocate a rather large chunk of memory (10 GB), the internal pinned buffer needs to be the same size (instead of using a smaller buffer and copying large chunks one after the other).

The linked post mentions that since Windows 10 it has become more difficult to pin a reasonable amount of memory.

So perhaps you have to use Windows 7 or earlier ;-)

To the best of my knowledge, almost all HPC computing uses Linux systems. Windows systems using the TCC driver should work as an alternative; I am surprised you are running into issues with the TCC driver.

The WDDM driver architecture was designed for consumer-grade systems with the primary goal of giving the operating system maximum control over the GPU, in order to maximize system stability. With the previous XP driver model, issues in graphics drivers could crash or otherwise seriously impact Windows, and from what I have heard, Microsoft received thousands of the resulting automated crash reports per day.

Thanks for the explanation. I’ll try allocating multiple smaller memory chunks. That should also let me find the point at which cudaMalloc fails.

Another test could be to allocate pinned memory with cudaMallocHost at step 4 and see whether it fails with similar symptoms (system memory filling up before the failure).

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b

One actual workaround could be to allocate (nearly) the whole device memory at startup (if that succeeds at all) and do the memory management yourself.

If you are lucky, the reason for the error is that a large (e.g. 8 GB) pinned buffer was needed at the beginning of the program, and your allocation requested a larger (e.g. 10 GB) pinned buffer; only if that request succeeded would the 8 GB buffer have been freed.
It could also be that the system memory has fragmentation issues, if the pinned memory has to be physically contiguous?

Also consider the following Microsoft-related steps.

Try turning off fast startup to keep memory cleaner:
https://www.tenforums.com/tutorials/4189-turn-off-fast-startup-windows-10-a.html

And deactivate (shut down) the Hyper-V service, if it is active.