Unified Memory Allocation Alignment on Windows

Hello all,

I’m working on a memory-intensive application on Windows and am hoping to leverage Unified Memory to allow over-subscription. The application runs out of memory well before my back-of-the-envelope calculations suggest it should.

The problem seems to arise because each allocation consumes its own memory page. This talk suggested that Managed/Unified Memory alignment should be 512 bytes on modern GPUs.

Talk

A minimal reproducible example is provided in the code block below.

TEST(Managed, MemorySize) {
    size_t origfree, origtotal;
    size_t postfree, posttotal;
    checkCuda(cudaMemGetInfo(&origfree, &origtotal));
    void *ptr;
    size_t len = 254;
    checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
    checkCuda(cudaMemGetInfo(&postfree, &posttotal));
    EXPECT_LT(origfree - postfree, 4 * 1024); // assertion fails: LHS == 1048576 (Kepler) or 2097152 (sm 6.1 & sm 7.5)
    origfree = postfree;
    origtotal = posttotal;
    // intentional leak
    checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
    checkCuda(cudaMemGetInfo(&postfree, &posttotal));
    EXPECT_LT(origfree - postfree, 4 * 1024); // assertion passes
    for (long i = 0; i < 100; i++) {
        origfree = postfree;
        origtotal = posttotal;
        checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
        checkCuda(cudaMemGetInfo(&postfree, &posttotal));
        // The following assertion fails on specific iterations:
        //   i == 14, 30, 46, 62, 78, 94 with LHS == 1048576 on Kepler
        //   i == 30, 62, 94 with LHS == 2097152 on sm 6.1 and sm 7.5
        EXPECT_LT(origfree - postfree, 4 * 1024); // assertion fails
        if (origfree - postfree > 4 * 1024) printf("Iteration %ld failed\n", i);
    }
}

My interpretation is that each call to cudaMallocManaged consumes a full memory page, and that the allocator reserves a 1 MB or 2 MB arena at a time, which seems to be latitude afforded to the implementation.

One odd observation: on a machine containing two NVIDIA GPUs, the allocations do not consume GPU memory and the test passes. I suspect a heuristic is responsible for the difference.

Could someone please confirm whether one page per allocation request is the expected behavior on Windows? Also, is it possible to control this heuristic deterministically via an API call?

For clarity, the output of the test (using Google Test) is below.

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
Selected device with sm=7.5
[----------] 1 test from Managed
[ RUN      ] Managed.MemorySize
C:\Users\Michael\app\test_Managed.cpp(19): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 30 failed
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 62 failed
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 94 failed
[  FAILED  ] Managed.MemorySize (3851 ms)
[----------] 1 test from Managed (3852 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3890 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Managed.MemorySize
 1 FAILED TEST