Unified Memory Allocation Alignment on Windows

Hello all,

I’m working on a memory-intensive application on Windows and am hoping to leverage Unified Memory to allow over-subscription. The application runs out of memory well before my back-of-the-envelope calculations suggest it should.

The problem seems to arise because each allocation consumes its own memory page. This talk suggested that Managed/Unified Memory alignment should be 512 bytes on modern GPUs.

Talk

A minimal reproducible example is provided in the code block below.

TEST(Managed, MemorySize) {
    size_t origfree, origtotal;
    size_t postfree, posttotal;
    checkCuda(cudaMemGetInfo(&origfree, &origtotal));
    void *ptr;
    size_t len = 254;
    checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
    checkCuda(cudaMemGetInfo(&postfree, &posttotal));
    EXPECT_LT(origfree - postfree, 4 * 1024); // assertion fails: LHS == 1048576 (Kepler) or 2097152 (sm 6.1 & sm 7.5)
    origfree = postfree;
    origtotal = posttotal;
    // intentional leak
    checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
    checkCuda(cudaMemGetInfo(&postfree, &posttotal));
    EXPECT_LT(origfree - postfree, 4 * 1024); // assertion passes
    for (long i = 0; i < 100; i++) {
        origfree = postfree;
        origtotal = posttotal;
        checkCuda(cudaMallocManaged(&ptr, len, cudaMemAttachHost));
        checkCuda(cudaMemGetInfo(&postfree, &posttotal));
        // The following assertion fails on specific iterations:
        //   i == 14, 30, 46, 62, 78, 94 with LHS == 1048576 on Kepler
        //   i == 30, 62, 94 with LHS == 2097152 on sm 6.1 and sm 7.5
        EXPECT_LT(origfree - postfree, 4 * 1024); // assertion fails
        if (origfree - postfree > 4 * 1024) printf("Iteration %ld failed\n", i);
    }
}

My interpretation is that each call to cudaMallocManaged consumes a full memory page, and that the allocator reserves a 1 MB or 2 MB arena at a time, which seems to be latitude afforded to the implementation.

One odd observation: on a machine containing two NVIDIA GPUs, the allocations do not consume GPU memory and the test passes. I suspect a heuristic is responsible for the difference.

Could someone please confirm whether one page per allocation request is the expected behavior on Windows? Also, is it possible to control this heuristic deterministically via an API call?

For clarity, the output of the test (using Google Test) is below.

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
Selected device with sm=7.5
[----------] 1 test from Managed
[ RUN      ] Managed.MemorySize
C:\Users\Michael\app\test_Managed.cpp(19): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 30 failed
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 62 failed
C:\Users\Michael\app\test_Managed.cpp(31): error: Expected: (origfree-postfree) < (4*1024), actual: 2097152 vs 4096
Iteration 94 failed
[  FAILED  ] Managed.MemorySize (3851 ms)
[----------] 1 test from Managed (3852 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3890 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Managed.MemorySize
 1 FAILED TEST