Maximum size of memory block in cudaMallocManaged()

mbchtr97 · November 28, 2017, 12:20am

Dear all,

I am writing a massive parallel computing app (parsing through 2^512 large bit field of possible combinations) in the works and where it stands right now, with 1024 threads and 64 blocks on 780Ti, I am pegging ~56% of GPU utilization. I did some testing and the main problem is related with exchanging data between GPU and CPU.

My largest data array right now is 66 x 1024 x 64 bits large (~520kB) and allocated as such

bool * varSequence = NULL;
cudaMallocManaged(&varSequence , varGpuBlockSize * varSequenceSize * sizeof(bool));

where varGpuBlockSize = 1024 * 64 and varSequenceSize = 66

Any attempts to increase the size of this variable past 1024kB end up with application crash and memory overruns as signalled by Nsight tool. Debug message indicates “memory unreadable” - a very helpful note, but without any explanation as to what causes it. Reading through different threads online, fora, documentation, notes. etc., it seems that it is related with the limited memory that can be allocated using cudaMallocManaged command. Again, I tried to look for more data on cudaMallocManaged and how it allocates memory and whether there are any GPU specific limits, but documentation is not there.

So, what is with cudaMallocManaged? How much memory can I allocate in a block and pass to device for execution at a time? The only way to further optimize performance and eliminate CPU - GPU chatter is to put more data into GPU accessible memory at a time and limit pipelining …

Any pointers would be very much appreciated. Thank you in advance !

Robert_Crovella · November 28, 2017, 12:36am

There aren’t any limits on cudaMallocManaged in the vicinity of 1024kB i.e. 1MB. Depending on your platform, you should be OK with gigabyte-sized allocations.

application crash is not a very useful term. Nor is memory overrun. If by application crash you mean a seg fault, that is happening in host code.

Are you using proper CUDA error checking, rigorously? (not sure what that is? google it.)
First, eliminate all errors reported by CUDA error checking, and/or any host code errors (e.g. seg faults). If you are getting a kernel execution error that you don’t understand, proceed to step 3.
Run your code with cuda-memcheck (the memory checker built into Nsight VSE is similar) and use the output to identify faults in your device code.

cuda-memcheck is a very powerful tool in that it can localize many kinds of device code errors to a single line of your kernel code, following the methodology here:

[url]cuda - Unspecified launch failure on Memcpy - Stack Overflow

mbchtr97 · November 28, 2017, 1:00am

I understand that “application crash” is not very handy but that is all that I get from Nsight 5.4.0.17240 running under Visual C++ 2017:

CUDA context created : 2b226ba3690
CUDA module loaded: 2b22b920af0 cuModuleLoadDataEx
CUDA grid launch failed: CUcontext: 2964177172112 CUmodule: 2964258425584 Function: _Z22subIntializeDelimitersPbiii
CUDA context created : 2717f3180a0
CUDA module loaded: 271059afc20 kernel.cu
CUDA grid launch failed: CUcontext: 2686488510624 CUmodule: 2684448603168 Function: _Z22subIntializeDelimitersPbiiii
CUDART error: cudaLaunch returned cudaErrorLaunchFailure

CUDART error: cudaDeviceSynchronize returned cudaErrorLaunchFailure

Function subIntializeDelimiters works on the initially allocated memory block using cudaMallocManaged. The crash above comes from 1024 x 64 x 66 x 10 * sizeof(bool) = 5.15625 MB. On second execution, Nsight crashed my system altogether (so much for debugging).

Robert_Crovella · November 28, 2017, 1:05am

So you have a failing kernel launch. Start with proper CUDA error checking, to make sure there are no errors occurring prior to the kernel launch. After that, proceed with steps 2 and 3 that I outlined on the kernel launch in question.

When cudaLaunch returns such a failure, it looks to me like you have a launch configuration problem. (e.g. more than 1024 threads per block). But that is just a guess without looking at the code. If you have a kernel execution failure (as opposed to a launch failure) then it usually means a problem a kernel code.

mbchtr97 · November 28, 2017, 1:08am

The launch was subIntializeDelimiters <<<64,1024>>> (…parameters…) on 780Ti (when it crashed last time). A launch with <<<1,1024>>> similarly crashes the driver and causes OS crash in the process. I will switch to smaller numbers, perhaps <<<1,32>>> but that brings back the question on block and thread count that I should be using for good paralelisation.

Robert_Crovella · November 28, 2017, 1:12am

Have you disabled or modified the windows WDDM TDR mechanism? Not sure what that is? Google “cuda wddm tdr”

Kernels that run for more than about 2 seconds on windows on a GPU in WDDM driver mode, will run into the WDDM TDR watchdog, and will be terminated (unless you modify the timeout).

mbchtr97 · November 28, 2017, 1:17am

I have done no changes to TDR (read about it, but did not touch it in any way, form or fashion). I am running stock drivers, stock Nsight (latest version) and latest CUDA framework with no customizations, mods, etc.

My kernel execution time is sub 1ms so I do not know why I would be hitting any watchdog problems, especially with 2 second watchdog. That is 3 orders of magnitude longer than a single kernel execution.

I did notice that every time I enable Nsight memory debugging, the system stalls, period. I will try to see what I can find out without memory debug and whether the OS stays up. A call <<<1,32>>> just crashed the system.

mbchtr97 · November 28, 2017, 11:04pm

I will continue in one thread since it seems to converge anyway: https://devtalk.nvidia.com/default/topic/1026825/cuda-programming-and-performance/how-to-choose-how-many-threads-blocks-to-have-/post/5222968/.

Thank you txbob for all the help & patience so far.

Topic		Replies	Views
Conceptual questions on memory and blocks limit CUDA Programming and Performance	2	765	May 29, 2018
cudaMallocManaged allocating more memory than requested CUDA Programming and Performance	7	3431	July 13, 2018
Struct allocated with cudaMallocManaged fails up from a certain size CUDA Programming and Performance	4	107	February 4, 2025
Allocating Class members with cudaMallocManaged CUDA Programming and Performance	4	1340	January 11, 2018
Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Telsa P100 CUDA Programming and Performance	23	3382	June 25, 2017
Maximum memory allocation size CUDA Programming and Performance	7	17061	January 24, 2012
Question about GPU Memory Overhead with Cudamallocmanaged CUDA Programming and Performance	7	1166	August 21, 2024
cudaMallocManaged suceeds, but memory access fails, for size greater than hardware memory CUDA Programming and Performance	2	1208	April 26, 2017
Why does access cudaMallocManaged memory throw exception? CUDA Programming and Performance	2	116	June 17, 2025
About cudaMallocManaged() granularity CUDA Programming and Performance	2	533	April 14, 2023

Maximum size of memory block in cudaMallocManaged()

Related topics