I am writing a massive parallel computing app (parsing through 2^512 large bit field of possible combinations) in the works and where it stands right now, with 1024 threads and 64 blocks on 780Ti, I am pegging ~56% of GPU utilization. I did some testing and the main problem is related with exchanging data between GPU and CPU.
My largest data array right now is 66 x 1024 x 64 bits large (~520kB) and allocated as such
bool * varSequence = NULL;
cudaMallocManaged(&varSequence , varGpuBlockSize * varSequenceSize * sizeof(bool));
where varGpuBlockSize = 1024 * 64 and varSequenceSize = 66
Any attempts to increase the size of this variable past 1024kB end up with application crash and memory overruns as signalled by Nsight tool. Debug message indicates “memory unreadable” - a very helpful note, but without any explanation as to what causes it. Reading through different threads online, fora, documentation, notes. etc., it seems that it is related with the limited memory that can be allocated using cudaMallocManaged command. Again, I tried to look for more data on cudaMallocManaged and how it allocates memory and whether there are any GPU specific limits, but documentation is not there.
So, what is with cudaMallocManaged? How much memory can I allocate in a block and pass to device for execution at a time? The only way to further optimize performance and eliminate CPU - GPU chatter is to put more data into GPU accessible memory at a time and limit pipelining …
Any pointers would be very much appreciated. Thank you in advance !
There aren’t any limits on cudaMallocManaged in the vicinity of 1024kB i.e. 1MB. Depending on your platform, you should be OK with gigabyte-sized allocations.
application crash is not a very useful term. Nor is memory overrun. If by application crash you mean a seg fault, that is happening in host code.
- Are you using proper CUDA error checking, rigorously? (not sure what that is? google it.)
- First, eliminate all errors reported by CUDA error checking, and/or any host code errors (e.g. seg faults). If you are getting a kernel execution error that you don’t understand, proceed to step 3.
- Run your code with cuda-memcheck (the memory checker built into Nsight VSE is similar) and use the output to identify faults in your device code.
cuda-memcheck is a very powerful tool in that it can localize many kinds of device code errors to a single line of your kernel code, following the methodology here:
I understand that “application crash” is not very handy but that is all that I get from Nsight 22.214.171.12440 running under Visual C++ 2017:
CUDA context created : 2b226ba3690
CUDA module loaded: 2b22b920af0 cuModuleLoadDataEx
CUDA grid launch failed: CUcontext: 2964177172112 CUmodule: 2964258425584 Function: _Z22subIntializeDelimitersPbiii
CUDA context created : 2717f3180a0
CUDA module loaded: 271059afc20 kernel.cu
CUDA grid launch failed: CUcontext: 2686488510624 CUmodule: 2684448603168 Function: _Z22subIntializeDelimitersPbiiii
CUDART error: cudaLaunch returned cudaErrorLaunchFailure
CUDART error: cudaDeviceSynchronize returned cudaErrorLaunchFailure
Function subIntializeDelimiters works on the initially allocated memory block using cudaMallocManaged. The crash above comes from 1024 x 64 x 66 x 10 * sizeof(bool) = 5.15625 MB. On second execution, Nsight crashed my system altogether (so much for debugging).
So you have a failing kernel launch. Start with proper CUDA error checking, to make sure there are no errors occurring prior to the kernel launch. After that, proceed with steps 2 and 3 that I outlined on the kernel launch in question.
When cudaLaunch returns such a failure, it looks to me like you have a launch configuration problem. (e.g. more than 1024 threads per block). But that is just a guess without looking at the code. If you have a kernel execution failure (as opposed to a launch failure) then it usually means a problem a kernel code.
The launch was subIntializeDelimiters <<<64,1024>>> (…parameters…) on 780Ti (when it crashed last time). A launch with <<<1,1024>>> similarly crashes the driver and causes OS crash in the process. I will switch to smaller numbers, perhaps <<<1,32>>> but that brings back the question on block and thread count that I should be using for good paralelisation.
Have you disabled or modified the windows WDDM TDR mechanism? Not sure what that is? Google “cuda wddm tdr”
Kernels that run for more than about 2 seconds on windows on a GPU in WDDM driver mode, will run into the WDDM TDR watchdog, and will be terminated (unless you modify the timeout).
I have done no changes to TDR (read about it, but did not touch it in any way, form or fashion). I am running stock drivers, stock Nsight (latest version) and latest CUDA framework with no customizations, mods, etc.
My kernel execution time is sub 1ms so I do not know why I would be hitting any watchdog problems, especially with 2 second watchdog. That is 3 orders of magnitude longer than a single kernel execution.
I did notice that every time I enable Nsight memory debugging, the system stalls, period. I will try to see what I can find out without memory debug and whether the OS stays up. A call <<<1,32>>> just crashed the system.