When CUDA kernels use local arrays, and those arrays cannot be turned into register accesses via optimizations, such arrays are usually spilled into local memory (a special access mode for global device memory).
I would like to learn more about how CUDA manages this memory region, and how it can ensure that there is enough of it to service a kernel launch. In fact, I have quite a lot of questions! :-)
When I load and launch a kernel that uses “X” bytes of local memory, “Y” threads, and “Z” blocks, is the allocated memory region proportional to X*Y*Z? Or is it only proportional to X and some maximum number of threads that can be simultaneously active on the device?
Is the local memory region allocated when I load the kernel, or when I launch it? Is it ever freed?
What if there are two kernels and they both use local memory? Can they share the allocation?
When launching 2 kernels that both use local memory on separate streams, does this need additional memory?
Allocating memory via cuMemAlloc tends to be rather slow because it synchronizes the CPU+GPU. Does the allocation of the local memory region similarly synchronize the CPU+GPU?
The context of these questions is that I am thinking of writing a kernel that uses quite a bit of local memory, and I am wondering if/how this may interfere with various other parts of a big system (for example, by consuming memory beyond the kernel launch, or by serializing normally asynchronous operations).
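For concreteness, here is a rough sketch of the kind of kernel I have in mind (hypothetical names; the scratch array is dynamically indexed, so I expect it to end up in local memory rather than registers):

```cpp
// Hypothetical example: a per-thread scratch array that is too large and too
// dynamically indexed to live in registers, so it resides in local memory.
__global__ void scratch_kernel(const float *in, float *out, int n)
{
    // 256 floats * 4 bytes = 1 KiB of local memory per thread ("X" in the
    // question), before any spilling the compiler may add on top.
    float scratch[256];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n)
        return;

    // Dynamic indexing generally prevents promotion to registers.
    for (int i = 0; i < 256; ++i)
        scratch[(i * 37 + tid) % 256] = in[tid] * i;

    float acc = 0.0f;
    for (int i = 0; i < 256; ++i)
        acc += scratch[i];
    out[tid] = acc;
}
```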
When you launch it. It’s freed when the kernel has finished and the memory is needed by another kernel.
No, not shared. (How would that work, anyway?) Local memory requirements could delay a kernel’s execution until resources are available.
A curious question. If you don’t launch two kernels on separate streams, they cannot run at the same time. If two kernels are launched on separate streams, and they both use local memory, and you actually witness concurrency, then it means that there was sufficient local memory for both kernels to run. If not, see other answers here.
Not sure what you mean. cuMemAlloc has a “synchronizing” character in that it waits for device activity to be complete before it proceeds. In many cases, launching a kernel is similar - it waits for the device to be ready to accept the kernel.
All thread-local data resides in (thread) local memory by default. As an optimization, the compiler moves (at its discretion) some of that data into registers. Some thread-local data may remain in local memory, and that is often the case for arrays. No spilling is involved.
Spilling occurs when the compiler intends to move a data object into registers, and initially does so, only to find out later that there are not enough registers available, so it temporarily spills some data back into local memory, then puts it into registers again later. Usually it does that intelligently, e.g. in outer loops only, or with data that is accessed infrequently, so as to minimize the negative performance impact.
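To make that concrete, here is a hedged sketch (not taken from any particular codebase) where artificially capping the register budget with __launch_bounds__ will typically force ptxas to spill; compiling with -Xptxas -v reports any spill stores/loads in its statistics:

```cpp
// Sketch: with an artificially small register budget, ptxas usually has to
// spill some of the live values to local memory. Compile with e.g.
//   nvcc -Xptxas -v -c spill_demo.cu
// and look for "spill stores" / "spill loads" in the ptxas statistics.
// The input array is assumed to be large enough for the strided reads.
__global__ void __launch_bounds__(1024, 2) spill_demo(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Many values live at the same time; with a tight register budget some
    // of them get spilled to local memory and reloaded later.
    float v[32];
    for (int i = 0; i < 32; ++i)
        v[i] = in[tid + i * 1024];

    float acc = 0.0f;
    for (int i = 0; i < 32; ++i)
        acc += v[i] * v[(i + 1) % 32];

    out[tid] = acc;
}
```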
This is actually news to me, so let me follow up on this point to make sure I am not confusing things. Please let me know if there are mistakes in the following:
My assumption was that there are certain operations that need to rewrite the memory map on the GPU: uploading new machine code to the GPU, or allocating a global memory region using the traditional allocation routines (cuMemAlloc). These operations cannot proceed concurrently with other activity on the GPU, so they effectively perform a cuCtxSynchronize(), make the needed changes, and then resume processing. More recently, CUDA added stream-ordered allocation routines that in principle have the ability to circumvent this behavior.
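For example, something along these lines is what I mean by the stream-ordered path (a sketch using the driver API, CUDA 11.2+, error checking omitted):

```cpp
#include <cuda.h>

// Sketch: stream-ordered allocation, which in principle avoids the
// device-wide synchronization of cuMemAlloc.
void allocate_on_stream(CUstream stream, size_t bytes)
{
    CUdeviceptr ptr;
    cuMemAllocAsync(&ptr, bytes, stream);   // ordered against 'stream' only

    // ... enqueue kernels that use 'ptr' on the same stream ...

    cuMemFreeAsync(ptr, stream);            // freed in stream order as well
}
```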
In my application, it’s very useful if the CPU is never stalled waiting for the GPU since it needs to do some calculations to figure out what the next kernel launch should be. So I am generally wary of operations that synchronize, or which have the potential of synchronizing the CPU+GPU.
My understanding of a kernel launch is that there isn’t a reason for it to synchronize. All we are doing is appending an item to a work queue, without expecting a result back right away. (In contrast to cuMemAlloc, which returns a pointer, or cuModuleLoadData, which returns a CUmodule that presumably also wraps a GPU memory address to be determined by the operation.)
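For reference, here is a rough sketch of what I mean (driver API, hypothetical names, error checking omitted): the launch just queues work on a stream and returns without handing back any result:

```cpp
#include <cuda.h>

// Sketch of the "append to a work queue" view of a launch. cuLaunchKernel
// returns as soon as the launch is queued on the stream; the host does not
// receive any result back at this point.
void enqueue_step(CUfunction f, CUstream stream, CUdeviceptr data, int n)
{
    void *params[] = { &data, &n };
    cuLaunchKernel(f,
                   /*grid*/  (n + 255) / 256, 1, 1,
                   /*block*/ 256, 1, 1,
                   /*smem*/  0,
                   stream, params, nullptr);
    // The host is free to compute the next launch while the GPU works.
}
```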
What my question boils down to, at the end of the day, is whether using local memory increases the risk of such stalls, where the CPU has to wait for the GPU because of local memory allocation. Or, in more technical terms: does the use of local memory increase the likelihood of cuLaunchKernel() having to wait for the GPU (e.g. via an internal call to cuCtxSynchronize())?
I just wanted to be crystal clear with the response I gave above, because I can see that my previous one confused you.
On these topics, “synchronization” can be considered in a device-activity scope and/or in a host-thread scope. I usually try to use “synchronizing” when I am referring to the device scope and “blocking” when I refer to the host-thread scope, but that is not a widespread convention, and I probably make mistakes from time to time.
A kernel launch is “almost” always a non-blocking activity, i.e. not synchronizing with respect to the host thread, and this is commonly referred to as “asynchronous kernel launch”, which is a basic CUDA concept. However, a kernel does not necessarily begin executing right away: it may require other device activity to complete before it can begin, perhaps because of stream semantics, or perhaps because of occupancy considerations. In that respect, a kernel launch may have a “synchronizing” type of behavior when looking at device scope/activity. Local memory usage could indeed affect when a kernel actually begins executing. But the CPU thread at the launch point will “almost” always experience non-blocking behavior, i.e. an “asynchronous launch”.
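As a rough sketch of the distinction (hypothetical kernel handle, a kernel taking no parameters, error checking omitted): the launch call itself returns almost immediately, while cuCtxSynchronize() is where the host thread actually blocks:

```cpp
#include <cuda.h>
#include <chrono>
#include <cstdio>

// Sketch: time the launch call (non-blocking for the host thread) separately
// from the synchronization (where the host thread blocks).
void show_nonblocking_launch(CUfunction f, CUstream stream)
{
    auto t0 = std::chrono::steady_clock::now();
    cuLaunchKernel(f, 1, 1, 1, 1, 1, 1, 0, stream, nullptr, nullptr);
    auto t1 = std::chrono::steady_clock::now();

    cuCtxSynchronize();                       // host blocks here, not above
    auto t2 = std::chrono::steady_clock::now();

    std::printf("launch call: %lld us, sync: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}
```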
In current NVIDIA GPU architectures/drivers, the local memory allocation is context-wide and is calculated as MAX_THREADS_PER_SM x SMs x LOCAL_MEM_PER_THREAD. The allocation is shared by all grid launches in the context. If a grid launch requires a larger allocation than the current local memory allocation, then the LMEM allocation needs to be resized. This can result in a synchronization.
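For illustration, a hedged sketch of that calculation with the driver API (the 1 KiB per-thread figure is just an assumed example; the real value is whatever the kernel was compiled to use):

```cpp
#include <cuda.h>
#include <cstdio>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    int sms = 0, max_threads_per_sm = 0;
    cuDeviceGetAttribute(&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
    cuDeviceGetAttribute(&max_threads_per_sm,
                         CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);

    // Assumed example figure: 1 KiB of local memory per thread.
    size_t local_mem_per_thread = 1024;

    // MAX_THREADS_PER_SM x SMs x LOCAL_MEM_PER_THREAD
    size_t context_lmem =
        (size_t)max_threads_per_sm * (size_t)sms * local_mem_per_thread;
    std::printf("approx. context-wide local memory: %zu MiB\n",
                context_lmem >> 20);
    return 0;
}
```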