Calling a cuBLAS function from within a kernel

Just wondering what exactly happens when you do this. Does the cuBLAS function execute just using the resources of that thread, or does this kick off a totally separate kernel, as it would be when you call a cuBLAS function from the host?

In the same vein, when each thread in a warp calls the same cuBLAS function (with the same dimensions), do they all execute in lockstep?

It kicks off a separate kernel. For this reason, code that calls cuBLAS from a kernel must be compiled with the same settings you would use for any other dynamic parallelism code, and must also be linked against the cuBLAS device library.

Each thread (anywhere) that encounters a cuBLAS call will launch a new kernel or kernels. These child kernels do not execute in lockstep. More details are in the dynamic parallelism section of the programming guide.
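To make the "each thread launches its own child kernel" point concrete, here is a minimal dynamic parallelism sketch. It does not use cuBLAS at all; `child_scale` and `parent` are hypothetical names standing in for whatever a device-side library call would launch. It assumes a device of compute capability 3.5 or higher and a build along the lines of `nvcc -rdc=true example.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: scales n floats in global memory by alpha.
__global__ void child_scale(float *data, int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

// Every parent thread that reaches the launch starts its own child grid.
__global__ void parent(float *data, int n_per_thread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float *my_chunk = data + t * n_per_thread;  // must point to global memory
    child_scale<<<1, n_per_thread>>>(my_chunk, n_per_thread, 2.0f);
}

int main()
{
    const int threads = 4, n_per_thread = 32, n = threads * n_per_thread;
    float *d = nullptr;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = 1.0f;

    parent<<<1, threads>>>(d, n_per_thread);  // 4 parent threads, 4 child grids
    cudaDeviceSynchronize();                  // waits for parent and all children

    printf("d[0] = %f, d[%d] = %f\n", d[0], n - 1, d[n - 1]);
    cudaFree(d);
    return 0;
}
```

The four child grids are independent launches; nothing here makes them run in step with one another or with the parent warp.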

So if I wanted each thread to execute some low-level linear algebra operation (say a matrix multiply) on shared memory for that block, and I wanted to ensure that each thread in the warp performs these actions in lockstep, would I have any option apart from coding it manually in the kernel?

You would have to code it manually. Furthermore, lockstep execution can only be done at the warp level in current architectures and will become de-emphasized in Volta. You should not write GPU algorithms that rely on lockstep execution if you can avoid it. And you can pretty much always avoid it.
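As a rough illustration of "code it manually", the sketch below has each block multiply its own pair of small matrices held in shared memory. The tile size `N`, the kernel name `block_matmul`, and the data layout (one N×N matrix per block, stored contiguously) are all assumptions made for this example. Note that it synchronizes with `__syncthreads()` rather than relying on lockstep execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 16  // assumed tile size: one N x N matrix per block

// Each block multiplies its own N x N pair: C = A * B, using shared memory.
__global__ void block_matmul(const float *A, const float *B, float *C)
{
    __shared__ float sA[N][N];
    __shared__ float sB[N][N];

    int row = threadIdx.y, col = threadIdx.x;
    size_t base = (size_t)blockIdx.x * N * N;  // start of this block's matrices

    sA[row][col] = A[base + row * N + col];
    sB[row][col] = B[base + row * N + col];
    __syncthreads();  // explicit barrier: all loads done before anyone reads

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += sA[row][k] * sB[k][col];
    C[base + row * N + col] = acc;
}

int main()
{
    const int num = 2, elems = num * N * N;
    float *A, *B, *C;
    cudaMallocManaged(&A, elems * sizeof(float));
    cudaMallocManaged(&B, elems * sizeof(float));
    cudaMallocManaged(&C, elems * sizeof(float));
    for (int i = 0; i < elems; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    block_matmul<<<num, dim3(N, N)>>>(A, B, C);
    cudaDeviceSynchronize();

    // Each entry is the sum of N products 1 * 2.
    printf("C[0] = %f\n", C[0]);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```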

Furthermore, the parent's shared and local memory are “out of scope” for any child kernel, including those launched by cuBLAS. This is also covered in the programming guide. You cannot have a cuBLAS child kernel operate directly on data that is in the shared memory or the local memory of the parent.
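One way to work within that restriction is to stage the shared-memory data through global memory before launching the child. The sketch below assumes a hypothetical `child_sum` kernel and a caller-provided `scratch` buffer in global memory; it is an illustration of the scoping rule, not a recommended design. Build it like any other dynamic parallelism code (for example `nvcc -rdc=true example.cu -lcudadevrt`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 64  // assumed block size and tile length

// Hypothetical child kernel: one thread sums n values from global memory.
__global__ void child_sum(const float *in, int n, float *out)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += in[i];
    *out = s;
}

__global__ void parent(float *scratch, float *result)
{
    __shared__ float tile[TILE];
    tile[threadIdx.x] = (float)threadIdx.x;

    // Passing `tile` to the child would be invalid, so copy it to global memory.
    float *block_scratch = scratch + blockIdx.x * TILE;
    block_scratch[threadIdx.x] = tile[threadIdx.x];
    __syncthreads();  // all of the block's global writes precede the launch

    if (threadIdx.x == 0)
        child_sum<<<1, 1>>>(block_scratch, TILE, result + blockIdx.x);
}

int main()
{
    const int blocks = 2;
    float *scratch, *result;
    cudaMalloc(&scratch, blocks * TILE * sizeof(float));
    cudaMallocManaged(&result, blocks * sizeof(float));

    parent<<<blocks, TILE>>>(scratch, result);
    cudaDeviceSynchronize();

    printf("result[0] = %f, result[1] = %f\n", result[0], result[1]);
    cudaFree(scratch); cudaFree(result);
    return 0;
}
```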

That’s interesting about lockstep being de-emphasized in Volta. I’m working through “Professional CUDA C Programming” right now and it emphasizes avoiding warp divergence. My understanding was that threads within the same warp would execute in lockstep so long as each thread followed the same control path, and that this is the key to avoiding performance losses due to warp divergence.

Perhaps I misunderstood though (I’m still a GPU n00b, lol). Is “writing GPU algorithms that rely on lockstep” different from the goal of ensuring that threads within the same block (or within sub-groups of 32 threads) follow the same control path? And is the latter still a relevant goal under the new architecture?

Yes. The former refers to a technique called “warp-synchronous programming” that (usually for performance reasons) takes advantage of the fact that threads are known to execute in a group of threads called a warp (comprising 32 threads on all GPU architectures currently supported by CUDA). This is an advanced CUDA programming concept, and prone to misapplication by inexperienced CUDA programmers.

The latter concept is “thread divergence”, which is a basic concept in CUDA program execution.

So avoiding thread divergence is not what gives you lockstep execution of threads within a warp?

In that case, how exactly does thread divergence harm performance?

Thread divergence reduces performance because, when divergence occurs, not all threads in the warp do useful work: some are masked off, i.e. currently inactive. The corresponding execution resources are idle, and cannot be used by threads from other warps. Even when some or most threads in a warp are masked off, the threads in the warp still execute in lockstep, because there is only one program counter for the entire warp, not one program counter per individual thread as in a classical CPU.
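A small sketch of the difference, using two hypothetical kernels: in `divergent`, even and odd lanes of the same warp take different branches, so the warp has to execute both paths with part of its lanes masked off each time; in `warp_uniform`, the branch condition is the same for all 32 threads of a warp, so each warp executes only one path. For branches this short the compiler may well use predication instead of a real branch, so treat this as an illustration of the access pattern rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branch depends on the lane: threads of one warp take different paths.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

// Branch depends on the warp index: all threads of a warp take the same path.
__global__ void warp_uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] = x[i] * 2.0f;
    else                                   x[i] = x[i] + 1.0f;
}

int main()
{
    const int n = 256;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    divergent<<<1, n>>>(x);
    warp_uniform<<<1, n>>>(x);
    cudaDeviceSynchronize();

    printf("x[0] = %f, x[1] = %f\n", x[0], x[1]);
    cudaFree(x);
    return 0;
}
```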

The closest thing to a CUDA “thread” on a CPU is a SIMD lane; SIMD lanes are also maskable in recent x86 architecture versions. The difference between CUDA’s SIMT and classical explicit SIMD is that the SIMDness of the hardware, including the masking of the SIMD lanes, is mostly abstracted away, making it implicit SIMD, which is a lot nicer for programmers to deal with because it provides a single-thread view of program execution most of the time.

The lockstep execution of the threads in a warp gives certain desirable guarantees about the behavior of the threads in the warp relative to each other. These guarantees are exploited by warp-synchronous programming techniques, but they are frequently misunderstood by less experienced CUDA programmers: many are weaker than they are perceived to be, and they may be difficult to correlate with high-level language (HLL) code. This then causes unexpected program behavior, which may not be immediately obvious, compounding the problem: the affected code seems to work perfectly in some circumstances but not others.

So just to make sure I understand… warp-synchronous coding is using the knowledge that warps run their threads in lock-step to avoid calling __syncthreads() when it would otherwise be necessary to avoid race conditions?

And this is ill-advised because 1) it’s easy for n00bs to flub, and 2) even when done right it’s not future-proof, as warp behaviour may change under future architectures (i.e. Volta) ?

Eliminating __syncthreads() overhead is one example, correct. As for the questions: basically yes and maybe :-) It is not clear that Volta will actually break currently used warp-synchronous programming. A number of important applications use it, and I cannot imagine NVIDIA will want to break those, even if the underlying hardware implementation changes.
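For reference, a sketch of what a warp-level reduction looks like when the synchronization is stated explicitly instead of being implied by lockstep execution. It assumes CUDA 9 or later, where `__syncwarp()` is available, and a launch with exactly one warp of 32 threads; the kernel name `warp_reduce` is made up for this example. The classic warp-synchronous version is the same loop with the `__syncwarp()` calls removed and `s` declared `volatile`, which is exactly the kind of code that depends on threads of a warp advancing together.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sums 32 values with one warp, using explicit warp-level barriers.
__global__ void warp_reduce(const float *in, float *out)
{
    __shared__ float s[32];
    unsigned lane = threadIdx.x;  // assumes blockDim.x == 32
    s[lane] = in[lane];
    __syncwarp();

    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        // Read the partner's value, then barrier, then write: no lane
        // overwrites a slot that another lane has yet to read.
        float other = (lane < offset) ? s[lane + offset] : 0.0f;
        __syncwarp();
        if (lane < offset) s[lane] += other;
        __syncwarp();
    }
    if (lane == 0) *out = s[0];
}

int main()
{
    float *in, *out;
    cudaMallocManaged(&in, 32 * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < 32; ++i) in[i] = (float)i;

    warp_reduce<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();

    printf("sum = %f\n", *out);  // 0 + 1 + ... + 31 = 496
    cudaFree(in); cudaFree(out);
    return 0;
}
```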

Ok fair enough.

All this was a bit of a tangent from my original question though (although an interesting one). Let me try to ask my original question again.

Let’s say that my problem is naturally parallelized in a way where each thread wants to perform some basic linear algebra function (say a matrix solve or multiply). I can either 1) hand-code it in the kernel, 2) call out to cuBLAS. For simplicity let’s assume there are no race conditions either way.

Is there an intuition for knowing ex-ante which of these would give better parallelism/performance?

Without a lot more specifics regarding your use case, I would suggest trying it both ways and seeing what works best. In many situations ease of programming is more important than squeezing out the last bit of performance, so my commonly recommended strategy (for CPUs and GPUs equally) is to try canned libraries first, and to revisit that decision if and only if performance proves unsatisfactory.
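If you do try it both ways, CUDA events are one simple way to time each variant. The sketch below times a placeholder kernel called `candidate` (a hypothetical stand-in; substitute the hand-coded kernel or the library-based path). The repetition count and problem size are arbitrary choices for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the variant under test (hand-coded or library-based).
__global__ void candidate(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.5f;
}

int main()
{
    const int n = 1 << 20, reps = 100, block = 256;
    const int grid = (n + block - 1) / block;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    candidate<<<grid, block>>>(x, n);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        candidate<<<grid, block>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until the timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average time per launch: %f ms\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

Timing both variants on the real problem sizes, rather than on a toy input like this one, is what makes the comparison meaningful.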