Calling a cuBLAS function from within a kernel

Just wondering what exactly happens when you do this. Does the cuBLAS function execute just using the resources of that thread, or does this kick off a totally separate kernel, as it would be when you call a cuBLAS function from the host?

In the same vein, when each thread in a warp calls the same cuBLAS function (with the same dimensions), do they all execute in lockstep?

It kicks off a separate kernel. For this reason, a cuBLAS call from within a kernel must be compiled with the same settings you would use for any other CUDA dynamic parallelism code, and linked against the cuBLAS device library.

Each thread (anywhere) that encounters a cuBLAS call will launch a new kernel or kernels. These child kernels do not execute in lockstep. More details are in the programming guide's dynamic parallelism section:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
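For illustration, here is a minimal sketch of what a device-side cuBLAS call might look like. This assumes a toolkit that still ships the cuBLAS device library (it was removed in later CUDA releases); the build flags and data layout are assumptions for the example:

```cuda
// Sketch only: per-thread child GEMM launches via the cuBLAS device API.
// Assumed build (toolkits that ship libcublas_device):
//   nvcc -arch=sm_35 -rdc=true demo.cu -lcublas_device -lcudadevrt
#include <cublas_v2.h>

__global__ void per_thread_gemm(const double *A, const double *B,
                                double *C, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread creates its own handle and launches a child GEMM kernel.
    // The child kernels run asynchronously; they are NOT in lockstep with
    // the parent threads or with each other.
    cublasHandle_t h;
    cublasCreate(&h);

    const double one = 1.0, zero = 0.0;
    // Assumed layout: thread tid owns the tid-th n-by-n matrix in each array.
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, A + tid * n * n, n, B + tid * n * n, n,
                &zero, C + tid * n * n, n);

    cublasDestroy(h);
}
```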

So if I wanted each thread to execute some low-level linear algebra operation (say a matrix multiply) on that block's shared memory, and I wanted to ensure that each thread in the warp performs these actions in lockstep, would I have any option apart from coding it manually in the kernel?

You would have to code it manually. Furthermore, lockstep execution is only guaranteed at the warp level on current architectures, and it will become de-emphasized in Volta. You should not write GPU algorithms that rely on lockstep execution if you can avoid it, and you can pretty much always avoid it.
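As a sketch of the recommended alternative: rather than relying on implicit lockstep, make the intra-warp data exchange explicit with the `*_sync` warp primitives introduced in CUDA 9. This hypothetical warp reduction does not assume any particular execution schedule:

```cuda
// Sketch: a warp-level sum reduction using explicit warp primitives
// instead of implicit lockstep. Assumes blockDim.x is a multiple of 32.
__global__ void warp_reduce(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    float v = in[tid];
    // __shfl_down_sync explicitly synchronizes the named lanes (full mask
    // here), so correctness does not depend on lockstep scheduling.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (lane == 0)
        out[tid >> 5] = v;   // one partial sum per warp
}
```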

Furthermore, parent shared and local memory are “out of scope” for any child kernel, including child kernels launched by cuBLAS. This is also covered in the programming guide. You cannot have a cuBLAS child kernel operate directly on data that is in the parent's shared memory or local memory.
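A sketch of that constraint, with hypothetical kernel names: to give a child kernel access to data held in the parent's shared memory, the parent has to stage it through global memory first.

```cuda
// Sketch: staging shared-memory data into global memory before a child
// launch. Requires -rdc=true (dynamic parallelism). Kernel names are
// hypothetical.
__global__ void child(double *p)
{
    // ...operates on global memory, which IS valid in the child...
}

__global__ void parent(double *g_work)
{
    __shared__ double s_tile[256];
    // ...fill s_tile...

    // NOT allowed: child<<<1, 256>>>(s_tile);
    // Parent shared memory is out of scope in the child (undefined behavior).

    // Allowed: copy the tile to global memory, then pass that instead.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        g_work[i] = s_tile[i];
    __syncthreads();   // make sure the staging copy is complete

    if (threadIdx.x == 0)
        child<<<1, 256>>>(g_work);
}
```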

That’s interesting about lockstep being de-emphasized in Volta. I’m working through “Professional CUDA C Programming” right now and it emphasizes avoiding warp divergence. My understanding was that threads within the same warp would execute in lockstep so long as each thread followed the same control path, and that this is the key to avoiding performance losses due to warp divergence.

Perhaps I misunderstood though (I’m still a gpu n00b, lol). Is “writing GPU algorithms that rely on lockstep” different from the goal of ensuring that threads within the same block (or within sub-groups of 32 threads) follow the same control path? And is the latter still a relevant goal under the new architecture?

Yes. The former refers to a technique called “warp-synchronous programming” that (usually for performance reasons) takes advantage of the fact that threads are known to execute in a group of threads called a warp (comprising 32 threads on all GPU architectures currently supported by CUDA). This is an advanced CUDA programming concept, and prone to misapplication by inexperienced CUDA programmers.

The latter concept is “thread divergence”, which is a basic concept in CUDA program execution.

So avoiding thread divergence is not what gives you lockstep execution of threads within a warp?

In that case, how exactly does thread divergence harm performance?

Thread divergence reduces performance because when divergence occurs, its effect is that not all threads in the warp do useful work: some are masked off, i.e. currently inactive. The corresponding execution resources are idle, and cannot be used by threads from other warps. Even so, even when some or most threads in a warp are masked off, the threads in the warp execute in lockstep because there is only one program counter for the entire warp, not one program counter per individual thread as in a classical CPU.
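To make the masking concrete, here is a hypothetical kernel where a single warp splits on a branch. The two halves of the branch execute one after the other, with the non-participating lanes masked off in each phase:

```cuda
// Illustration of intra-warp divergence: within one warp, this branch
// splits the lanes into two masked groups that execute serially.
__global__ void divergent(int *out)
{
    int lane = threadIdx.x & 31;

    if (lane < 16)
        out[threadIdx.x] = lane * 2;    // lanes 16-31 are masked off here
    else
        out[threadIdx.x] = lane + 100;  // lanes 0-15 are masked off here
    // Roughly half the warp's execution resources idle in each phase.
}
```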

The closest analog to a CUDA “thread” on a CPU is a SIMD lane, which is also maskable in recent x86 architecture versions. The difference between CUDA’s SIMT model and classical explicit SIMD is that the SIMD nature of the hardware, including the masking of the SIMD lanes, is mostly abstracted away, making it implicit SIMD. This is a lot nicer for programmers to deal with because it provides a single-thread view of program execution most of the time.

The lockstep execution of the threads in a warp gives certain desirable guarantees about the behavior of the threads in the warp relative to each other. These guarantees are being exploited by warp-synchronous programming techniques, but they are frequently misunderstood by less experienced CUDA programmers: many are weaker than they are perceived to be, and they may be difficult to correlate with HLL code. This then causes unexpected program behavior, which may not be immediately obvious, compounding the problem: the affected code seems to work perfectly in some circumstances but not others.

So just to make sure I understand… warp-synchronous coding is using the knowledge that warps run their threads in lockstep to avoid calling __syncthreads() when it would otherwise be necessary to avoid race conditions?

And this is ill-advised because 1) it’s easy for n00bs to flub, and 2) even when done right it’s not future-proof, as warp behaviour may change under future architectures (i.e. Volta) ?

Eliminating __syncthreads() overhead is one example, correct. As for the questions: basically yes and maybe :-) It is not clear that Volta will actually break currently used warp-synchronous programming. A number of important applications use it, and I cannot imagine NVIDIA will want to break those, even if the underlying hardware implementation changes.
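For concreteness, here is the classic warp-synchronous idiom being discussed, as a sketch: the final stage of a shared-memory reduction drops __syncthreads() entirely, relying on the warp's lockstep execution for correctness.

```cuda
// Legacy warp-synchronous pattern (sketch): the last 32-element stage of a
// block reduction omits __syncthreads(), relying on implicit warp lockstep.
// 'volatile' forces each update to go through shared memory rather than
// being cached in a register. Caller must ensure tid < 32 and that sdata
// holds at least 64 valid elements.
__device__ void warp_reduce_legacy(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```

This is exactly the kind of code whose guarantees are weaker than they appear: it is correct only while the hardware schedules all 32 lanes together, which is why it is considered fragile going forward.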

Ok fair enough.

All this was a bit of a tangent from my original question though (although an interesting one). Let me try to ask my original question again.

Let’s say that my problem is naturally parallelized in a way where each thread wants to perform some basic linear algebra function (say a matrix solve or multiply). I can either 1) hand-code it in the kernel, 2) call out to cuBLAS. For simplicity let’s assume there are no race conditions either way.

Is there an intuition for knowing ex-ante which of these would give better parallelism/performance?

Without a lot more specifics regarding your use case, I would suggest trying it both ways and seeing what works best. In many situations ease of programming is more important than squeezing out the last bit of performance, so my commonly recommended strategy (for CPUs and GPUs equally) is to try canned libraries first, and revisit that decision if and only if performance proves unsatisfactory.
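As one concrete “canned library first” route for the many-small-matrix-multiplies case described above, cuBLAS offers batched GEMM callable from the host; a hedged sketch, with the pointer arrays assumed to be set up by the caller:

```cuda
// Host-side sketch: many small independent GEMMs via batched cuBLAS.
// A, B, C are device arrays of device pointers, one n-by-n matrix each;
// setting those up is assumed done by the caller.
#include <cublas_v2.h>

void multiply_batch(cublasHandle_t h,
                    const double *const *A, const double *const *B,
                    double *const *C, int n, int batch)
{
    const double one = 1.0, zero = 0.0;
    // C[i] = A[i] * B[i] for i in [0, batch), all in one library call.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &one, A, n, B, n, &zero, C, n, batch);
}
```

Whether this beats a hand-coded per-thread multiply depends on the matrix sizes and batch count, which is why trying both and measuring is the honest answer.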