Confused about global/device calls and starting kernels for solvers

I’m a little confused about how to handle some code I’m trying to port to CUDA. I’ve got some solver code that, in essence, performs a subroutine called DSA for a given number of time steps, with each time step having its own parameters to use. The time steps never communicate with each other, so I believe it would be fairly easy to make this run in parallel. What I’m wondering about, however, is that the subroutine does many calculations with these parameters in nested loops, and also does things like interpolation and integration of functions. Would I need to define these subroutines as device calls, and if so, can they also be made to run in parallel? Can a global call made with some number of threads create further threaded parallel computations? I apologise if this question isn’t clear; I can provide code examples or try to be more specific if that would help.
If it’s in the documentation or a getting-started guide, please point me to it, as I cannot quite find anything.

I think you might need to take a deep read of the CUDA developer docs. The SIMT part of the model means that, roughly speaking, when you issue an instruction, that instruction works on 32 different “threads” at once (which is why they say to always make your block size a multiple of 32).

If you tried to run two different subroutines in the same warp, like:

if (threadIdx.x < 16) {
    routine1();
} else {
    routine2();
}
This would split the warp: the instructions for routine1 would run on 16 threads while the other 16 wait, and then routine2 would run on the other 16 threads while the first 16 wait. This is why the best-practices doc says to avoid branching within a warp.

You mentioned running the device calls in parallel, and I think this is an issue you might run into. There is nothing really special about functions with the __device__ qualifier, other than that you’re telling the compiler this is meant to be device code. If you want to split code out of your __global__ kernel into a separate function, you’ll need to declare it a __device__ function. However, that function runs just as it would if you pasted its body back into the __global__ function, with each thread in the warp running the same instructions.
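To make that concrete, here is a minimal sketch (the names scaleAndShift and applyAll are invented for illustration): the __device__ helper adds no parallelism of its own; each thread simply executes it as part of the kernel.

```cuda
// Sketch: a __device__ function is ordinary per-thread code that the
// compiler treats as part of the kernel; calling it adds no parallelism.
__device__ float scaleAndShift(float x, float a, float b)
{
    return a * x + b;   // runs per-thread, exactly as if written inline
}

__global__ void applyAll(const float *in, float *out, int n, float a, float b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        out[idx] = scaleAndShift(in[idx], a, b);  // same instructions on every thread
}
```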

You can learn how to make a function in CUDA that issues a parallel job by reading about dynamic parallelism, but you’ll probably first want to get more comfortable with what is actually going on when you write and run a CUDA program.
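For reference, dynamic parallelism looks roughly like the sketch below (kernel names are invented): a kernel launches another kernel from device code. It needs a GPU of compute capability 3.5 or later and compilation with relocatable device code (-rdc=true).

```cuda
// Sketch of dynamic parallelism: a parent kernel launches a child grid.
// Requires compute capability >= 3.5 and nvcc -rdc=true.
__global__ void childKernel(float *data, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // A single thread issues the child launch; the child grid then
    // runs with its own blocks and threads.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```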

hope this is somewhat helpful


Thank you Ratzes, I appreciate your reply, but I’m afraid it doesn’t quite answer my question. It does, however, raise some other questions. Since I suspect I’ve just done a poor job of explaining my confusion, I’ll include some example code demonstrating what I mean.

subroutine timeStepCalc(args)
  do i = 1, size(args)
    call regularLoop(args(i))
    call nestedLoop(args(i))
    call integrate(func, args(i))
  end do
end subroutine timeStepCalc

subroutine regularLoop(timeStepArg)
  do i = 1, N
    ! something happens here
  end do
end subroutine regularLoop

subroutine nestedLoop(timeStepArg)
  do i = 1, N
    do j = 1, N
      ! do something here
    end do
  end do
end subroutine nestedLoop

subroutine integrate(func, timeStepArg)
  ! integrate a function over N for a given function and arg
end subroutine integrate

So this is the plain version of the code with no CUDA. Ideally, given 21 time steps, I would like to launch 21 instances of timeStepCalc running in parallel, each of which would in turn use threads to perform its loops in parallel.
Would it be possible, for example, to launch 21 blocks of 64 threads and have each instance of timeStepCalc run on one block, with each loop using those 64 threads? I don’t believe there would be branching (unless I misunderstand what you mean by branching), since each call of timeStepCalc would take place on its own block with no communication between blocks, and none of the loops within timeStepCalc would need to branch either.
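The layout described above, one block per time step with the block’s threads sharing the inner loop, might be sketched like this (the TimeStepArgs struct and results layout are invented for illustration):

```cuda
// Sketch: 21 blocks of 64 threads; block b handles time step b, and its
// 64 threads split the i = 0..N-1 loop in a strided fashion.
struct TimeStepArgs { float p; /* per-time-step parameters */ };

__global__ void timeStepCalc(const TimeStepArgs *args, float *results, int N)
{
    const TimeStepArgs a = args[blockIdx.x];   // this block's time step

    // "regularLoop": thread t takes i = t, t+64, t+128, ...
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        results[blockIdx.x * N + i] = a.p * i; // stand-in for the real work

    __syncthreads();   // barrier before a later phase reads these results
}

// host side:  timeStepCalc<<<21, 64>>>(d_args, d_results, N);
```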

And while not directly related to this question, couldn’t you just get around the branching problem by not branching into routines smaller than 32 threads, so each gets its own warp?

Again, I apologise if this question isn’t clear or seems obvious; I just can’t seem to make sense of it myself, nor can I find any answers I can quite follow.

You are correct: if the entire warp goes into the same branch, you don’t have warp divergence.

If I were in your position, I’d see if the number of time steps is large enough to do each routine as a separate launch. That way you are running the same code for every thread on the GPU, and it’s much easier to manage.

i.e. if you’re doing something like 20,000 time steps, you could do a launch like

regularLoop<<<100, 256>>>(kernelArgs);

where each launch stores its results to a global memory location to share data.
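A minimal sketch of that pattern (kernel names and the work done are invented): two launches on the same stream share data through a global buffer, and the launch boundary itself acts as a device-wide synchronisation point, so the second kernel sees everything the first one wrote.

```cuda
// Sketch: successive launches communicate through global memory.
__global__ void stageOne(float *buf, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        buf[idx] = (float)idx;           // produce intermediate results
}

__global__ void stageTwo(const float *buf, float *out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        out[idx] = buf[idx] * buf[idx];  // consume them in the next pass
}

// host side, e.g. for n = 20000 (79 * 256 = 20224 threads covers it):
//   stageOne<<<79, 256>>>(d_buf, n);
//   stageTwo<<<79, 256>>>(d_buf, d_out, n);
```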

If you really want to do all of this in parallel in one kernel, you’d probably do a launch into a __global__ function like:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < someNumberDivisibleBy32) {
    // device call into regularLoop
} else if (idx < someOtherNumberDivisibleBy32) {
    // device call into nestedLoop
}

and if you want to do the loops in parallel, you’d do something similar in each device call, with different threads handling different values of n in the loop.


Unfortunately I don’t believe I can easily launch each of the sub-operations as its own kernel, since some of them are beyond my current ability to parallelise (looking at you, interpolation). Also, I’m a little confused about why you’re splitting the two loop subroutines. They need to happen one after the other, so wouldn’t it just be

device call to regular
device call to nested
device call to next subroutines…

with the main speedup coming from doing each of the loops (1 through N) in parallel, as well as doing each high-level call to timeStepCalc in parallel. Unless I’m still misunderstanding, and that is what you’re saying.
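The sequential structure described above might be sketched like this (function names follow the Fortran example; argument types and the loop bodies are invented): within one kernel, each phase is a __device__ call whose loop is strided across the block’s threads, with a barrier between phases so one phase finishes before the next reads its results.

```cuda
// Sketch: sequential device calls inside one kernel, each internally parallel.
__device__ void regularLoop(float *work, int N, float arg)
{
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        work[i] = arg * i;              // loop body split across the block
}

__device__ void nestedLoop(float *work, int N, float arg)
{
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        for (int j = 0; j < N; ++j)     // inner loop stays serial per thread
            work[i] += arg * j;
}

__global__ void timeStepCalc(const float *args, float *work, int N)
{
    float arg = args[blockIdx.x];
    float *myWork = work + blockIdx.x * N;  // this block's scratch space

    regularLoop(myWork, N, arg);
    __syncthreads();                        // phase 1 complete for all threads
    nestedLoop(myWork, N, arg);
    // ... further device calls (integrate, etc.) in order
}
```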

Also, thank you for the quick replies, I really cannot thank you enough.