CUDA Fortran beginner questions

Hi, I hope you can forgive the basic nature of these questions. I’m rather new to GPU coding and I’ve tried to read many of the online resources available, but unfortunately I’m still a tad lost about the very basic things. Specifically, I’ve got two questions that I can’t seem to wrap my head around.

I’m still a tad confused about the difference between a (global) subroutine and a (device) subroutine. It seems to me that both of them run on the device, the only difference being that one is a kernel and one isn’t. My assumption is that a (global) subroutine might call (device) subroutines, but I’m not quite sure.

My second question is really three related questions about the chevron syntax. I understand it’s used to specify thread blocks and threads, and that for something like <<<x,y>>>, x is the number of blocks and y is the number of threads, but I’m not sure how it works in practice with a subroutine.

Specifically, I have two subroutines, both of which move through arrays: one handles 1D arrays and one handles 2D arrays. If I want to split up the loops to speed them up, and the subroutines are
handleOneDim(array, multiplier) (one do loop through the array)
handleTwoDim(array, multiplier) (nested do loops through the array)
Do I need to do something within the subroutines to specify that the loops should be split across threads and blocks on the GPU, or is putting the <<<x,y>>> on the call sufficient to tell the compiler to handle it? The final part of this question, and this may be a tad odd: how would I find an ideal number of blocks and threads to use? I’m running this code on a cluster, so the hardware available is dependent on my needs, but let’s say I’m just using one GPU; how would I know how many threads and how many blocks to use?

Again, I apologise for the basic nature of these questions; I just can’t quite wrap my head around it from the documentation alone.

On the first, you are mostly right about globals calling device subroutines. Think about it this way: the compiler treats a global subroutine as something that is called from the CPU and runs on the GPU, whereas device subroutines are both called from the GPU and run on the GPU. Because the CPU is (usually) the one that kicks off GPU work, globals do tend to be the ones calling device subroutines.
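A minimal sketch of the call chain (the module and subroutine names here are just made up for illustration):

module kernels
    use cudafor
contains
    ! Device subroutine: callable only from code already running on the GPU
    attributes(device) subroutine scaleElement(x, multiplier)
        real, intent(inout) :: x
        real, intent(in) :: multiplier
        x = x * multiplier
    end subroutine scaleElement

    ! Global subroutine (kernel): launched from the CPU with chevrons,
    ! runs on the GPU, and may call device subroutines
    attributes(global) subroutine scaleArray(array, multiplier)
        real, intent(inout) :: array(:)
        real, value :: multiplier
        integer :: i
        i = threadIdx%x + (blockIdx%x - 1) * blockDim%x   ! indices are 1-based
        if (i <= size(array)) call scaleElement(array(i), multiplier)
    end subroutine scaleArray
end module kernels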

Now, there is a slight exception to this. CUDA 5.0 introduced “dynamic parallelism”, which allows a kernel to be launched from within another kernel. To understand how this is different, think about the threads running your code: when a global subroutine calls a device subroutine, you keep the same number of threads, but when a global subroutine launches another global subroutine, new threads are created to do the work. Hence the adjective “dynamic”, since you can increase or decrease the amount of processing power working on part of the problem.
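A rough, untested sketch of what that looks like, reusing the scaleArray kernel from above (device-side launches need relocatable device code, e.g. something like -gpu=rdc with nvfortran, and a GPU of compute capability 3.5 or newer):

! Parent kernel: a single thread issues the child launch
attributes(global) subroutine parent(array, multiplier)
    real, intent(inout) :: array(:)
    real, value :: multiplier
    if (threadIdx%x == 1 .and. blockIdx%x == 1) then
        ! Launch sizes here are placeholders
        call scaleArray<<<4, 256>>>(array, multiplier)
    end if
end subroutine parent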

As for your second question, you are right about the chevrons. They go after the subroutine’s name and before the argument list. Think about your global subroutine like this: every piece of code inside it is run by each thread. So, to control how the loop is split over threads and blocks, you use the built-in variables threadIdx%x, blockIdx%x, and the like.
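For your handleOneDim subroutine, the host-side call would look something like this (d_array is assumed to be an array with the device attribute, and the 64 blocks of 256 threads are placeholders):

! Chevrons sit between the subroutine name and its argument list
call handleOneDim<<<64, 256>>>(d_array, multiplier)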

As for picking the balance between threads and blocks, here is a study on it. If you had an infinite number of registers on the device (the things that hold variables about to be operated on by the GPU), it would make sense to always use 32 threads per block, since that matches the warp size. However, in order to fully occupy the device, you may need 64, 128, or more threads per block; it depends on the register usage of your code.

https://www.researchgate.net/publication/299869714_Meta-programming_and_auto-tuning_in_the_search_for_high_performance_GPU_code
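In practice, a common starting point is a multiple of the warp size, say 128 or 256 threads per block, with enough blocks to cover the array, then timing a few variants. A hedged sketch of picking a configuration (the problem size is made up):

program launchConfig
    use cudafor
    implicit none
    integer :: istat, n, threads, blocks
    type(cudaDeviceProp) :: prop

    n = 1000000                            ! hypothetical problem size
    istat = cudaGetDeviceProperties(prop, 0)

    ! Start with a multiple of the warp size (32) and tune by timing
    threads = 256
    blocks  = (n + threads - 1) / threads  ! enough blocks to cover n

    print *, 'SMs:', prop%multiProcessorCount
    print *, 'max threads per block:', prop%maxThreadsPerBlock
    print *, 'launching', blocks, 'blocks of', threads, 'threads'
end program launchConfig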

Lastly, as a former Fortran user: we’d all be better off if the language were left behind. Personal opinion, but there is a reason you’re having a hard time finding documentation and examples for it. I have found it simply inferior to C++ in every fathomable way.

Thanks, that does answer most of my questions, although I’m still a little confused about the chevron syntax and changing the subroutines.
If subroutine handleOneDim has a do loop that just loops through the array and multiplies everything by the multiplier, for example, would I need to put anything in it besides
do i = 1, size(array)
    array(i) = array(i) * multiplier
end do

Or will the compiler automatically split the loop up amongst the threads and blocks?

The compiler doesn’t do that automatically. To do what you’re talking about, your kernel would look something more like:

! Grid-stride loop: each thread starts at its own global index
! (threadIdx and blockIdx are 1-based in CUDA Fortran), then
! strides by the total number of threads in the grid
i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
do while (i <= size(array))
    array(i) = array(i) * multiplier
    i = i + blockDim%x * gridDim%x
end do

There isn’t a whole lot the compiler will do for you in CUDA. What you write tends to be not far from what actually gets run on the GPU, although the compiler will optimize or transform many expressions to obtain desirable behavior. If you want something that automatically handles parallelization like that, you should look into OpenACC.
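For what it’s worth, the same pattern extends to your 2D subroutine via a 2D launch configuration with the dim3 type from the cudafor module. A sketch under the assumption that handleTwoDim takes an assumed-shape 2D device array (the 16x16 tile size is just a starting point):

! Kernel: one thread per element, indexed in two dimensions
attributes(global) subroutine handleTwoDim(array, multiplier)
    real, intent(inout) :: array(:,:)
    real, value :: multiplier
    integer :: i, j
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    j = threadIdx%y + (blockIdx%y - 1) * blockDim%y
    if (i <= size(array,1) .and. j <= size(array,2)) then
        array(i,j) = array(i,j) * multiplier
    end if
end subroutine handleTwoDim

! Host side: declare the configuration, then launch
type(dim3) :: grid, tBlock
tBlock = dim3(16, 16, 1)
grid   = dim3((size(d_array,1) + 15) / 16, (size(d_array,2) + 15) / 16, 1)
call handleTwoDim<<<grid, tBlock>>>(d_array, multiplier)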