Hi, I hope you can forgive the basic nature of these questions. I’m rather new to GPU coding, and although I’ve tried to read many of the online resources available, I’m unfortunately still a tad lost on the very basics. Specifically, I have two questions that I can’t seem to wrap my head around.
First, I’m still a tad confused about the difference between an attributes(global) subroutine and an attributes(device) subroutine. It seems to me that both of them run on the device, the only difference being that one is a kernel and one isn’t. My assumption is that a global subroutine can call device subroutines, but I’m not quite sure.
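To make it concrete, here is a minimal sketch of what I think the difference is (the names and bodies are made up by me, so please correct me if the structure itself is wrong):

```fortran
module my_kernels
contains
  ! device: callable only from code already running on the GPU, not a kernel
  attributes(device) real function square(x)
    real, value :: x
    square = x * x
  end function square

  ! global: an actual kernel, launched from the host with <<<...>>>
  attributes(global) subroutine squareAll(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = square(a(i))   ! the kernel calling a device function
  end subroutine squareAll
end module my_kernels
```

Is that roughly the right mental model, i.e. the global one is the entry point and the device one is just a helper it can call?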
My second question is really three related questions about the chevron syntax. I understand it’s used to specify thread blocks and threads, and that in something like <<<x,y>>> x is the number of blocks and y is the number of threads per block, but I’m unclear on how it works in practice with a subroutine.
Specifically, I have two subroutines, both of which move through arrays: one handles 1D arrays and one handles 2D arrays. If I want to split up the loops to speed them up, and the subroutines are
handleOneDim(array, multiplier) (one do loop through the array)
handleTwoDim(array, multiplier) (nested do loops through the array)
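Roughly, the two subroutines look like this (I’ve simplified the bodies for illustration; the real ones just loop over the elements in the same way):

```fortran
subroutine handleOneDim(array, multiplier)
  real :: array(:)
  real :: multiplier
  integer :: i
  ! single do loop over every element
  do i = 1, size(array)
     array(i) = array(i) * multiplier
  end do
end subroutine handleOneDim

subroutine handleTwoDim(array, multiplier)
  real :: array(:,:)
  real :: multiplier
  integer :: i, j
  ! nested do loops over both dimensions
  do j = 1, size(array, 2)
     do i = 1, size(array, 1)
        array(i,j) = array(i,j) * multiplier
     end do
  end do
end subroutine handleTwoDim
```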
Do I need to do something within the subroutine itself to specify that the loop iterations should be distributed across threads and blocks on the GPU, or is putting <<<x,y>>> on the call sufficient to tell the compiler to handle it? The final part of this question, and this may be a tad odd: how would I find an ideal number of blocks and threads to use? I’m running this code on a cluster, so the hardware available is dependent on my needs, but let’s say I’m just using one GPU. How would I know how many threads and how many blocks to use?
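For reference, this is my current (possibly wrong) guess at what the 1D launch would look like, assuming handleOneDim has been made a kernel and array_d is a device copy of the array:

```fortran
integer :: n, tpb, blocks
n = size(array_d)
tpb = 256                      ! threads per block -- just a guess on my part
blocks = (n + tpb - 1) / tpb   ! enough blocks to cover all n elements
call handleOneDim<<<blocks, tpb>>>(array_d, multiplier)
```

Is picking the threads-per-block value basically trial and error, or is there a principled way to choose it for a given GPU?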
Again, I apologise for the basic nature of these questions; I just can’t quite wrap my head around this from looking at the documentation alone.