Fundamental differences between ways of spreading the load

Good morning, all.
In the course of my slow but steady progress, I came across some questions while rewriting some serial code to work with CUDA.
They are below, and any input on any of these is welcome:

  • What are the fundamental differences between launching with kernel_func<<<1, 1024>>>(), <<<1024, 1>>> and <<<32, 32>>>, given that all 3 calls result in 1024 threads? The first 2 calls can be found in CUDA introductory examples that are more focused on simplicity and getting started, but eventually we get past this point and need to understand the “why”.

  • When we call kernel_func <<<1, 1>>> (PARAMS), is it the same as if the program ran serially on a regular CPU? That is, 1 block with just 1 thread? The reason I’m asking is that I have an array of length N that I need to iterate over, calculate its average, and then scale the values and save them to another array. I could spread this work across more threads with a shared accumulator accessible to all threads, and then weigh that extra communication cost against a single thread doing everything. So it comes down to the question: does <<<1, 1>>> behave the same as a single CPU doing the work? You don’t need to write any code; it is more a conceptual question so I can fix my stuff accordingly.

  • In the case of launching a kernel that will work on an array much bigger than the total number of threads the hardware allows, we would need to call the kernel many times, passing to the next run a reference to the position where the previous run finished.
    For instance:

size_t ARRAY_LENGTH = 787583479651;	// Any huge array
size_t MAX_THREADS = 1024;			// Arbitrary value for simplicity, could be more
size_t NUM_RUNS = ARRAY_LENGTH / MAX_THREADS;	// Controls how many times the kernel is called

for(size_t i = 0; i < NUM_RUNS; i++)
	our_kernel <<<1, MAX_THREADS>>> (&array[i * MAX_THREADS], ARRAY_LENGTH);
cudaDeviceSynchronize();

Assume array was defined somewhere. Is it the correct approach or is there a best practice for such a situation?

Thanks a lot to all.

It is hard to decide which of the two alternatives is worse, <<<1, 1024>>> or <<<1024, 1>>>.

Even <<<32, 32>>> is far from optimal, as it only launches one warp per thread block.

Some guidelines:

a) Launch enough thread blocks to occupy the number of streaming multiprocessors on your GPU. That’s 20 for a GTX 1080 if I remember correctly. It’s better if you launch a multiple of that, as each multiprocessor may execute several blocks simultaneously.

b) Launch enough threads per block to keep enough warps in flight to hide memory access latencies. A warp is a group of 32 threads. All threads of one warp execute the same instruction at the same time - there is just one program counter (PC) per warp. (A notable exception is the Volta architecture, which maintains a program counter per thread.)
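
For reference, both numbers (the multiprocessor count and the per-block thread limit) can be read from the device properties at runtime; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}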

<<<1, 1024>>>: you launch one block of 1024 threads. One multiprocessor on your GPU does work, the rest sit idle.

<<<1024, 1>>>: you launch 1024 blocks of 1 thread each. Only 1 ALU out of ca. 192 (Kepler) or 128 (Maxwell, Pascal) per multiprocessor gets to do any work. If you’re lucky you get 4 blocks executed per multiprocessor at the same time, so 4 ALUs get to work - the rest sit idle.

<<<32, 32>>>: that’s one warp in flight per block on 32 blocks. Not too bad, but each time that warp has to wait for a memory operation, there is likely nothing else to execute, and code execution on that multiprocessor must stall - the hardware sits idle during stalls.

There are multiple instruction decoders and groups of ALUs per multiprocessor, and you want enough warps in flight per multiprocessor to keep most of them busy. This is called maximizing occupancy.

<<<200, 256>>>: now that’s a decent launch grid. Now find a problem that you can use that many threads on ;)
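
In practice, a grid like that usually falls out of the problem size rather than being hand-picked. A rough sketch, assuming N elements, a placeholder kernel my_kernel where one thread handles one element, and a device pointer d_array:

int N = 51200;                        // example problem size
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up: 200 blocks here
my_kernel<<<numBlocks, threadsPerBlock>>>(d_array, N);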

The CUDA programming guide talks about all this. This is required reading if you want to get the most out of your hardware.

Furthermore, this Stack Overflow thread discusses dimensioning a launch grid:
https://stackoverflow.com/questions/9985912/how-do-i-choose-grid-and-block-dimensions-for-cuda-kernels

If the number of elements is greater than the number of available threads, you can use a block-stride or grid-stride loop.

__global__
void kernel(int* array, int N){
    // Grid-stride loop: the whole grid sweeps the array until every element has been visited.
    for(int index = threadIdx.x + blockIdx.x * blockDim.x; index < N; index += blockDim.x * gridDim.x){
       // each thread processes one element; if there are elements left, each thread processes another element, and so on.
    }
}

You have to make sure that you use an appropriate data type for the index, and that the multiplications do not overflow.
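
For very large arrays, that means doing the index arithmetic in a 64-bit type. A sketch of the same loop along those lines (kernel_large is just a placeholder name):

__global__
void kernel_large(int* array, size_t N){
    // Promote to 64-bit before multiplying so blockIdx.x * blockDim.x cannot overflow an int.
    size_t start  = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for(size_t index = start; index < N; index += stride){
        // process array[index]
    }
}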

Striker159, thanks for this piece of code. Is the graphical representation of this indexing the one from that Mark Harris article, “An Even Easier Introduction to CUDA”, if I’m not mistaken?
Also, this is the notation for a 1D array, correct? It is the next step I need to get familiar with: spreading the load inside the kernel function instead of calling the function many times, like in my sample code.

Cbuchner1, thanks for your feedback and explanations too. Following the SO link you provided (which I had stumbled upon a few times already), I also came across an occupancy tips article by Mark Harris. I was ready to start pulling hardware limits from the cudaDeviceProp struct (via cudaGetDeviceProperties), but there seem to be launch-configuration helpers that find good starting points for these parameters given the compute capability of the hardware.
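
In case it helps anyone else reading this later: the helper I was referring to seems to be cudaOccupancyMaxPotentialBlockSize. A rough usage sketch with the grid-stride kernel above (N and d_array are placeholders for the problem size and device pointer):

int minGridSize = 0;   // smallest grid size that can still reach full occupancy
int blockSize   = 0;   // suggested threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, 0, 0);

// Size the grid from the actual problem using the suggested block size.
int gridSize = (N + blockSize - 1) / blockSize;
kernel<<<gridSize, blockSize>>>(d_array, N);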