Fundamental differences between ways of spreading the load

Good morning, all.
In the course of my slow but steady progress, I came across some questions while rewriting some serial code to work with CUDA.
They are below, and any input on any of these is welcome:

  • What are the fundamental differences between launching with kernel_func<<<1, 1024>>>(), <<<1024, 1>>> and <<<32, 32>>>, given that all 3 calls result in 1024 threads? The first 2 calls can be found in CUDA introductory examples that are more focused on simplicity and getting started, but eventually we get past this point and need to understand the “why”.

  • When we call kernel_func <<<1, 1>>> (PARAMS), is it the same as if the program ran serially on a regular CPU? That is, 1 block with just 1 thread? The reason I’m asking is that I have an array of length N that I need to iterate over, calculate its average, and then scale the values and save them to another array. I could spread this work across more threads with a shared accumulator accessible to all threads, and then weigh that extra communication cost against a single thread doing everything. So it comes down to the question: does <<<1, 1>>> behave the same as a single CPU doing the work? You don’t need to write any code; it is more a conceptual question so I can fix my stuff accordingly.

  • In the case of launching a kernel that will work on an array much bigger than the total number of threads the hardware allows, we would need to call the kernel many times, passing to the next run a reference to the position where the previous run finished.
    For instance:

size_t ARRAY_LENGTH = 787583479651;	// Any huge array
size_t MAX_THREADS = 1024;			// Arbitrary value for simplicity, could be more
size_t NUM_RUNS = ARRAY_LENGTH / MAX_THREADS;	// Controls how many times the kernel is called

for(size_t i = 0; i < NUM_RUNS; i++)
	our_kernel <<<1, MAX_THREADS>>> (&array[i * MAX_THREADS], ARRAY_LENGTH);
cudaDeviceSynchronize();

Assume array was defined somewhere. Is it the correct approach or is there a best practice for such a situation?

Thanks a lot to all.

It is hard to decide which of the two alternatives is worse, <<<1, 1024>>> or <<<1024, 1>>>.

Even <<<32, 32>>> is far from optimal, as it only launches one warp per thread block.

Some guidelines:

a) Launch enough thread blocks to occupy the number of streaming multiprocessors on your GPU. That’s 20 for a GTX 1080 if I remember correctly. It’s better if you launch a multiple of that, as each multiprocessor may execute several blocks simultaneously.

b) Launch enough threads per block to keep enough warps in flight to hide memory access latencies. A warp is a group of 32 threads. All threads of one warp execute the same instruction at the same time - there is just one program counter (PC) per warp. (A notable exception is the Volta architecture, which maintains a program counter per thread.)
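
For reference, both numbers (the multiprocessor count and the per-block thread limit) can be read from the device properties at runtime; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}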

<<<1, 1024>>>: you launch one block of 1024 threads. One multiprocessor on your GPU does work, the rest sit idle.

<<<1024, 1>>>: you launch 1024 blocks of 1 thread each. Only 1 ALU out of ca. 192 (Kepler) or 128 (Maxwell, Pascal) per multiprocessor gets to do any work. If you’re lucky you get 4 blocks executed per multiprocessor at the same time, so 4 ALUs get to work - the rest sit idle.

<<<32, 32>>>: that’s one warp in flight per block on 32 blocks. Not too bad, but each time that warp has to wait for a memory operation, there is likely nothing else to execute, and code execution on that multiprocessor must stall - the hardware sits idle during stalls.

There are multiple instruction decoders and groups of ALUs per multiprocessor, and you want enough warps in flight per multiprocessor to keep most of them busy. This is called maximizing occupancy.

<<<200, 256>>>: now that’s a decent launch grid. Now find a problem that you can use that many threads on ;)
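
In practice, a grid like that usually falls out of the problem size rather than being hand-picked. A rough sketch, assuming N elements, a placeholder kernel my_kernel where one thread handles one element, and a device pointer d_array:

int N = 51200;                        // example problem size
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up: 200 blocks here
my_kernel<<<numBlocks, threadsPerBlock>>>(d_array, N);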

The CUDA programming guide talks about all this. This is required reading if you want to get the most out of your hardware.

Furthermore, this Stack Overflow thread discusses dimensioning a launch grid:
https://stackoverflow.com/questions/9985912/how-do-i-choose-grid-and-block-dimensions-for-cuda-kernels

If the number of elements is greater than the number of available threads, you can use a block-stride or grid-stride loop.

__global__
void kernel(int* array, int N){
    // Grid-stride loop: the whole grid sweeps the array until every element has been visited.
    for(int index = threadIdx.x + blockIdx.x * blockDim.x; index < N; index += blockDim.x * gridDim.x){
       // each thread processes one element; if there are elements left, each thread processes another element, and so on.
    }
}

You have to make sure that you use an appropriate data type for the index, and that the multiplications do not overflow.
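
For very large arrays, that means doing the index arithmetic in a 64-bit type. A sketch of the same loop along those lines (kernel_large is just a placeholder name):

__global__
void kernel_large(int* array, size_t N){
    // Promote to 64-bit before multiplying so blockIdx.x * blockDim.x cannot overflow an int.
    size_t start  = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for(size_t index = start; index < N; index += stride){
        // process array[index]
    }
}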

Striker159, thanks for this piece of code. Is the graphical representation of this indexing the one from that Mark Harris article, “An Even Easier Introduction to CUDA”, if I’m not mistaken?
Also, this is the notation for a 1D array, correct? It is the next step I need to get familiar with: spreading the load inside the kernel function instead of calling the function many times, like in my sample code.

Cbuchner1, thanks for your feedback and explanations too. Following the SO link you provided (which I had stumbled upon a few times already), I also came across an occupancy tips article by Mark Harris. I was ready to start pulling hardware limits from the cudaDeviceProp struct (via cudaGetDeviceProperties), but there seem to be launch-configuration helpers that find good starting points for these parameters given the compute capability of the hardware.
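
In case it helps anyone else reading this later: the helper I was referring to seems to be cudaOccupancyMaxPotentialBlockSize. A rough usage sketch with the grid-stride kernel above (N and d_array are placeholders for the problem size and device pointer):

int minGridSize = 0;   // smallest grid size that can still reach full occupancy
int blockSize   = 0;   // suggested threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, 0, 0);

// Size the grid from the actual problem using the suggested block size.
int gridSize = (N + blockSize - 1) / blockSize;
kernel<<<gridSize, blockSize>>>(d_array, N);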