Good morning, all.
In the course of my slow but steady progress, I've run into some questions while rewriting serial code to work with CUDA. They are below, and input on any of them is welcome:
What are the fundamental differences between launching with kernel_func<<<1, 1024>>>(), <<<1024, 1>>>, and <<<32, 32>>>, given that all three calls result in 1024 threads? The first two show up in introductory CUDA examples that focus on simplicity and getting started, but eventually we get past that point and need to understand the "why".
When we call kernel_func <<<1, 1>>> (PARAMS), is it the same as if the program ran serially on a regular CPU? That is, 1 block with just 1 thread? I'm asking because I have an array of length N that I need to iterate over to compute its average, then scale the values and save them to another array. I could spread this work across more threads with a shared accumulator accessible to all of them, then weigh that extra communication cost against a single thread doing everything. So it comes down to: does <<<1, 1>>> behave the same as a single CPU doing the work? You don't need to write any code; it's more a conceptual question, so I can fix my stuff accordingly.
In the case of launching a kernel that works on an array much bigger than the total number of threads the hardware allows, we would need to call the kernel many times, passing each run a reference to the position after where the previous run finished. Something like:
```cpp
size_t ARRAY_LENGTH = 787583479651;              // Any huge array
size_t MAX_THREADS  = 1024;                      // Arbitrary value for simplicity, could be more
size_t NUM_RUNS     = ARRAY_LENGTH / MAX_THREADS; // Controls how many times the kernel is called

for (size_t i = 0; i < NUM_RUNS; i++)
    our_kernel<<<1, MAX_THREADS>>>(&array[i * MAX_THREADS], ARRAY_LENGTH);
cudaDeviceSynchronize();
```
Assume array was defined somewhere. Is this the correct approach, or is there a best practice for such a situation?
Thanks a lot to all.