I will be writing a program in Python/Numba for CUDA, and my high-level algorithm includes several functions that should be parallelizable.
To try to summarize the problem: I have around 5,000 2D arrays created on the CPU which contain float data. The arrays are independent and all hold different values, and the (x, y) dimensions of each array are (7, 240).
I would like the device to apply the same arithmetic function (kernel) to each row of a given array, do this for each of the 5,000 arrays, and write the result to a new output array.
The elements used by the arithmetic function can be any elements of the given array; they are not necessarily all in the same row.
And this is where I’m stuck…
From what I understand so far, one CUDA thread is responsible for only one element in an array? How can I manage the threads so they work on several elements of the array? Do you have any specific documentation on this you can point me to?

PS: I'm not sure the description of my problem is very clear; sorry if it isn't, I've just started learning CUDA…
Thank you for your help.
A CUDA thread can work on any number of array elements the programmer chooses. That is, the mapping of data to threads is completely open and up to the programmer. It is, however, fairly common in CUDA code to assign one thread to each output element, and have that thread pull as many input elements as are necessary to produce that output element. But programmers are not limited to such an arrangement, it just so happens that this is often practical.
Is there a particular reason to have "arrays created on CPU"? That sounds like the code would be shuffling a lot of data between host (CPU) and device (GPU), which is best avoided. Why not create these arrays on the GPU?
Thank you for your quick answer; things are already clearer now that you explain one thread can work on as many input elements as required.
Regarding the reason to have “arrays created on CPU” my description of the problem wasn’t sufficiently detailed.
The 5,000 arrays are created on the CPU, but those are the input arrays containing the data I wish to process in parallel on the GPU (the original data is stored in .csv files, one file per array, so there are 5,000 .csv files). I then allocate GPU memory of the size of all 5,000 arrays, transfer the data from CPU to GPU, and run the CUDA kernel(s). Once the kernel(s) complete their work, the data (a single output array this time) is transferred back to the CPU. Of course, you are right that data transfer between host and device should be minimized, as the CUDA recommendations say. But maybe you mean that I should build a single array on the CPU containing all 5,000 input arrays and only then transfer it to the GPU, instead of transferring the 5,000 arrays individually (the volume of data would be the same in both cases).
Thanks again for your help on understanding how threads are assigned to input elements.
"you explain that one thread can work on as many input elements as required."
I guess I wasn’t clear enough. Let me try again: A CUDA thread can work on as many input elements and as many output elements as the programmer desires. There are no limitations. Above I just mentioned a common arrangement that many programmers choose to use because it is often advantageous: N input elements per thread, 1 output element per thread.
Thank you for the clarification.