I want to calculate the average of each column of an array of 2048x2048 short elements. Can I improve this by using CUDA? If yes, what is the best way? The problem is not reading the data efficiently from global memory, but how to store the results without memory conflicts and without losing performance.
For a simple operation like averaging, the CPU can do it in about the time it takes just to ship the data out to the GPU. The question, therefore, is whether you will be sending the data over anyway. If the GPU needs the data anyhow, then doing the averaging with CUDA is a clear win. Otherwise, you’ll almost certainly spend more time transferring data from the CPU to the GPU and back than just computing on the CPU.
Yes, I know. I want to perform further operations on the GPU with the same data and the average. But how do I calculate it? Is there any way to use shared memory to improve the calculation?
To calculate the average of all the elements, just run a sum reduction over the entire array (see the reduction sample in the SDK, or use Thrust), and then divide by the number of elements.
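For the whole-array case, a minimal sketch with Thrust might look like the following; the helper name and the assumption that the data already sits in a device_vector are mine, only the thrust::reduce call itself is the point:

```
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Hypothetical helper: reduce the whole array on the device, then divide.
float overallAverage(const thrust::device_vector<short>& d_data)
{
    // Accumulate in 64-bit so the sum of 2048*2048 shorts cannot overflow.
    long long sum = thrust::reduce(d_data.begin(), d_data.end(),
                                   0LL, thrust::plus<long long>());
    return static_cast<float>(sum) / static_cast<float>(d_data.size());
}
```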
Dedicate 1 thread per column and hence you have 2048 threads working for you.
Arranging the array in row-major order (the usual C order) will make sure that the memory accesses are coalesced: neighbouring threads in a warp read neighbouring columns of the same row, i.e. adjacent addresses.
Each thread needs to maintain a per-thread local variable that will hold the sum as the thread iterates over the column.
Once the sum is calculated, just divide it by 2048 to get the average…
Now have a shared memory array of size blockDim.x. Each thread can write its average into shared memory…
Thus you have the averages of blockDim.x columns in shared memory…
Choose your blockDim.x so that your further reduction operation makes the best use of shared memory (take CUDA occupancy into consideration)… A sketch of such a kernel follows below.
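Here is a minimal sketch of the per-column kernel described above, assuming row-major data of size width x height in global memory; the kernel and variable names are illustrative, and the shared-memory staging is only useful if you actually do a further per-block reduction:

```
// Hypothetical kernel: one thread per column; the row-major layout means
// neighbouring threads in a warp load neighbouring addresses (coalesced).
__global__ void columnAverage(const short* data, float* colAvg,
                              int width, int height)
{
    extern __shared__ float sAvg[];              // blockDim.x floats

    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread local sum over the column; int is large enough here,
    // since 2048 * 32767 fits comfortably in 32 bits.
    int sum = 0;
    if (col < width)
        for (int row = 0; row < height; ++row)
            sum += data[row * width + col];      // coalesced across the warp

    // Stage the per-column average in shared memory so the block can run
    // whatever further reduction it needs on these values.
    sAvg[threadIdx.x] = (col < width) ? (float)sum / (float)height : 0.0f;
    __syncthreads();

    // Here we simply write the column averages back to global memory.
    if (col < width)
        colAvg[col] = sAvg[threadIdx.x];
}
```

A launch like columnAverage<<<2048 / 256, 256, 256 * sizeof(float)>>>(d_data, d_colAvg, 2048, 2048) gives you the 2048 threads, one per column.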
Let’s assume that AFTER that, I want to calculate the average of each ROW; what is the fastest way to do that? I think the fastest way is to reorganize the device’s global memory so that the accesses are coalesced again. This means I have to allocate a second area of memory on the device and then transform the rows into columns and vice versa. Or is there any trick to do it faster?
Yes. Directly in CUDA you have no more efficient way than keeping an alternate copy. You could create the alternate copy with another GPU kernel (a transpose kernel).
Since global memory contents persist across kernel launches, you can do this easily; a sketch of such a transpose kernel follows below.
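A minimal sketch of such a transpose kernel, along the lines of the SDK transpose sample: a shared-memory tile keeps both the reads and the writes coalesced, and the +1 padding avoids bank conflicts. TILE_DIM, the names and the short element type are my assumptions.

```
#define TILE_DIM 16   // launch with dim3 block(TILE_DIM, TILE_DIM)

// Hypothetical tiled transpose of a width x height matrix into a
// height x width matrix; both the load and the store are coalesced.
__global__ void transposeKernel(const short* in, short* out,
                                int width, int height)
{
    __shared__ short tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write the tile back transposed: block (bx, by) lands at (by, bx).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

After that kernel, the row averages of the original array are just the column averages of the transposed copy, so the same per-column kernel can be reused.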
The trick is to see it from a higher level.
For example, det(A) = det(A^T) { I think so… det stands for determinant }
So, if your problem can be solved for A^T instead of A, and if the final result can easily be transformed back to suit A, then you can go for that…