I want to calculate the average of each column of an array of 2048x2048 short elements. Can I improve this by using CUDA? If yes, what is the best way? The problem is not reading the data efficiently from global memory, but how to store the results without memory conflicts and without losing performance.
For a simple operation like averaging, the CPU can do it in about the time it takes just to ship the data out to the GPU. The question, therefore, is whether you will be sending the data over anyway. If the GPU needs the data anyhow, then doing the averaging with CUDA is a clear win. Otherwise, you’ll almost certainly spend more time transferring data from the CPU to the GPU and back than just computing on the CPU.
Yes, I know. I want to perform further operations on the GPU with the same data and the average. But how do I calculate it? Is there any way to use shared memory to improve the calculation?
To calculate the average of all the elements, just run a sum reduction over the entire array (see the reduction sample in the SDK, or use Thrust), and then divide by the number of elements.
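For the whole-array case, a minimal sketch with Thrust might look like the following; the helper name and the assumption that the data already sits in a device_vector are mine, only the thrust::reduce call itself is the point:

```
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Hypothetical helper: reduce the whole array on the device, then divide.
float overallAverage(const thrust::device_vector<short>& d_data)
{
    // Accumulate in 64-bit so the sum of 2048*2048 shorts cannot overflow.
    long long sum = thrust::reduce(d_data.begin(), d_data.end(),
                                   0LL, thrust::plus<long long>());
    return static_cast<float>(sum) / static_cast<float>(d_data.size());
}
```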
Dedicate 1 thread per column and hence you have 2048 threads working for you.
Arranging the array in row-major order (the usual C order) will make sure that the memory accesses are coalesced: neighbouring threads in a warp read neighbouring columns of the same row, i.e. adjacent addresses.
Each thread needs to maintain a per-thread local variable that will hold the sum as the thread iterates over the column.
Once the sum is calculated, just divide it by 2048 to get the average…
Now have a shared memory array of size blockDim.x. Each thread can write its average into shared memory…
Thus you have the averages of blockDim.x columns in shared memory…
Choose your blockDim.x so that your further reduction operation makes the best use of shared memory (take CUDA occupancy into consideration)… A sketch of such a kernel follows below.
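Here is a minimal sketch of the per-column kernel described above, assuming row-major data of size width x height in global memory; the kernel and variable names are illustrative, and the shared-memory staging is only useful if you actually do a further per-block reduction:

```
// Hypothetical kernel: one thread per column; the row-major layout means
// neighbouring threads in a warp load neighbouring addresses (coalesced).
__global__ void columnAverage(const short* data, float* colAvg,
                              int width, int height)
{
    extern __shared__ float sAvg[];              // blockDim.x floats

    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread local sum over the column; int is large enough here,
    // since 2048 * 32767 fits comfortably in 32 bits.
    int sum = 0;
    if (col < width)
        for (int row = 0; row < height; ++row)
            sum += data[row * width + col];      // coalesced across the warp

    // Stage the per-column average in shared memory so the block can run
    // whatever further reduction it needs on these values.
    sAvg[threadIdx.x] = (col < width) ? (float)sum / (float)height : 0.0f;
    __syncthreads();

    // Here we simply write the column averages back to global memory.
    if (col < width)
        colAvg[col] = sAvg[threadIdx.x];
}
```

A launch like columnAverage<<<2048 / 256, 256, 256 * sizeof(float)>>>(d_data, d_colAvg, 2048, 2048) gives you the 2048 threads, one per column.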
Let’s assume that AFTER that, I want to calculate the average of each ROW; what is the fastest way to do that? I think the fastest way is to reorganize the device’s global memory so that the accesses are coalesced again. This means I have to allocate a second area of memory on the device and then transform the rows into columns and vice versa. Or is there any trick to do it faster?
Yes. Directly in CUDA you have no more efficient way than keeping an alternate copy. You could create the alternate copy with another GPU kernel (a transpose kernel).
Since global memory contents persist across kernel launches, you can do this easily; a sketch of such a transpose kernel follows below.
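A minimal sketch of such a transpose kernel, along the lines of the SDK transpose sample: a shared-memory tile keeps both the reads and the writes coalesced, and the +1 padding avoids bank conflicts. TILE_DIM, the names and the short element type are my assumptions.

```
#define TILE_DIM 16   // launch with dim3 block(TILE_DIM, TILE_DIM)

// Hypothetical tiled transpose of a width x height matrix into a
// height x width matrix; both the load and the store are coalesced.
__global__ void transposeKernel(const short* in, short* out,
                                int width, int height)
{
    __shared__ short tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write the tile back transposed: block (bx, by) lands at (by, bx).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

After that kernel, the row averages of the original array are just the column averages of the transposed copy, so the same per-column kernel can be reused.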
The trick is to see it from a higher level.
For example, det(A) = det(A^T) { I think so… det stands for determinant }
So, if your problem can be solved for A^T instead of A, and if the final result can easily be transformed back to suit A, then you can go for that…