GPU vs CPU - how large can threads be?

I have a large parallel program that was written for a cluster. The software processes a matrix of approximately [30,000 x 40,000] single-precision complex samples. The current method of processing splits the matrix into 40,000 vectors and distributes them across the N processors, so each CPU gets roughly 40,000/N vectors to process.

My question is this. Is it possible (or appropriate) to use CUDA threads in place of the CPUs in the scheme above?

I have been reading that CUDA threads are lightweight, and the examples I see are threads performing simple scalar calculations. In the above CPU example, each CPU thread performs many FFTs, FFT shifts, and a lot of element-wise calculations.

There is virtually no limit to the amount of work a CUDA thread can do (but keep an eye on the watchdog timer if your kernel runs for more than about 5 seconds on a GPU that also drives a display).

There are certain limits to the complexity of the code, in terms of:
a) the number of registers per thread
b) spilling to local memory when under register pressure
c) the availability and use of shared memory
d) the total number of instructions that fit into CUDA's instruction cache
e) the size of the L1/L2 caches on the Fermi architecture

There is a sweet spot in performance that is sometimes hard to find. If you miss it, you are not making full use of CUDA's potential (i.e. low occupancy of the hardware, or memory bandwidth wasted on excessive local memory use).

A simple for() loop iterating across 40,000/N vector elements is not an issue at all. Just create enough thread blocks to fill the available multiprocessors in the hardware and make sure each thread block runs a decent number of parallel threads (typically 96 or 128 threads minimum). Be sure to access your data with coalesced access patterns, as in the sketch below.
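For example, here is a minimal sketch of that pattern (fun1 is a placeholder of my own, not your actual calculation), using a grid-stride loop so that consecutive threads always touch consecutive elements and loads coalesce:

#include <cuComplex.h>

// Hypothetical element-wise step; stands in for one of your calculations.
__device__ cuComplex fun1(cuComplex x)
{
    return cuCmulf(x, x);   // placeholder: square each sample
}

__global__ void elementwise(cuComplex *data, size_t n)
{
    // Grid-stride loop: each thread starts at its global index and hops
    // by the total thread count, so any grid size covers all n elements
    // and neighbouring threads read neighbouring addresses.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        data[i] = fun1(data[i]);
    }
}

// e.g. elementwise<<<512, 128>>>(d_data, 30000UL * 40000UL);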

Most likely it will work, since you probably don't have much branch divergence. Can you split the task so that each thread deals with one vector, giving you 40,000 threads?

With the matrix of dimensions [30,000 (samples) x 40,000 (vectors)], each vector is independent. So yeah, I could have 40,000 independent threads running to perform the vector calculations.

Interesting. As an example, here’s a faked-up OpenMP-like implementation:

#pragma omp parallel for
for s = 1:available_threads
   % each thread handles its own slice of the 40,000 vectors
   for v = (s-1)*vectors_per_thread + 1 : s*vectors_per_thread
      pong(v) = fft( ping(v) );
      ping(v) = fftshift( pong(v) );
      pong(v) = fun1( ping(v) );
      ping(v) = fun2( pong(v) );
      pong(v) = ifft( ping(v) );
   end
end

So it should not be an issue to run that inner code on each thread (barring the limitations you listed)?

That’s good to hear. It would make the porting of the existing code much easier as a first cut.

Hard to say, but it sounds like a feasible problem: large in size, with high arithmetic intensity and SIMD-like operations.

Your problem size is (3×10^4) × (4×10^4) × 8 bytes (complex single precision) ≈ 9.6 GB, which will not fit in GPU memory all at once, so expect to do some tinkering to stream the matrix through the card in chunks.

You would probably want to divide your CPU workloads not into single CUDA threads but into one or several thread blocks. Meaning that you would have one or several thread blocks doing the processing work of what a single CPU used to do (you would end up having 40,000+ thread blocks, e.g. one per vector).
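A rough sketch of that mapping, assuming the vectors are stored contiguously (sample index fastest) and using a placeholder element-wise step; the FFTs themselves would go through a library call rather than code like this:

#include <cuComplex.h>

// One thread block per vector: blockIdx.x selects the vector, and the
// block's threads stride across its 30,000 samples together.
__global__ void per_vector(cuComplex *matrix, int samples_per_vector)
{
    cuComplex *vec = matrix + (size_t)blockIdx.x * samples_per_vector;

    for (int s = threadIdx.x; s < samples_per_vector; s += blockDim.x)
        vec[s] = cuConjf(vec[s]);   // placeholder for fun1/fun2
}

// One block per vector, 128 threads per block:
// per_vector<<<40000, 128>>>(d_matrix, 30000);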

Excellent. I appreciate the quick replies. I guess I will dive in.

Thanks all

If your particular implementation of FFT/IFFT uses recursion, it might not even compile on CUDA.

You might want to investigate using CuFFT in batched mode instead. It offers quite a good implementation and lets you focus on other things.
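A sketch of the batched pattern, assuming you process the matrix in chunks of `batch` vectors at a time (all 40,000 at once would need ~9.6 GB):

#include <cufft.h>

// One plan executes `batch` 30,000-point C2C transforms in a single call.
// Note: cuFFT's inverse transform is unnormalized, so scale by 1/30000
// afterwards if you need a true round trip.
void fft_chunk(cufftComplex *d_ping, cufftComplex *d_pong, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, 30000, CUFFT_C2C, batch);

    cufftExecC2C(plan, d_ping, d_pong, CUFFT_FORWARD);
    // ... element-wise kernels (fftshift, fun1, fun2) on d_pong ...
    cufftExecC2C(plan, d_pong, d_ping, CUFFT_INVERSE);

    cufftDestroy(plan);
}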

Christian

It's good that the function is the same on all threads. You can get good coalesced access if you arrange the data properly. You also need to find the right balance between register, local, and global memory usage.
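To illustrate the layout point: if you go with one thread per vector, coalescing wants the matrix stored with the vector index fastest, so that at each loop step a warp's 32 threads read 32 adjacent samples. A sketch under that assumption (the squaring is again just placeholder work):

#include <cuComplex.h>

// One thread per vector. Element (s, v) lives at data[s * num_vectors + v],
// so the 32 threads of a warp always load 32 consecutive cuComplex values.
__global__ void thread_per_vector(cuComplex *data, int num_samples, int num_vectors)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's vector
    if (v >= num_vectors) return;

    for (int s = 0; s < num_samples; ++s) {
        cuComplex x = data[(size_t)s * num_vectors + v];   // coalesced
        x = cuCmulf(x, x);                                 // placeholder work
        data[(size_t)s * num_vectors + v] = x;
    }
}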