GPU vs CPU - how large can threads be?

I have a large parallel program that was written for a cluster. The software processes a matrix of approx. dimensions [30,000 x 40,000] single precision complex samples. The current method of processing splits the matrix into 40,000 vectors and sends each vector to the N processors. Thus, each CPU would get approx. 40,000/N vectors to process.

My question is this. Is it possible (or appropriate) to use the CUDA threads to replace the CPU as shown above?

I have been reading that CUDA threads are lightweight, and the examples I see are threads performing simple scalar calculations. In the above CPU example, each CPU thread is performing many FFTs, FFT shifts, and a lot of element-wise calculations.

There is virtually no limit to the amount of work a CUDA thread can do (but keep an eye on the watchdog timer if your computation runs for more than 5 seconds).

There are certain limits to the complexity of the code in terms of
a) number of registers per thread
b) use of local memory when under register pressure
c) availability and use of shared memory
d) total number of instructions that fit into CUDA’s instruction cache
e) size of L1/L2 caches on the Fermi architecture

There is a sweet spot in performance that is sometimes hard to find. If you miss it, you are not making full use of CUDA’s potential (e.g. low occupancy of the hardware, or memory bandwidth wasted on excessive local memory use).
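To get a feel for how register use limits occupancy, here is a rough sketch in Python. It is only an estimate under stated assumptions: the default values of `regs_per_sm` and `max_threads_per_sm` are illustrative figures for a Fermi-class multiprocessor and should be checked against the documentation for your card, and the model ignores shared memory, the per-multiprocessor block limit, and register allocation granularity.

```python
def resident_threads(regs_per_thread, threads_per_block,
                     regs_per_sm, max_threads_per_sm):
    """Threads resident on one multiprocessor, limited by registers and the thread cap."""
    # Whole blocks only: a block is resident entirely or not at all.
    blocks_by_registers = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_threads = max_threads_per_sm // threads_per_block
    return min(blocks_by_registers, blocks_by_threads) * threads_per_block


def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=32768, max_threads_per_sm=1536):
    """Fraction of the thread cap that is actually resident (assumed hardware limits)."""
    resident = resident_threads(regs_per_thread, threads_per_block,
                                regs_per_sm, max_threads_per_sm)
    return resident / max_threads_per_sm
```

With these assumed limits, a kernel using 20 registers per thread at 128 threads per block can fill the multiprocessor, while one using 63 registers per thread is limited to a third of it.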

A simple for() loop iterating across 40000/N vector elements is not an issue at all. Just create enough thread blocks to fill the available number of multiprocessors in the hardware and make sure each thread block has a decent number of parallel threads going (typically 96 or 128 threads minimum). Be sure to access your data with coalesced access patterns.
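As a sketch of the launch arithmetic and of what "coalesced" means here, the following Python models the one-thread-per-vector case on the host. The function names are made up for illustration, and the two storage layouts are assumptions about how the matrix could be arranged in memory, not a description of the existing code.

```python
def blocks_needed(num_items, threads_per_block):
    """Number of thread blocks needed to give every item its own thread."""
    return (num_items + threads_per_block - 1) // threads_per_block


def global_thread_id(block_index, thread_index, threads_per_block):
    """Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a kernel."""
    return block_index * threads_per_block + thread_index


def address_vector_major(sample, vector, num_samples):
    """Each vector stored contiguously: neighbouring threads are num_samples apart."""
    return vector * num_samples + sample


def address_sample_major(sample, vector, num_vectors):
    """Vectors interleaved: neighbouring threads read neighbouring elements."""
    return sample * num_vectors + vector
```

For 40,000 vectors at 128 threads per block this gives 313 blocks, so the last block holds thread ids beyond 39,999 and a kernel would need a bounds check. With one thread per vector, only the sample-major layout puts neighbouring threads on neighbouring addresses at each loop iteration.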

Most likely it will work; you probably do not have much branch divergence. Can you split the task so that each thread deals with one vector, giving you 30000 threads?

With the matrix of dimensions [30,000 (samples) x 40,000 (vectors)], each vector is independent. So yeah, I could have 40,000 independent threads running to perform the vector calculations.

Interesting. As an example, here’s a faked-up OpenMP-like implementation:

#pragma omp parallel for
for s = 1:available_threads
    for v = 1:vectors_on_thread
        pong(v) = fft( ping(v) );
        ping(v) = fftshift( pong(v) );
        pong(v) = fun1( ping(v) );
        ping(v) = fun2( pong(v) );
        pong(v) = ifft( ping(v) );
    end
end



So it should not be an issue to run that inner code on each thread (barring the limitations you listed)?

That’s good to hear. It would make the porting of the existing code much easier as a first cut.
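For reference, the per-vector pipeline from the pseudocode can be written out as a small runnable Python model. This is a sketch only: `fun1` and `fun2` are passed in as placeholders because the thread does not say what they compute, and a naive O(n^2) DFT stands in for a real FFT to keep the example short.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for fft/ifft)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def fftshift(x):
    """Rotate the list so the zero-frequency bin moves to the centre."""
    k = (len(x) + 1) // 2
    return x[k:] + x[:k]


def process_vector(ping, fun1, fun2):
    """One pass of the ping/pong pipeline for a single vector."""
    pong = dft(ping)
    ping = fftshift(pong)
    pong = fun1(ping)          # placeholder element-wise step
    ping = fun2(pong)          # placeholder element-wise step
    return dft(ping, inverse=True)


def process_all(vectors, fun1, fun2):
    """Every vector is independent, so this loop is what gets parallelised."""
    return [process_vector(v, fun1, fun2) for v in vectors]
```

Because `process_all` has no dependency between iterations, the loop body is the unit of work that would be handed to a thread or a thread block.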

Hard to say, but it sounds like a feasible problem: a large problem with high arithmetic intensity and SIMD-like operations.

Your problem size is (3*10^4) * (4*10^4) * 8 bytes (complex single precision) * 10^-9 ~= 9.6 GB (that is, about 1.2 * 10^9 samples), which will not fit in the memory of many cards, so expect to do some tinkering, such as processing the vectors in chunks.
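The size estimate can be checked directly, and the same arithmetic tells you how many vectors fit on the card at once. A minimal Python sketch follows; the memory budget is a parameter you would set for your own card, and the assumption of two buffers per vector comes from the ping/pong scheme in the pseudocode.

```python
BYTES_PER_SAMPLE = 8  # single-precision complex: two 4-byte floats


def matrix_bytes(num_samples, num_vectors):
    """Total storage for the full matrix, in bytes."""
    return num_samples * num_vectors * BYTES_PER_SAMPLE


def vectors_per_batch(memory_budget_bytes, num_samples, buffers=2):
    """How many vectors fit in the budget if each needs `buffers` copies."""
    return memory_budget_bytes // (buffers * num_samples * BYTES_PER_SAMPLE)


def batches_needed(num_vectors, per_batch):
    """Number of chunks needed to cover all vectors."""
    return (num_vectors + per_batch - 1) // per_batch
```

For example, with a hypothetical budget of 10^9 bytes, `vectors_per_batch(10**9, 30000)` gives 2083 vectors per chunk, and `batches_needed(40000, 2083)` gives 20 chunks.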

You would probably want to divide your CPU workloads not into single CUDA threads but into one or several thread blocks. That is, one or several thread blocks would do the processing work that a single CPU used to do (you would end up with 40,000+ thread blocks).
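One way to read this suggestion is one thread block per vector, with the threads of the block striding across the samples of that vector. The Python below models that partitioning on the host; `samples_for_thread` is a made-up helper name, and the one-block-per-vector mapping is an assumption for illustration.

```python
def samples_for_thread(thread_index, threads_per_block, num_samples):
    """Sample indices handled by one thread of the block that owns a vector.

    Mirrors a kernel loop of the form
        for (i = threadIdx.x; i < num_samples; i += blockDim.x)
    """
    return range(thread_index, num_samples, threads_per_block)
```

Within one loop iteration, neighbouring threads touch neighbouring samples of the same vector, so a layout that stores each vector contiguously gives coalesced access under this mapping.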

Excellent. I appreciate the quick replies. I guess I will dive in.

Thanks all

If your particular implementation of FFT/IFFT uses recursion, it might not even compile on CUDA.
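If recursion does turn out to be a problem, an FFT can be written with loops only. Below is a minimal sketch of an iterative radix-2 Cooley-Tukey FFT in Python, shown only to illustrate the loop structure. It assumes a power-of-two length, which 30,000 is not, so a real port would need padding or a mixed-radix algorithm.

```python
import cmath


def fft_iterative(x):
    """In-order radix-2 FFT without recursion (length must be a power of two)."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(x)

    # Bit-reversal permutation: put inputs in the order the butterflies expect.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # Butterfly passes over sub-transforms of length 2, 4, 8, ..., n.
    length = 2
    while length <= n:
        w_step = cmath.exp(-2j * cmath.pi / length)
        half = length // 2
        for start in range(0, n, length):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        length <<= 1
    return a
```

The three nested loops have fixed, data-independent bounds, which is the property that makes this form easier to map onto threads than a recursive one.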

You might want to investigate using cuFFT in batched mode instead. It offers a quite good implementation and lets you focus on other things.
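For batched 1D transforms, cuFFT's `cufftPlanMany` describes the data layout with a stride between consecutive elements of one signal (`istride`) and a distance between the starts of consecutive signals (`idist`). The Python below only models that indexing so you can see which values match your storage; the helper names are hypothetical and nothing here calls cuFFT.

```python
def batched_element_index(batch, x, stride, dist):
    """Position of element x of signal `batch` in a strided, batched 1D layout."""
    return batch * dist + x * stride


def contiguous_layout(signal_length):
    """Each vector stored back to back: unit stride, signals signal_length apart."""
    return {"stride": 1, "dist": signal_length}


def interleaved_layout(batch_count):
    """Vectors interleaved element by element: stride batch_count, signals 1 apart."""
    return {"stride": batch_count, "dist": 1}
```

For this problem the transform length would be 30,000 and the batch count 40,000 (or the chunk size, if the matrix is processed in pieces).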


It is good that the function is the same on all threads. You may get good coalesced access if you arrange the data properly. You also need to check for the optimal balance between register, local memory, and global memory usage.