Is CUDA worth it when my algorithm cannot use coalesced memory?

zombi3 · June 10, 2014, 7:44am

My algorithm uses random pointers, so I can’t coalesce the memory access. Also, I have more data than can fit on my GPU, so I have to make transfers. I have 30 GB of data (that fits in RAM) and 1 GB of GPU memory. I would have to make 60 transfers, with no memory coalescing while on the GPU. So is it worth it, or should I just keep the processing on the CPU?

GPU = PCIe 3.0 x16, ~6 GB/s bandwidth
CPU = 8 cores
RAM = 32 GB PC3 1333 MHz, ~10 GB/s bandwidth

(Just transfer times…)
On the CPU, covering the data would take 3 seconds;
on the GPU, it would take 10 seconds for transfers and less for processing (I think).

Are my assumptions right? Should I just use the CPU?
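
The transfer-time estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch using the figures quoted in the post; it assumes every chunk is copied to the GPU and copied back (which is what 60 transfers for 30 chunks implies), that the quoted bandwidths are actually sustained, and that transfers are not overlapped with computation. The function and constant names are made up for illustration.

```python
# Back-of-envelope check of the numbers in the post (all figures are the poster's).
DATA_GB = 30.0        # data set size, resident in host RAM
GPU_MEM_GB = 1.0      # GPU memory available per chunk
PCIE_GBPS = 6.0       # quoted PCIe bandwidth, GB/s
RAM_GBPS = 10.0       # quoted host RAM bandwidth, GB/s


def estimate(data_gb, gpu_mem_gb, pcie_gbps, ram_gbps):
    """Return (transfer count, GPU transfer seconds, CPU sweep seconds)."""
    chunks = int(-(-data_gb // gpu_mem_gb))   # ceiling division
    transfers = 2 * chunks                    # assumption: each chunk goes up and comes back
    gpu_transfer_s = 2 * data_gb / pcie_gbps  # host->device plus device->host
    cpu_sweep_s = data_gb / ram_gbps          # one pass over the data in RAM
    return transfers, gpu_transfer_s, cpu_sweep_s


transfers, gpu_s, cpu_s = estimate(DATA_GB, GPU_MEM_GB, PCIE_GBPS, RAM_GBPS)
print(f"{transfers} transfers, {gpu_s:.1f} s over PCIe, {cpu_s:.1f} s CPU sweep")
```

With the quoted figures this gives 60 transfers, 10 seconds of PCIe traffic, and a 3 second CPU sweep per time step, so under these assumptions the GPU version cannot finish a step faster than the transfers alone allow.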

little_jimmy · June 10, 2014, 8:16am

memory coalescing is as much a function of how the algorithm is implemented as of the algorithm itself, and rewriting an algorithm for a parallel platform may very well cause you to fundamentally rethink the algorithm, with implications for how memory is accessed

coalesced memory aside for now, what is the average execution time of the algorithm on the cpu?
in a nutshell, what does the algorithm do?
also, is it a ‘critical’ application where certain performance objectives come into play?

these should be the deciding factors

And where do you get all that data from, in the first place…?

zombi3 · June 10, 2014, 8:57am

It’s an economic simulation, and 30 GB is analyzed every time step. The data represents people and their transactions. I could get ~10% coalescing with some tricks, but that’s not a big improvement. There is unavoidable randomness. With essentially no coalescing, the only limiting factor would seem to be memory speed (on both CPU and GPU), and all the CPU->GPU data transfers would seem to take too long. I really don’t want to program both because it’s a large model, so I need some kind of general guidance on which approach is most likely to be worth the time.
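
The point about memory speed being the limit can be put into a crude model. The sketch below is not a measurement: it assumes a coalesced access uses every byte of a memory transaction, while an uncoalesced access uses only one element of it. The 177 GB/s peak figure (roughly that of a GTX 480), the 8-byte element, and the 32-byte transaction size are illustrative assumptions; the real transaction size depends on the architecture and cache configuration.

```python
def effective_bandwidth(peak_gbps, element_bytes, transaction_bytes, coalesced_fraction):
    """Useful GB/s when only part of each memory transaction is actually needed.

    Assumption: coalesced accesses use every byte of a transaction, while
    uncoalesced ones use a single element per transaction.
    """
    if not 0.0 <= coalesced_fraction <= 1.0:
        raise ValueError("coalesced_fraction must be in [0, 1]")
    # Bytes moved per useful byte for a fully uncoalesced access.
    waste_ratio = transaction_bytes / element_bytes
    # Average over the coalesced and uncoalesced share of accesses.
    moved_per_useful = coalesced_fraction * 1.0 + (1.0 - coalesced_fraction) * waste_ratio
    return peak_gbps / moved_per_useful


# Hypothetical inputs: 177 GB/s peak, 8-byte elements, 32-byte transactions.
for frac in (0.0, 0.1, 1.0):
    print(frac, effective_bandwidth(177.0, 8, 32, frac))
```

Under these assumptions, going from 0% to 10% coalescing only moves the useful bandwidth from about 44 GB/s to about 48 GB/s, which matches the impression that 10% coalescing is not a big improvement.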

I don’t know the run times, but here is my hardware: GPU: NVIDIA GTX 480, CPU: Intel Core i7-840QM

cbuchner1 · June 10, 2014, 9:12am

if your data access isn’t completely random, consider putting your data into textures (or accessing it through the __ldg intrinsic). It’s not as good as coalesced access, but it will bring a good speed boost for clustered or local access at least.
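
Whether an access pattern counts as "clustered or local" can be checked on the CPU before any GPU code is written. The sketch below is a hypothetical helper, not part of CUDA: it groups an index stream into warps of 32 and counts how many distinct memory segments each warp would touch. The 128-byte segment and 4-byte element sizes are assumptions, and both index streams are synthetic stand-ins for the real access pattern.

```python
import random

WARP_SIZE = 32        # threads that issue a memory request together
SEGMENT_BYTES = 128   # assumed size of one memory segment / cache line
ELEMENT_BYTES = 4     # assumed element size


def segments_per_warp(indices):
    """Average number of distinct segments touched per group of WARP_SIZE indices."""
    per_segment = SEGMENT_BYTES // ELEMENT_BYTES
    counts = []
    for start in range(0, len(indices), WARP_SIZE):
        warp = indices[start:start + WARP_SIZE]
        counts.append(len({i // per_segment for i in warp}))
    return sum(counts) / len(counts)


rng = random.Random(0)
n = 1 << 20  # number of elements in the synthetic array

# Fully random: every thread reads anywhere in the array.
fully_random = [rng.randrange(n) for _ in range(WARP_SIZE * 1000)]

# Clustered: each warp reads within a 256-element window around a random base.
clustered = []
for _ in range(1000):
    base = rng.randrange(n - 256)
    clustered.extend(base + rng.randrange(256) for _ in range(WARP_SIZE))

print("random   :", segments_per_warp(fully_random))
print("clustered:", segments_per_warp(clustered))
```

A perfectly coalesced warp touches one segment. The closer the real index stream gets to that number, the more a cached read path such as textures is likely to help; a value near 32 suggests the accesses are effectively random.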

pasoleatis · June 10, 2014, 9:37am

When doing MD I have lots of uncoalesced reads. Somehow the code is still faster than the CPU version. I assume it is because of the L1 and L2 caches. I also use shared memory, because after the initial reads there are a lot of operations and I reuse the data.

little_jimmy · June 10, 2014, 1:22pm

in my experience, compared to cpu (serial) programming, gpu (or simply parallel) programming is more challenging, but also more rewarding: it requires more effort and thinking, and the effort pays off
whenever i switch from gpu programming to cpu programming, it feels like i am back in kindergarten

if you are pressed for time, serial programming would likely be the quicker solution; it also normally provides a basis to build gpu code from later
if you are serious about performance, and about getting your algorithm as close to optimal as possible, consider a parallel implementation

it has happened that rewriting a serial implementation for a parallel platform caused me to entirely rethink the algorithm and its implementation, leaving me with a version that runs much faster on both serial and parallel platforms
