My algorithm follows random pointers, so I can’t coalesce the memory accesses. I also have more data than fits on my GPU, so I have to transfer it in pieces: 30 GB of data (which fits in RAM) against 1 GB of GPU memory. That means about 60 transfers, with no memory coalescing once the data is on the GPU. Is it worth it, or should I just keep the processing on the CPU?
GPU = PCIe 3.0 x16, ~6 GB/s bandwidth
CPU = 8 cores
RAM = 32 GB PC3 1333 MHz, ~10 GB/s bandwidth
(Just transfer times…)
On the CPU, one pass over the data would take 3 seconds.
On the GPU, it would take 10 seconds for the transfers, plus less than that for the processing (I think).
Are my assumptions right? Should I just use the CPU?
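The transfer estimates above can be reproduced with a few lines of arithmetic. This is only a sketch: the bandwidth and size figures are the ones quoted in the question, while the reading of "60 transfers" as 30 chunks of 1 GB copied in and back out, and the helper name `sweep_seconds`, are assumptions made for illustration.

```python
import math

# Figures quoted in the question.
DATA_GB = 30.0          # data touched every time step
GPU_MEM_GB = 1.0        # device memory available
PCIE_GB_PER_S = 6.0     # host <-> device bandwidth
RAM_GB_PER_S = 10.0     # host memory bandwidth


def sweep_seconds(data_gb, bandwidth_gb_per_s):
    """Time to stream data_gb once at the given bandwidth (hypothetical helper)."""
    return data_gb / bandwidth_gb_per_s


cpu_sweep_s = sweep_seconds(DATA_GB, RAM_GB_PER_S)      # one pass over RAM
pcie_one_way_s = sweep_seconds(DATA_GB, PCIE_GB_PER_S)  # host -> device only
pcie_round_trip_s = 2 * pcie_one_way_s                  # assumes results are copied back

chunks = math.ceil(DATA_GB / GPU_MEM_GB)  # pieces that fit in device memory
transfers = 2 * chunks                    # assumption: one copy in and one copy out per chunk

print(f"CPU pass over RAM:    {cpu_sweep_s:.1f} s")
print(f"PCIe, one way:        {pcie_one_way_s:.1f} s")
print(f"PCIe, in and out:     {pcie_round_trip_s:.1f} s")
print(f"Chunks / transfers:   {chunks} / {transfers}")
```

These are bandwidth-only lower bounds. They ignore compute time on either side and any overlap of transfers with kernel execution, and the CPU figure assumes the random accesses actually reach the quoted RAM bandwidth.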
memory coalescing is as much a function of the algorithm’s implementation as of the algorithm itself, and rewriting an algorithm for a parallel platform may very well cause you to fundamentally rethink the algorithm, with implications for its memory access pattern
coalesced memory aside for now, what is the average execution time of the algorithm on the cpu?
in a nutshell, what does the algorithm do?
also, is it a ‘critical’ application, where certain performance objectives come into play?
these should be the deciding factors
And where do you get all that data from, in the first place…?
It’s an economic simulation, and all 30 GB is analyzed every time step. The data represents people and their transactions. I could get ~10% coalescing with some tricks, but that’s not much of an improvement; there is unavoidable randomness. With essentially no coalescing, the only limiting factor seems to be memory speed (on both CPU and GPU), and all the CPU->GPU data transfers look like they would take too long. I really don’t want to program both versions because it’s a large model, so I need some general guidance on which approach is most likely to be worth the time.
I don’t know the run times, but here is my hardware… GPU: NVIDIA GTX 480, CPU: Intel Core i7-840QM
if your data access isn’t completely random, consider putting your data into textures (or, on devices of compute capability 3.5 or higher, reading it through the __ldg intrinsic). It’s not as good as coalesced access, but it will at least give a good speed boost for clustered or local access.
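To see why clustered access sits between coalesced and fully random access, it can help to count how many memory lines one warp touches for a given index pattern. The sketch below is a simplified model, not a measurement: the 128-byte line size and 4-byte element size are assumptions, and `lines_touched` is a hypothetical helper.

```python
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
LINE_BYTES = 128   # assumed size of one memory line / transaction
ELEM_BYTES = 4     # assumed element size, e.g. one float


def lines_touched(indices, elem_bytes=ELEM_BYTES, line_bytes=LINE_BYTES):
    """Count the distinct memory lines hit when each thread reads data[index]."""
    return len({(i * elem_bytes) // line_bytes for i in indices})


# Thread t reads element t: the whole warp shares a single line.
coalesced = list(range(WARP_SIZE))

# Thread t reads element 32 * t: every thread lands on its own line.
scattered = [32 * t for t in range(WARP_SIZE)]

# Threads read in no particular order, but all within a 64-element window.
clustered = [(7 * t) % 64 for t in range(WARP_SIZE)]

for name, idx in [("coalesced", coalesced),
                  ("scattered", scattered),
                  ("clustered", clustered)]:
    print(f"{name:10s} -> {lines_touched(idx)} line(s) for {WARP_SIZE} reads")
```

In this model the clustered pattern needs only a couple of lines even though the reads are unordered, which is the kind of locality a cache (texture or otherwise) can exploit.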
When doing MD I have lots of uncoalesced reads. Somehow the code is still faster than the CPU version; I assume that is because of the L1 and L2 caches. I also use shared memory, because after the initial reads there are a lot of operations and I reuse the data.
in my experience, gpu (or simply parallel) programming is more challenging than cpu (serial) programming, but also more rewarding: it requires more effort and thinking, and it pays that effort back
whenever i switch from gpu programming to cpu programming, it feels like i am back in kindergarten
if you are pressed for time, serial programming would likely be the quicker route; what is also nice about it is that it normally provides a basis to build the gpu code from later
if you are serious about performance, and about getting your algorithm as optimal as possible, consider a parallel implementation
it has happened that rewriting a serial implementation for a parallel platform caused me to entirely rethink the algorithm and its implementation, leaving me with a version that runs much faster on both serial and parallel platforms