Parallelization schemes

What schemes do you use when processing large datasets?

Hi, everybody!
While using a GPU-accelerated cluster to process large amounts of data, I've tried only two parallelization schemes:

  1. 1 CPU - N GPU
  2. N CPU - N GPU

If there are any other variants, it would be good to collect them in this topic.
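
As a concrete illustration of scheme 1, here is a minimal sketch of my own (the kernel and the buffer sizes are placeholders, not from any real application) of a single host thread driving every visible GPU. With a modern CUDA runtime (4.0+), cudaSetDevice lets one thread switch between devices, and since kernel launches are asynchronous, the GPUs run concurrently:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 16) nDevices = 16;   // d_buf below holds at most 16

    const int n = 1 << 20;              // elements per GPU (placeholder)
    float *d_buf[16];

    // One host (CPU) thread sets up and launches work on every device;
    // the launches are asynchronous, so the GPUs work concurrently.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d_buf[dev], n * sizeof(float));
        cudaMemset(d_buf[dev], 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_buf[dev], n, 2.0f);
    }

    // Wait for every device to finish, then clean up.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_buf[dev]);
    }
    printf("done on %d device(s)\n", nDevices);
    return 0;
}
```

Scheme 2 is the same idea with one host thread (or MPI process) per device instead of the loop.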

0 CPU, N GPU?

M CPU, N GPU?

0 CPU, 0 GPU

I’ve found this to be quite slow though :)

A rather funny solution :) But seriously, I've searched, even in books, for GPU programming patterns and found nothing. That's why I started this topic. For now I'm testing some variants of patterns (parallelization schemes), but I've stopped at only these two.

Seriously, parallelizing algorithms on the GPU is more about things like deciding the number of data items processed per thread and the number of threads per block (to make efficient use of the hardware) than simple stuff like the number of CPUs and GPUs (although this obviously has a big effect on clusters). The CUDA best practices guide has plenty on this subject.
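
For example (a sketch of mine, not something from the guide), a grid-stride loop makes exactly those two knobs explicit: the threads-per-block and blocks-per-grid you pick determine how many items each thread ends up processing. All sizes here are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Grid-stride loop: each thread handles every
    // (blockDim.x * gridDim.x)-th element, so one launch
    // configuration covers any n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    const int threadsPerBlock = 256;  // knob 1: threads per block
    const int blocks = 1024;          // knob 2: fewer blocks means more
                                      // items per thread, and vice versa
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```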

I’m not a fan of design “patterns” in general.

The developer guide seems to suggest a two-level parallelization that suits the virtual architecture. What the SIMT model demands is fine-grain decomposition at the warp/thread level and coarse-grain, shared-memory parallelism at the block level. I think it's better to separate these levels in the code as much as possible, so you can recombine the parallel algorithms later in different ways. It's also smart to make the code as device-agnostic as possible, so you get good performance on any card with sufficient capability. All of this is very nicely covered in the best practices guide, as you suggest. CUDA introduces a lot of architectural complexity to deal with, probably on par with SMP clusters.
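
To make that separation concrete, here is a hedged sketch (the function name and the 256-thread block size are my assumptions) showing both levels in one kernel: each block coarsely owns one tile of the input, while the threads inside the block cooperate on the fine-grain tree reduction through shared memory:

```cuda
#include <cuda_runtime.h>

// Assumes blockDim.x == 256 (a power of two), matching the tile below.
__global__ void blockSum(const float *in, float *blockOut, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Coarse grain: each block owns one contiguous tile of the input.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Fine grain: threads within the block cooperate on a tree
    // reduction through shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockOut[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```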

I think this architecture is also great for trying out new programming paradigms, and for compiler research. I've been playing around with complex kernels; the idea is to port a coarse-grain parallel algorithm from MPI to the kernel level, and I'm still curious how well it will perform without any fine-grain algorithms! So my approach is to think of blocks like MPI processes, and if I need more performance, I'll try to substitute some lower-level algorithms with fine-grain variants…
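
Purely as an illustration of what I mean by blocks as MPI processes (the kernel and the partitioning below are hypothetical), each block owns one partition of the data and never communicates with the other blocks inside the kernel, much like a rank with its local data:

```cuda
#include <cuda_runtime.h>

// "Rank" = blockIdx.x: each block works on its own partition and never
// talks to the other blocks inside the kernel, like an MPI process
// with its local data. The work itself is a placeholder.
__global__ void perBlockTask(float *data, int itemsPerBlock) {
    float *part = data + (size_t)blockIdx.x * itemsPerBlock;

    // Threads in the block cooperate on the rank's partition; this is
    // the spot where a fine-grain algorithm could be swapped in later.
    for (int i = threadIdx.x; i < itemsPerBlock; i += blockDim.x)
        part[i] = part[i] * 0.5f + 1.0f;
}

int main() {
    const int numBlocks = 8;            // "8 ranks" (illustrative)
    const int itemsPerBlock = 1 << 16;
    const size_t bytes = (size_t)numBlocks * itemsPerBlock * sizeof(float);
    float *data;
    cudaMalloc(&data, bytes);
    cudaMemset(data, 0, bytes);
    perBlockTask<<<numBlocks, 128>>>(data, itemsPerBlock);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```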

Just my 2 cents.

Cheers,

Eray