Parallelization schemes

What schemes do you use when processing large datasets?

Hi, everybody!
While using a GPU-accelerated cluster to process large amounts of data, I've tried only two parallelization schemes:

  1. 1 CPU - N GPU
  2. N CPU - N GPU

If there are any other variants, it would be good to collect them in this topic.
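
As a concrete illustration of scheme 1, here is a minimal sketch of my own (the kernel and the buffer sizes are placeholders, not from any real application) of a single host thread driving every visible GPU. With a modern CUDA runtime (4.0+), cudaSetDevice lets one thread switch between devices, and since kernel launches are asynchronous, the GPUs run concurrently:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 16) nDevices = 16;   // d_buf below holds at most 16

    const int n = 1 << 20;              // elements per GPU (placeholder)
    float *d_buf[16];

    // One host (CPU) thread sets up and launches work on every device;
    // the launches are asynchronous, so the GPUs work concurrently.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d_buf[dev], n * sizeof(float));
        cudaMemset(d_buf[dev], 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_buf[dev], n, 2.0f);
    }

    // Wait for every device to finish, then clean up.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_buf[dev]);
    }
    printf("done on %d device(s)\n", nDevices);
    return 0;
}
```

Scheme 2 is the same idea with one host thread (or MPI process) per device instead of the loop.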

0 CPU, N GPU?

M CPU, N GPU?

0 CPU, 0 GPU

I’ve found this to be quite slow though :)

A rather funny solution :) But seriously, I've searched, even in books, for GPU programming patterns and found nothing. That's why I started this topic. For now I'm testing some variants of patterns (parallelization schemes), but I've stopped at only these two.

Seriously, parallelizing algorithms on the GPU is more about things like deciding the number of data items processed per thread and the number of threads per block (to make efficient use of the hardware) than simple stuff like the number of CPUs and GPUs (although this obviously has a big effect on clusters). The CUDA best practices guide has plenty on this subject.
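
For example (a sketch of mine, not something from the guide), a grid-stride loop makes exactly those two knobs explicit: the threads-per-block and blocks-per-grid you pick determine how many items each thread ends up processing. All sizes here are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Grid-stride loop: each thread handles every
    // (blockDim.x * gridDim.x)-th element, so one launch
    // configuration covers any n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    const int threadsPerBlock = 256;  // knob 1: threads per block
    const int blocks = 1024;          // knob 2: fewer blocks means more
                                      // items per thread, and vice versa
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```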

I’m not a fan of design “patterns” in general.

The developer guide seems to suggest a two-level parallelization that suits the virtual architecture. What the SIMT model demands is fine-grain decomposition at the warp/thread level and coarse-grain, shared-memory parallelism at the block level. I think it's better to separate these levels in the code as much as possible, so you can recombine the parallel algorithms later in different ways. It's also smart to make the code as device-agnostic as possible, so you get good performance on any card with sufficient capability. All of this is very nicely covered in the best practices guide, as you suggest. CUDA introduces a lot of architectural complexity to deal with, probably on par with SMP clusters.
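
To make that separation concrete, here is a hedged sketch (the function name and the 256-thread block size are my assumptions) showing both levels in one kernel: each block coarsely owns one tile of the input, while the threads inside the block cooperate on the fine-grain tree reduction through shared memory:

```cuda
#include <cuda_runtime.h>

// Assumes blockDim.x == 256 (a power of two), matching the tile below.
__global__ void blockSum(const float *in, float *blockOut, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Coarse grain: each block owns one contiguous tile of the input.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Fine grain: threads within the block cooperate on a tree
    // reduction through shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockOut[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```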

I think this architecture is also great for trying out new programming paradigms, and for compiler research. I've been playing around with complex kernels; the idea is to port a coarse-grain parallel algorithm from MPI to the kernel level, and I'm still curious how well it will perform without any fine-grain algorithms! So my approach is to think of blocks like MPI processes, and if I need more performance, I'll try to substitute some lower-level algorithms with fine-grain variants…
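
Purely as an illustration of what I mean by blocks as MPI processes (the kernel and the partitioning below are hypothetical), each block owns one partition of the data and never communicates with the other blocks inside the kernel, much like a rank with its local data:

```cuda
#include <cuda_runtime.h>

// "Rank" = blockIdx.x: each block works on its own partition and never
// talks to the other blocks inside the kernel, like an MPI process
// with its local data. The work itself is a placeholder.
__global__ void perBlockTask(float *data, int itemsPerBlock) {
    float *part = data + (size_t)blockIdx.x * itemsPerBlock;

    // Threads in the block cooperate on the rank's partition; this is
    // the spot where a fine-grain algorithm could be swapped in later.
    for (int i = threadIdx.x; i < itemsPerBlock; i += blockDim.x)
        part[i] = part[i] * 0.5f + 1.0f;
}

int main() {
    const int numBlocks = 8;            // "8 ranks" (illustrative)
    const int itemsPerBlock = 1 << 16;
    const size_t bytes = (size_t)numBlocks * itemsPerBlock * sizeof(float);
    float *data;
    cudaMalloc(&data, bytes);
    cudaMemset(data, 0, bytes);
    perBlockTask<<<numBlocks, 128>>>(data, itemsPerBlock);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```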

Just my 2 cents.

Cheers,

Eray