MonteCarloMultiGPU from SDK: judging performance, threads vs. streams

I’m experimenting with multi-GPU computing and have some questions about the CUDA SDK example C/src/MonteCarloMultiGPU/ – maybe someone familiar with that example, or with using CUDA across multiple cards, can help shed some light. I’ve attached the original SDK file as MonteCarloMultiGPU.0.cpp.

My setup is not ideal (it’s asymmetric), but here’s the device ordering:

dev0 C2070
dev1 C1060
dev2 C1060
dev3 C2070

This will affect the experiments below, in that the two middle cards have lower FLOPS than the first and last.
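(For reference, the ordering above is just the CUDA runtime’s enumeration order; here is a minimal, self-contained sketch of how to print it, in case anyone wants to compare their own setup:)

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print each device index and name in the runtime's enumeration order.
    int main()
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; i++)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("dev%d %s\n", i, prop.name);
        }
        return 0;
    }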

I upped the number of options from 256 to 16384 (=64*256), and to keep this forum topic simple for now, I’m only considering the streamed version (not threaded).

First off, I think there’s a bug in the example’s threaded printout: when it prints the options calculated by each thread/GPU, it uses the total number of options (OPT_N), not the number of options each GPU actually ended up calculating (optionSolver[i].optionCount). I fixed that in the results below; code attached as MonteCarloMultiGPU.1.cpp.
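The gist of the fix (a paraphrased sketch – the exact printf() line and timing variable in the SDK source differ slightly):

    // Before (paraphrased): every GPU/thread reports the rate for ALL options
    printf("Options per sec.: %f\n", OPT_N / (time * 0.001));

    // After: each GPU reports only the options it actually computed
    printf("Options per sec.: %f\n",
           optionSolver[i].optionCount / (time * 0.001));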

The experiment: I want to see whether adding more GPUs increases the example’s throughput, so I tweaked the example to accept a parameter specifying how many GPUs to use (by default it uses as many as are available).
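The flag handling is roughly this (a sketch assuming the SDK’s cutil command-line helpers; the attached file has the real version):

    // Hypothetical sketch: cap the detected GPU count with an -ngpu=N flag
    int GPU_N;
    cudaGetDeviceCount(&GPU_N);

    int nRequested = GPU_N;
    cutGetCmdLineArgumenti(argc, (const char **)argv, "ngpu", &nRequested);
    if (nRequested > 0 && nRequested < GPU_N)
        GPU_N = nRequested;

Here is the output: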

bash$ for n in `seq 1 4`; do echo $n; ./MonteCarloMultiGPU -ngpu=$n -type=single -method=streamed | egrep "Options per|threaded|streamed"; done

1
main(): GPU statistics, streamed
Options per sec.: 143719298.170471
2
main(): GPU statistics, streamed
Options per sec.: 111455782.403310
3
main(): GPU statistics, streamed
Options per sec.: 129007869.898552
4
main(): GPU statistics, streamed
Options per sec.: 39574878.794028

Notice that using more GPUs tends to decrease throughput.

Any suggestions for things I could try next? Maybe I’m not evaluating things properly?
MonteCarloMultiGPU.1.cpp (15.1 KB)
MonteCarloMultiGPU.0.cpp (14.8 KB)

Same result with the 285.03 driver (stream performance still looks serialized), but with slightly more consistent timings.

bash$ for n in `seq 1 4`; do echo $n; ./MonteCarloMultiGPU -ngpu=$n -type=single -method=streamed | egrep "Options per|streamed"; done

1
main(): GPU statistics, streamed
Options per sec.: 159067960.980948
2
main(): GPU statistics, streamed
Options per sec.: 146285709.770358
3
main(): GPU statistics, streamed
Options per sec.: 140034189.603651
4
main(): GPU statistics, streamed
Options per sec.: 111455782.403310

I believe I was wrong to “correct” the way the threaded version calculates options per second: the numerator should still be the total option count (across all GPUs), even though each GPU computes only a fraction of those. The point is to divide total options by each GPU’s own time rather than by the total time, so each printout shows the throughput the whole job would reach if every GPU were as fast as that one – i.e., we see the fastest path.
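In code terms, the metric is roughly (a sketch; the timing variable name is assumed, not the SDK’s):

    // Extrapolated per-GPU rate: TOTAL options over this GPU's own time.
    // A GPU that finishes quickly reports the rate the whole job would
    // achieve if every GPU kept up with it.
    double optionsPerSec = OPT_N / (gpuTime[i] * 0.001);   // gpuTime[i] in ms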

I rearranged the GPUs so there are only two: dev0=C2070, dev1=C1060

I took the original SDK code (attached as MonteCarloMultiGPU.0.cpp) and made two modifications (attached as MonteCarloMultiGPU.1.cpp):

1. Increase the number of options being evaluated from OPT_N=256 to OPT_N=16384 (=64*256), to produce more appreciable timing differences.

2. Change the printf() format to scientific notation ("%f" --> "%.3e"), to produce clearer printouts for comparison. (Both changes are sketched below.)
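(A sketch of the two edits; the exact lines in the attached file may differ slightly:)

    // Modification 1: more options, for measurable timing differences
    #define OPT_N (64 * 256)   // was 256

    // Modification 2: scientific notation, for easier comparison
    printf("Options per sec.: %.3e\n", optionsPerSec);   // was "%f"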

# streamed
Options per sec.: 1.655e+08   # C2070 + C1060

# threaded
Options per sec.: 4.076e+07   # C2070
Options per sec.: 2.923e+06   # C1060

From this, it seems better to use one host thread to control all GPUs, rather than a thread for each GPU (in both cases using cudaSetDevice and the default stream 0).
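The single-host-thread pattern looks roughly like this (a sketch, not the SDK’s exact code; the kernel and buffer names are placeholders):

    // One host thread drives all GPUs. Kernel launches are asynchronous,
    // so switching devices and queuing work keeps every GPU busy at once.
    for (int i = 0; i < GPU_N; i++)
    {
        cudaSetDevice(i);
        MonteCarloKernel<<<grid, block>>>(d_data[i]);   // hypothetical kernel
    }
    // Only after everything is queued do we wait for the GPUs to finish.
    for (int i = 0; i < GPU_N; i++)
    {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }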

I made one more modification: instead of automatically using all GPUs (original code, results above), you can now pass an argument specifying how many GPUs to use (attached as MonteCarloMultiGPU.2.cpp).

# streamed
1
Options per sec.: 1.841e+08  # C2070
2
Options per sec.: 1.655e+08  # C2070 + C1060

# threaded
1
Options per sec.: 3.732e+07  # C2070
2
Options per sec.: 4.158e+07  # C2070
Options per sec.: 2.990e+07  # C1060

As expected, the two-GPU results (below each “2” printout) are consistent with the original runs above.

Two interesting items here:

1. For the streamed version, using two GPUs to compute the same total number of options results in a slowdown.

2. For the threaded version, using one GPU is significantly slower than the streamed version on the same single GPU.

Tentative conclusions:

1. It is better to use one host thread instead of a thread per GPU.

2. Something about this example causes serialization.
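(One experiment that might isolate the serialization: give each GPU its own non-default stream and pinned host buffers, so nothing implicitly synchronizes on stream 0. A sketch with placeholder buffer/kernel names, assuming a MAX_GPU_COUNT constant:)

    // Hypothetical: per-GPU non-default streams, to rule out implicit
    // synchronization on the default stream.
    cudaStream_t stream[MAX_GPU_COUNT];
    for (int i = 0; i < GPU_N; i++)
    {
        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
    }
    for (int i = 0; i < GPU_N; i++)
    {
        cudaSetDevice(i);
        // h_in[i] must be pinned (cudaHostAlloc) for a truly async copy
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
        MonteCarloKernel<<<grid, block, 0, stream[i]>>>(d_in[i]);
    }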

Anyone have an idea why using multiple GPUs on this “embarrassingly parallel” problem seems to only slow things down, or why streams are so much better than threads?

Jimi
MonteCarloMultiGPU.2.cpp (15 KB)
MonteCarloMultiGPU.0.cpp (14.8 KB)
MonteCarloMultiGPU.1.cpp (14.8 KB)