I’m experimenting with multi-GPU computing and have some questions about the CUDA SDK example C/src/MonteCarloMultiGPU/ – maybe someone familiar with that example, or with using CUDA across multiple cards, can help shed some light. I’ve attached the original SDK file MonteCarloMultiGPU.0.cpp
My setup is not the most ideal/symmetric, but here’s the device ordering:
This will affect the experiments below, in that the middle cards have lower FLOPS relative to the first and last.
I upped the number of options from 256 to 16384 (=64*256), and to keep this forum topic simple for now, I’m only considering the streamed version (not threaded).
First off, I think there’s a bug in the example’s threaded printout. When it prints the options calculated by each thread/GPU, it uses the total number of options (N_OPT) instead of the number of options each GPU actually calculated (optionSolver[i].optionCount). I fixed that in the results below; code attached as MonteCarloMultiGPU.1.cpp
The experiment: I want to see whether adding more GPUs increases the example’s throughput, so I tweaked the example to add a parameter that lets you specify how many GPUs to use (by default it uses as many as are available). Here is the output:
bash$ for n in `seq 1 4`; do echo $n; ./MonteCarloMultiGPU -ngpu=$n -type=single -method=streamed | egrep "Options per|threaded|streamed"; done
1
main(): GPU statistics, streamed
Options per sec.: 143719298.170471
2
main(): GPU statistics, streamed
Options per sec.: 111455782.403310
3
main(): GPU statistics, streamed
Options per sec.: 129007869.898552
4
main(): GPU statistics, streamed
Options per sec.: 39574878.794028
Notice that using more GPUs tends to decrease throughput rather than increase it – with all four GPUs, throughput drops to roughly a quarter of the single-GPU rate.