How to run SHOC on a dual GPU?

Hello everyone,

Does anybody know how I can run the SHOC benchmark on a dual GPU?
Cards like the NVIDIA K10 or K80 appear as two different GPUs. How do I use the full power of such a card?

SHOC has the ability to run various tests on multiple GPUs. Using a K80 would be conceptually the same as running SHOC on a machine with two K40s in it.

It should be, but I’m having problems with this… On the server, the dual GPUs appear as two different GPUs.

So, to run the benchmark I tried the command “perl ./tools/ -s 4 -d 0,1 -cuda” to use devices 0 and 1, but every benchmark reports an error, as shown below:

[shoc-master]$ perl ./tools/ -s 4 -d 0,1 -cuda
— Welcome To The SHOC Benchmark Suite version 1.1.5 —
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0 : 8
Device 0: ‘Tesla K80’
Device 1: ‘Tesla K80’
Device 2: ‘Tesla K80’
Device 3: ‘Tesla K80’
Device 4: ‘Tesla K80’
Device 5: ‘Tesla K80’
Device 6: ‘Tesla K80’
Device 7: ‘Tesla K80’
Specified 2 device IDs: 0,1
Using size class: 4

— Starting Benchmarks —
Running benchmark BusSpeedDownload
result for bspeed_download: BenchmarkError
Running benchmark BusSpeedReadback
result for bspeed_readback: BenchmarkError
Running benchmark MaxFlops
result for maxspflops: BenchmarkError
result for maxdpflops: BenchmarkError
Running benchmark DeviceMemory
result for gmem_readbw: BenchmarkError
result for gmem_readbw_strided: BenchmarkError
result for gmem_writebw: BenchmarkError
result for gmem_writebw_strided: BenchmarkError
result for lmem_readbw: BenchmarkError
result for lmem_writebw: BenchmarkError
result for tex_readbw: BenchmarkError
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark FFT
result for fft_sp: BenchmarkError
result for fft_dp: BenchmarkError
Running benchmark GEMM
result for sgemm_n: BenchmarkError
result for dgemm_n: BenchmarkError
Running benchmark MD
result for md_sp_flops: BenchmarkError
result for md_dp_flops: BenchmarkError
Running benchmark MD5Hash
result for md5hash: BenchmarkError
Running benchmark Reduction
result for reduction: BenchmarkError
result for reduction_dp: BenchmarkError
Running benchmark Scan
result for scan: BenchmarkError
result for scan_dp: BenchmarkError
Running benchmark Sort
result for sort: BenchmarkError
Running benchmark Spmv
result for spmv_csr_scalar_sp: BenchmarkError
result for spmv_csr_vector_sp: BenchmarkError
result for spmv_ellpackr_sp: BenchmarkError
result for spmv_csr_scalar_dp: BenchmarkError
result for spmv_csr_vector_dp: BenchmarkError
result for spmv_ellpackr_dp: BenchmarkError
Running benchmark Stencil2D
result for stencil: BenchmarkError
result for stencil_dp: BenchmarkError
Running benchmark Triad
result for triad_bw: BenchmarkError
Running benchmark S3D
result for s3d: BenchmarkError
result for s3d_dp: BenchmarkError
Running benchmark QTC
result for qtc: BenchmarkError
result for qtc_kernel: BenchmarkError

Yes, on the server side, they appear to be two different GPUs, that is correct.

I don’t think the BenchmarkError has anything to do with the use of K80 or dual GPU cards. You’ll need to investigate that separately. I believe SHOC generates more detailed logs that can be useful.

Where is that log?

Did you try the Logs directory? Look for files ending in .err in that directory.

My guess is you don’t have MPI properly configured.
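For anyone following along, here is a minimal sketch of one way to go through the Logs directory mentioned above. It assumes you run it from the directory where SHOC wrote its Logs folder; `show_err_logs` is a made-up helper name, not something SHOC provides.

```shell
#!/bin/sh
# Sketch: print every non-empty *.err file under a log directory,
# each followed by its last few lines.
# show_err_logs is a hypothetical helper name, not part of SHOC.
show_err_logs() {
    logdir="$1"
    # -size +0 skips empty files, which carry no error text.
    find "$logdir" -type f -name '*.err' -size +0 | sort |
    while IFS= read -r f; do
        printf '== %s ==\n' "$f"
        tail -n 5 "$f"
    done
}

# Only scan if a Logs directory exists in the current directory (assumption).
if [ -d ./Logs ]; then
    show_err_logs ./Logs
fi
```

Empty .err files are skipped, so whatever is printed should point at the benchmarks that actually failed.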

The logs say “./bin/EP/CUDA/BusSpeedDownload (No such file or directory)”. How can I create these files? Is there a different way to compile?

Actually, I have now found another log, and it says:

There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:

Either request fewer slots for your application, or make more slots available
for use.

How can I make the slots available?

Doubt this thread is still live, but for future reference…

  1. The error about slots comes from OpenMPI. Using SHOC with multiple cards (single node or multiple nodes) requires MPI, and apparently the SHOC driver found an mpirun associated with OpenMPI. Regardless of whether you build SHOC with MPI or not, if you target 2+ cards, it will call mpirun. I’d say that’s a design flaw: erroring out with a useful message when trying to run MPI-less SHOC on multiple cards would be preferable to the current state, IMO. Anyway, OpenMPI won’t run more than 1 rank per hostname provided, and by default it’s probably getting 1 hostname. 2 cards require 2 ranks, so it doesn’t have enough “slots.” You’d probably have to supply a hostfile with the node name repeated N times, one per line, once for each card to be targeted.

Note, that’s cards, not GPUs. If you try to run 2 GPUs on a single card, MPI-less SHOC does fine (at least in my experience as of yesterday).

  2. To use multiple cards, you need to build SHOC with the “--with-mpi” option. That triggers the …/EP/… build.
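The hostfile idea from the first point could be sketched roughly like this. It is only an illustration: `make_hostfile` and the file name `shoc_hostfile` are made up, and how the file then gets handed to mpirun through the SHOC driver is not covered here, so check that for your own setup.

```shell
#!/bin/sh
# Sketch: write a hostfile that lists this node once per MPI rank,
# i.e. once per card to be targeted.
# make_hostfile is a hypothetical helper name, not part of SHOC or OpenMPI.
make_hostfile() {
    ranks="$1"      # number of ranks wanted (one per card)
    outfile="$2"    # path of the hostfile to write
    node="$(uname -n)"
    : > "$outfile"  # create the file, or empty it if it already exists
    i=0
    while [ "$i" -lt "$ranks" ]; do
        printf '%s\n' "$node" >> "$outfile"
        i=$((i + 1))
    done
}

# Example: two cards, so two lines with the node name.
make_hostfile 2 ./shoc_hostfile
cat ./shoc_hostfile
```

OpenMPI hostfiles also accept a single line of the form `nodename slots=2`, which should have the same effect. When calling mpirun by hand, the file is passed with `--hostfile`.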