I’m trying to write a small CUDA sample that demonstrates the benefit of Hyper-Q when several MPI processes share one GPU. My case is really basic: each MPI process launches a single kernel on my Tesla K20. The kernel does not use the full GPU (occupancy is around 6%), so in theory several kernels should run concurrently. It looks simple, but after many attempts I still cannot get the expected behavior: the kernels are always executed serially…
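A minimal sketch of the test case described above could look like the following. It is an illustration under assumptions, not the original code: the kernel name `busyKernel`, the launch configuration (2 blocks of 64 threads), the iteration count, and the use of device 0 are all placeholders, chosen only to give a long-running kernel that occupies a small part of the GPU.

```cuda
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread runs a long arithmetic loop, so the
// kernel lasts long enough to be timed while using very few blocks.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    for (int k = 0; k < iters; ++k)
        x = x * 1.0000001f + 0.5f;
    out[i] = x;  // store the result so the loop is not optimized away
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Assumption: every rank shares the same GPU (device 0).
    cudaSetDevice(0);

    // Assumption: a deliberately small launch configuration.
    const int blocks = 2, threads = 64;
    float *d_out = NULL;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    // Line the ranks up so they all submit their kernel at about the same time.
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    busyKernel<<<blocks, threads>>>(d_out, 1 << 22);
    cudaError_t err = cudaDeviceSynchronize();

    double t1 = MPI_Wtime();
    printf("rank %d: start %.6f s, end %.6f s, status: %s\n",
           rank, t0, t1, cudaGetErrorString(err));

    cudaFree(d_out);
    MPI_Finalize();
    return err == cudaSuccess ? 0 : 1;
}
```

Running it with several ranks (for example `mpirun -np 4` followed by the executable) and comparing the printed intervals, or looking at a profiler timeline, shows whether the kernels overlap or run one after another. Note that `MPI_Wtime` values are only comparable across ranks that share a clock, which is normally the case for ranks on one node.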
- Maybe (or surely :)) I’m forgetting something in my implementation… Is there a special trick to enable Hyper-Q on the GK110 architecture?
- Does anyone have a simple sample that shows how to use the Hyper-Q feature with MPI?
My configuration:
- Ubuntu 12.04
- Tesla K20
- Latest CUDA driver & toolkit
- Open MPI 1.4.3
Thanks for your help!