Performance tuning on multi-GPU with CUDA CUT_THREADPROC

Hi, everybody:

   We're working on an 8-GPU workstation to accelerate some applications. However, it looks like scalability within a single server is still an issue.

   Our application finishes in 2 seconds on 1 Tesla C1060 card, which is a nice speedup. Since our dataset can be divided entirely into unrelated small tiles, we scaled to 8 cards. However, with 8 cards some other overhead grows a lot: we only get down to about 0.8 s instead of the ideal 0.25 s, so the efficiency is pretty low.

  We tried both CUT_THREADPROC and plain pthreads (one host thread per GPU), and both showed similar performance. Is there any advice on performance tuning, i.e. on tracking down and cutting this hidden overhead?
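For reference, the per-GPU worker we use looks roughly like this (simplified; `tile_kernel` and the tile layout are placeholders for our real code):

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

/* Placeholder standing in for our real per-tile kernel. */
__global__ void tile_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

struct worker_arg {
    int    device;   /* GPU ordinal this thread owns       */
    float *tile;     /* host pointer to this thread's tile */
    int    n;        /* elements in the tile               */
};

static void *worker(void *p)
{
    struct worker_arg *a = (struct worker_arg *)p;
    size_t bytes = a->n * sizeof(float);
    float *d_tile;

    /* Bind this host thread to its own GPU; the CUDA context is
     * created here, and every later call in this thread uses it. */
    cudaSetDevice(a->device);

    cudaMalloc((void **)&d_tile, bytes);
    cudaMemcpy(d_tile, a->tile, bytes, cudaMemcpyHostToDevice);
    tile_kernel<<<(a->n + 255) / 256, 256>>>(d_tile, a->n);
    cudaMemcpy(a->tile, d_tile, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_tile);
    return NULL;
}
```

The main thread just pthread_create()s one worker per card and joins them; there is no communication between tiles.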

   I guess something in the driver still works sequentially, like data transfer?

 That's a pretty bothersome situation. Calling for help.  :wacko: 

Thanks!

Senosy

Hi,

It's hard to tell without the real code, but I'd guess one of the most limiting factors is PCIe overhead.

If you move a lot of data back and forth between the CPU and the GPUs, and some of the cards share the same PCIe lanes, your performance will be lousy.

Is that the case in your application? You should time exactly what takes the most time with the 8 cards (and check with your IT person how those cards are physically configured), and then decide what to do next.
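For example, you can bracket each stage with CUDA events inside every worker thread to see whether the copies or the kernel dominate (sketch; `kernel`, the buffers, and `dev` are placeholders for your own code):

```cuda
cudaEvent_t e0, e1, e2, e3;
float ms_h2d, ms_kernel, ms_d2h;

cudaEventCreate(&e0); cudaEventCreate(&e1);
cudaEventCreate(&e2); cudaEventCreate(&e3);

cudaEventRecord(e0, 0);
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(e1, 0);
kernel<<<grid, block>>>(d_buf, n);              /* placeholder */
cudaEventRecord(e2, 0);
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(e3, 0);
cudaEventSynchronize(e3);

cudaEventElapsedTime(&ms_h2d,    e0, e1);
cudaEventElapsedTime(&ms_kernel, e1, e2);
cudaEventElapsedTime(&ms_d2h,    e2, e3);
printf("GPU %d: H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n",
       dev, ms_h2d, ms_kernel, ms_d2h);
```

If the copy times blow up as you go from 1 card to 8, suspect the PCIe topology (or the driver's staging of pageable memory); pinned host memory via cudaMallocHost() usually improves transfer rates.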

eyal
