Tesla Compute Cluster driver

I have some questions about the Tesla compute cluster driver:

  1. “Reducing kernel launch overhead” - how much does this help? Does the overhead have anything to do with the ~10us I found here?

  2. The notes say you have to use a non-NVIDIA display driver if you want a display, but why? I know Windows Display Driver Model 1.0 (pre-Windows 7) only supports one driver, but WDDM 1.1 supports more than one. This would be a major inconvenience, because I currently use a Quadro 290 for the display.

  1. If you don’t have strict latency requirements, you might not notice a huge change (in large part because we batch when possible on WDDM, which amortizes a lot of the cost of submitting GPU work). However, for iterative algorithms with relatively short kernel invocations, this can make a major performance difference.

  2. Basically there are different components for TCC and WDDM drivers, and Windows gets unhappy when you have two drivers with the same component names. I improved this significantly in CUDA 3.2, so the whole inconvenience thing is fixed.
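
One way to see how much launch overhead matters on a given driver is to time a large number of empty kernel launches. This is only a rough sketch, not an official benchmark: `empty_kernel` and `average_launch_microseconds` are made-up names, and synchronizing after every launch deliberately defeats the batching mentioned in point 1, so the result is closer to a worst case.

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical kernel that does no work, so the measured time is
// dominated by launch and synchronization cost.
__global__ void empty_kernel() {}

// Returns the average wall-clock time in microseconds for one
// launch-plus-synchronize round trip, or a negative value on error.
double average_launch_microseconds(int launches)
{
    // Warm-up launch so one-time context setup is not included in the timing.
    empty_kernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();
        // Wait for completion each time so launches cannot be batched.
        if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}
```

Running the same function under WDDM and under TCC on the same card would give a direct comparison for your own setup.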

Good, that’s what I need for my median / SelectNth code with lots of global synchronization (kernel launches)

Yeah if you’re doing a loop of kernel → memcpy → check to see if a convergence condition is met → repeat, TCC is going to kill WDDM in terms of performance here.
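
For reference, the kernel → memcpy → check → repeat pattern looks roughly like the sketch below. It is a toy stand-in, not the median / SelectNth code from this thread: `halve_step`, `iterate_until_converged`, and the convergence rule (every element at or below `tol`) are all assumptions made for illustration. The point is that each iteration pays for one kernel launch plus a blocking device-to-host copy, which is where per-submission latency adds up.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: halves each element and raises a flag if any
// element is still above the tolerance after the update.
__global__ void halve_step(float* data, int n, float tol, int* not_converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        // Every thread that writes stores the same value, so the race is harmless.
        if (data[i] > tol) *not_converged = 1;
    }
}

// Repeats the kernel until the flag stays clear or max_iters is reached.
// Returns the number of kernel launches, or -1 on a CUDA error.
int iterate_until_converged(float* d_data, int n, float tol, int max_iters)
{
    int* d_flag = nullptr;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return -1;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int iters = 0;
    int h_flag = 1;

    while (h_flag != 0 && iters < max_iters) {
        // Clear the flag on the device before each step.
        h_flag = 0;
        if (cudaMemcpy(d_flag, &h_flag, sizeof(int),
                       cudaMemcpyHostToDevice) != cudaSuccess) {
            cudaFree(d_flag);
            return -1;
        }

        halve_step<<<blocks, threads>>>(d_data, n, tol, d_flag);

        // Blocking copy: waits for the kernel, then reads the flag back.
        if (cudaMemcpy(&h_flag, d_flag, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            cudaFree(d_flag);
            return -1;
        }
        ++iters;
    }

    cudaFree(d_flag);
    return iters;
}
```

With short kernels like this one, the loop spends most of its time in launch and copy latency rather than in computation, which is why the choice of driver model shows up so strongly.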