weird CUDA performance regression with respect to streams

Hi there

We’re doing LTE (4G) radio simulations with MIMO in downlink and uplink, running on CUDA. For every millisecond of simulated time we run one downlink and one uplink computation (for an entire area of base stations and mobile users). Uplink and downlink use the same kernel calls but run on different data sets (in downlink the base stations transmit, in uplink the mobile devices transmit).

For performance reasons I overlap the computations using the CUDA streams API. There are 4 streams for downlink and 4 for uplink, so overall we get this repeating issue pattern (the stream IDs are made up). Every stream uses distinct buffers for input and output, of course, and all memcpy operations are asynchronous and use page-locked host buffers.

downlink stream #1
uplink stream #5
downlink stream #2
uplink stream #6
downlink stream #3
uplink stream #7
downlink stream #4
uplink stream #8
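
To make the issue order concrete, here is a minimal sketch of one round of that pattern. It is not our real code: the kernel `simulateLink`, the helper `streamForStep`, the buffer size `kN` and the function `runInterleavedRound` are all made up for illustration, and the stream indices are zero-based (0..3 downlink, 4..7 uplink).

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed sizes: 4 streams per direction as in the post; buffer length is arbitrary.
const int kStreamsPerDir = 4;
const int kNumStreams = 2 * kStreamsPerDir;
const int kN = 1 << 16;

// Sketch only: on error we return early without freeing resources.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical stand-in for the real MIMO kernels: doubles each sample.
__global__ void simulateLink(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Zero-based stream index for issue step `step`: even steps are downlink
// (streams 0..3), odd steps are uplink (streams 4..7).
int streamForStep(int step) {
    assert(step >= 0);
    int slot = (step / 2) % kStreamsPerDir;
    return (step % 2 == 0) ? slot : kStreamsPerDir + slot;
}

// Issues one full round (8 steps) and returns true if all results check out.
bool runInterleavedRound() {
    cudaStream_t streams[kNumStreams];
    float *hIn[kNumStreams], *hOut[kNumStreams];
    float *dIn[kNumStreams], *dOut[kNumStreams];
    const size_t bytes = kN * sizeof(float);

    for (int s = 0; s < kNumStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMallocHost((void**)&hIn[s], bytes));   // page-locked host input
        CHECK(cudaMallocHost((void**)&hOut[s], bytes));  // page-locked host output
        CHECK(cudaMalloc((void**)&dIn[s], bytes));
        CHECK(cudaMalloc((void**)&dOut[s], bytes));
        for (int i = 0; i < kN; ++i) hIn[s][i] = float(s + 1);
    }

    // Interleaved issue order: downlink, uplink, downlink, uplink, ...
    for (int step = 0; step < kNumStreams; ++step) {
        int s = streamForStep(step);
        CHECK(cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, streams[s]));
        simulateLink<<<(kN + 255) / 256, 256, 0, streams[s]>>>(dIn[s], dOut[s], kN);
        CHECK(cudaMemcpyAsync(hOut[s], dOut[s], bytes, cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int s = 0; s < kNumStreams; ++s) {
        ok = ok && hOut[s][0] == 2.0f * (s + 1) && hOut[s][kN - 1] == 2.0f * (s + 1);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hOut[s]);
        cudaFree(dIn[s]);
        cudaFree(dOut[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

Calling `runInterleavedRound()` from `main` issues exactly the eight steps listed above. Our real code keeps the streams and buffers alive across simulated milliseconds instead of recreating them each round.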

With driver 304.88 the performance is good. We’re currently using CUDA 5.0 on Kepler-type devices (GTX 660 Ti, GT 650M, GT 750M), running Ubuntu Linux (32-bit and 64-bit variants, anything from Ubuntu 9.04 to 12.04).

With drivers 319.23 and 325.15 performance is ridiculously low, about 10% of what we get with the 304.88 driver when running the same binary code. Only when we comment out the kernel launches do we get good performance again; stripping the kernels down to a NO-OP is not enough. We measured the time from issuing the asynchronous commands on each stream until we get the completion callback, and these timings are ridiculously high with the later drivers. Strangely, when we run only downlink or only uplink, the performance is good again. But when we interleave downlink and uplink, we’re at a snail’s pace with the new drivers. I’ve already tried reducing the total number of streams from 8 to 4, but there was no improvement.
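
For anyone who wants to compare numbers, this is roughly how such a per-stream timing can be taken. It is a sketch under assumptions, not our actual measurement code: `StreamTiming`, `nowMs`, `onStreamDone`, `noopKernel` and `timeOneBatch` are hypothetical names, the host clock is `clock_gettime` (Linux, may need `-lrt` on older glibc), and the caller is assumed to pass page-locked host buffers and a device buffer of at least `bytes` bytes.

```cuda
#include <cassert>
#include <cstdio>
#include <time.h>
#include <cuda_runtime.h>

// Host timestamps for one batch of work on one stream (hypothetical helper type).
struct StreamTiming {
    double issuedMs;
    double completedMs;
};

// Monotonic host clock in milliseconds.
static double nowMs() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

// Completion callback: runs on a driver thread once all preceding work in
// the stream has finished. It must not call the CUDA API itself.
static void CUDART_CB onStreamDone(cudaStream_t, cudaError_t, void* userData) {
    static_cast<StreamTiming*>(userData)->completedMs = nowMs();
}

// Stand-in for a kernel that has been stripped down to a NO-OP.
__global__ void noopKernel() {}

// Issues upload, kernel and download on `stream` and returns the elapsed
// host time in ms between the first issue and the completion callback,
// or a negative value if a CUDA call fails.
double timeOneBatch(cudaStream_t stream, const float* hIn, float* dBuf,
                    float* hOut, size_t bytes) {
    assert(bytes > 0);
    StreamTiming t = { nowMs(), 0.0 };
    if (cudaMemcpyAsync(dBuf, hIn, bytes, cudaMemcpyHostToDevice, stream) != cudaSuccess)
        return -1.0;
    noopKernel<<<1, 1, 0, stream>>>();
    if (cudaMemcpyAsync(hOut, dBuf, bytes, cudaMemcpyDeviceToHost, stream) != cudaSuccess)
        return -1.0;
    if (cudaStreamAddCallback(stream, onStreamDone, &t, 0) != cudaSuccess)
        return -1.0;
    // Waits for everything queued in the stream, including the callback,
    // so `t` (which lives on this stack frame) is safe to read afterwards.
    if (cudaStreamSynchronize(stream) != cudaSuccess)
        return -1.0;
    return t.completedMs - t.issuedMs;
}
```

The synchronize at the end is only there to keep the sketch self-contained; in a real pipeline you would keep the timing struct alive elsewhere and let the callback fire without blocking the issuing thread.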

Dear NVIDIA, you seem to have a performance regression in your streams API. Condensing this into a simple repro case for you will take some work, because our MIMO simulation code has grown quite complicated already.


Has anyone else seen some kind of performance degradation with newer drivers?

I’m currently trying the 331.13 beta driver for Linux, and the speed is good. So the problem has likely been fixed already, and we don’t have to spend days producing a simple repro case. Phew ;)