We’re running a cudppCompact operation on an array of 65536 uint values using the CUDPP 2.1 library, built for compute capability 2.0, 3.0 and 3.5 all in one binary. This runs smoothly on all of these device categories.
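For anyone not familiar with the operation: compaction keeps only those input elements whose validity flag is set, preserving their order. Below is a minimal CPU sketch of that behaviour, handy for checking GPU results. It is not CUDPP's implementation, and `compact_reference` is a hypothetical name made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not part of CUDPP: copies input[i] to the
// output whenever valid[i] is nonzero, preserving the original order.
std::vector<std::uint32_t> compact_reference(
    const std::vector<std::uint32_t>& input,
    const std::vector<std::uint32_t>& valid)
{
    // Assumption: one flag per input element.
    assert(input.size() == valid.size());

    std::vector<std::uint32_t> output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (valid[i] != 0) {
            output.push_back(input[i]);
        }
    }
    return output;
}
```

The size of the returned vector corresponds to the number of valid elements that the GPU version would report.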
Weirdly enough, when running this on a compute capability 5.0 device such as a GTX 750 Ti, it takes minutes to just-in-time compile the code, and on Windows it even crashes (in all likelihood due to a stack overrun). In Debug builds I get a message that PTXAS has run out of memory before it crashes; in Release builds it just crashes.
What’s with the JIT compilation for Maxwell taking so immensely long?