Weird observation with CUDPP 2.1 - JIT compiler madness


We’re running a cudppCompact operation on an array of 65536 uint values using the CUDPP 2.1 library, built for Compute 2.0, 3.0, and 3.5 all in one binary. This runs smoothly on all of those device categories.

Weirdly enough, when running this on a Compute 5.0 device such as a GTX 750 Ti, just-in-time compilation takes minutes - and on Windows it even crashes (in all likelihood due to a stack overrun). In Debug builds I get a message that PTXAS has run out of memory before the crash; in Release builds it just crashes.

What’s with the JIT compilation for Maxwell taking so immensely long?


Presumably the JITing is taking a long time because the code is really large. This could be due to heavy use of templates; however, I am not familiar with CUDPP internals. The fact that the JIT compiler runs out of memory would also jibe with a working hypothesis of very large code size.
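One way to confirm this is to inspect what the fat binary actually contains: cuobjdump can list both the embedded machine code (SASS) and the embedded PTX that the driver must JIT-compile when no matching SASS is found. The commands below are real cuobjdump options; `myapp` is a placeholder for your executable or the CUDPP library file.

```shell
# List the SASS (machine code) images embedded in the binary.
# If no sm_50 entry appears here, Maxwell devices must fall back to JIT.
cuobjdump --list-elf myapp

# List the embedded PTX images - this is what the driver JIT-compiles
# on architectures without precompiled SASS (e.g. sm_50 here).
cuobjdump --list-ptx myapp
```

A very large PTX image in the second listing would support the code-size hypothesis.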

It sounds like the application was built as a fat binary that contains machine code for the sm_20, sm_30, and sm_35 architectures, so no JITing is necessary on those platforms. The best course of action would be to add sm_50 to the architecture targets of the fat binary build to avoid the need to JIT on Maxwell GPUs.
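As a sketch, the nvcc invocation for such a fat binary might look like the following, with an sm_50 target added; building for sm_50 requires CUDA 6.0 or later, and the file name is illustrative - adapt the flags to your actual build system.

```shell
# Fat binary with SASS for sm_20, sm_30, sm_35, and sm_50,
# plus compute_50 PTX for forward compatibility with future GPUs.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_50,code=compute_50 \
     -o myapp myapp.cu
```

With sm_50 SASS embedded, the driver loads precompiled code on the GTX 750 Ti and the JIT step (and its long compile time) disappears entirely.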

You may want to consider filing a bug regarding the compiler behavior on Windows. Orderly abnormal termination with an error message seems an appropriate response when the JIT compiler runs out of memory, but an outright crash should not happen.

It seems the CUDPP team is considering adding official support for CUDA 6.0 and the sm_50 build target.