Why does the first invocation of a complex Thrust function take so much longer than later ones?

It turns out the culprit is thrust::max_element, which appears to allocate memory internally and is therefore very slow on its first invocation. A custom kernel replacement has been posted on Stack Overflow to address this problem.
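For illustration, here is a minimal sketch of what a hand-written replacement for thrust::max_element could look like. This is not the kernel from the Stack Overflow post, just one plausible shape for it: a single block does a shared-memory reduction and writes the index of the first largest element. The names argmax_kernel, argmax_on_device and kThreads are made up for this example, and it assumes float input and a power-of-two block size.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Hypothetical replacement for thrust::max_element: one block scans the input
// and writes the index of the first largest element to *out_idx (-1 if n == 0).
// It uses only static shared memory, so the kernel itself allocates nothing.
// Must be launched with exactly one block of kThreads threads.
__global__ void argmax_kernel(const float* data, int n, int* out_idx)
{
    __shared__ float s_val[kThreads];
    __shared__ int   s_idx[kThreads];

    // Each thread reduces a strided slice of the input.
    float best = -FLT_MAX;
    int best_i = -1;  // -1 marks "no element seen yet"
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        if (best_i < 0 || data[i] > best) {
            best = data[i];
            best_i = i;
        }
    }
    s_val[threadIdx.x] = best;
    s_idx[threadIdx.x] = best_i;
    __syncthreads();

    // Tree reduction in shared memory; ties go to the lower index.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            float v     = s_val[threadIdx.x + stride];
            int   j     = s_idx[threadIdx.x + stride];
            float cur_v = s_val[threadIdx.x];
            int   cur_j = s_idx[threadIdx.x];
            bool take = j >= 0 &&
                        (cur_j < 0 || v > cur_v || (v == cur_v && j < cur_j));
            if (take) {
                s_val[threadIdx.x] = v;
                s_idx[threadIdx.x] = j;
            }
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *out_idx = s_idx[0];
}

// Host-side convenience wrapper, only here to make the sketch easy to try out.
// Error checking is omitted for brevity.
int argmax_on_device(const std::vector<float>& h)
{
    float* d_data = nullptr;
    int*   d_idx  = nullptr;
    int    result = -1;
    cudaMalloc(&d_data, h.size() * sizeof(float));
    cudaMalloc(&d_idx, sizeof(int));
    cudaMemcpy(d_data, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    argmax_kernel<<<1, kThreads>>>(d_data, static_cast<int>(h.size()), d_idx);
    cudaMemcpy(&result, d_idx, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_idx);
    return result;
}
```

In a CDP setting the kernel would presumably be launched from the parent kernel with buffers allocated up front, so that no allocation happens on the timed path. A single block is enough for modest input sizes; very large inputs would want a multi-block reduction instead.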

We have found that increasing the CUDA stack size cuts the initial run time of our complete CDP (CUDA Dynamic Parallelism) kernel by 90% (300 ms → 30 ms), but the first run is still much slower than subsequent invocations.
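For anyone wanting to try the same thing, the stack size is set through cudaDeviceSetLimit with cudaLimitStackSize. The sketch below shows the call together with a simple event-based timer for comparing the first launch against a later one. The helper names, the placeholder kernel and the 4096-byte value are assumptions for illustration only; the post above does not say which stack size was used.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: request a per-thread stack of `bytes` and return the
// limit the runtime reports afterwards (0 on error).
size_t set_stack_size(size_t bytes)
{
    if (cudaDeviceSetLimit(cudaLimitStackSize, bytes) != cudaSuccess) return 0;
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual, cudaLimitStackSize) != cudaSuccess) return 0;
    return actual;
}

// Placeholder standing in for the real CDP kernel, which is not shown here.
__global__ void placeholder_kernel(int* out) { *out = 42; }

// Time a single launch of the placeholder kernel in milliseconds.
float time_launch_ms(int* d_out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    placeholder_kernel<<<1, 1>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Call set_stack_size once before the first launch (for example set_stack_size(4096)), then call time_launch_ms twice and compare the two results to see how much of the first-run penalty remains.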