Hello, we’ve been experiencing consistently reproducible hardware lock-ups when running thrust::sort_by_key with a custom comparator function on Titan X GPUs (Maxwell architecture, compute capability 5.2). If the GPU is attached to a display, the driver resets after an Xid 8; otherwise the GPU just stalls. The problem only manifests when sorting arrays with millions of elements, multiple keys and a custom comparator function. All our Titan X GPUs seem to be affected, and in no case is there any sign of thermal issues. Together with the Xid value of 8, this suggests that the problem is either in the driver or in the (Thrust) code. I’m not sure where the bug resides, so I’m reporting this on “both sides”.
Sample code to reproduce the issue is non-trivial, so I’ve created the GitHub project https://github.com/Oblomov/titanxstall to host a sample program (Linux only, but if anyone wishes to adapt it to run on other platforms, please do). The program simply sorts the arrays and scrambles them again, repeating forever (or until the GPU locks up). With the default settings, the device usually locks up in fewer than a hundred thousand iterations (less than 10 minutes), but sometimes it locks up after as few as 2K iterations. Other architectures (Fermi, Kepler) do not seem to be affected. The problem also manifests with as few as 1024×1024 elements, although it takes longer runs to trigger.
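For readers who don’t want to open the repository, here is a rough sketch of the kind of call pattern involved: two key arrays zipped together, a custom lexicographic comparator, and a payload array reordered by thrust::sort_by_key, followed by a scramble step. This is not the code from the repository; the names Arrays, KeyPairLess, Scramble, scramble, sort_once and is_sorted_by_keys, the key types and the hash used for scrambling are all made up for illustration.

```cuda
#include <cassert>
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

// Hypothetical key layout: (primary, secondary), both unsigned.
typedef thrust::tuple<unsigned int, unsigned int> KeyPair;

// Custom comparator: lexicographic order on the two zipped keys.
struct KeyPairLess {
    __host__ __device__
    bool operator()(const KeyPair &a, const KeyPair &b) const {
        const unsigned int a0 = thrust::get<0>(a), b0 = thrust::get<0>(b);
        if (a0 != b0) return a0 < b0;
        return thrust::get<1>(a) < thrust::get<1>(b);
    }
};

// Cheap integer hash, used only to put the keys in a pseudo-random order.
struct Scramble {
    unsigned int seed, modulus;
    __host__ __device__
    Scramble(unsigned int s, unsigned int m) : seed(s), modulus(m) {}
    __host__ __device__
    unsigned int operator()(unsigned int i) const {
        unsigned int x = i ^ seed;
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        x ^= x >> 16;
        return x % modulus;
    }
};

// Illustrative container: two key arrays plus one payload array.
struct Arrays {
    thrust::device_vector<unsigned int> primary, secondary, payload;
    explicit Arrays(std::size_t n) : primary(n), secondary(n), payload(n) {}
};

// Refill the keys with pseudo-random values and reset the payload to 0..n-1.
void scramble(Arrays &a, unsigned int iteration) {
    const unsigned int n = static_cast<unsigned int>(a.primary.size());
    thrust::counting_iterator<unsigned int> first(0), last(n);
    // Few distinct primary keys, so the secondary key decides many comparisons.
    thrust::transform(first, last, a.primary.begin(), Scramble(iteration, 1024u));
    thrust::transform(first, last, a.secondary.begin(),
                      Scramble(~iteration, n));
    thrust::sequence(a.payload.begin(), a.payload.end());
}

// One sort: the zipped keys are sorted in place, the payload follows them.
void sort_once(Arrays &a) {
    assert(a.primary.size() == a.secondary.size());
    assert(a.primary.size() == a.payload.size());
    thrust::sort_by_key(
        thrust::make_zip_iterator(
            thrust::make_tuple(a.primary.begin(), a.secondary.begin())),
        thrust::make_zip_iterator(
            thrust::make_tuple(a.primary.end(), a.secondary.end())),
        a.payload.begin(), KeyPairLess());
}

// Check that the zipped keys are ordered according to KeyPairLess.
bool is_sorted_by_keys(const Arrays &a) {
    return thrust::is_sorted(
        thrust::make_zip_iterator(
            thrust::make_tuple(a.primary.begin(), a.secondary.begin())),
        thrust::make_zip_iterator(
            thrust::make_tuple(a.primary.end(), a.secondary.end())),
        KeyPairLess());
}
```

A driver for this sketch would construct an Arrays of the desired size and then call scramble(a, i) followed by sort_once(a) for i = 0, 1, 2, … until the device stops responding. Whether this reduced version triggers the lock-up as readily as the full program is something I have not established.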
The test program can optionally be run with a custom caching allocator (based on the one in the Thrust examples), to verify that the problem manifests even without the continuous allocation/deallocation done by thrust::sort_by_key.