Edit: TL;DR: CUDA bug. See details in the link below.
I’ve reduced this issue to the point where nearly all the context provided above is unnecessary. I would like to close this topic, but it doesn’t appear that I have permission, so I’m going to open a new one based on the minified repro.