I believe this is a bug in the NVIDIA driver that shows up when multiple threads/streams are in use, so I'd suggest you don't waste too much time trying to debug it. Disabling the L1 cache helped in some cases, but I've reproduced the problem with the cache both enabled and disabled. It appears to be prominent when host-mapped memory gets loaded directly into local memory.
I submitted a bug report a month ago - bug id 712753.
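
For anyone who wants to check their own setup, here is a rough sketch of the access pattern described above: several streams, each running a kernel that reads mapped (zero-copy) host memory straight into a per-thread array. It is not the code from the bug report and not a confirmed reproducer. The kernel name, sizes and the `CHECK` macro are made up for illustration, and whether the per-thread array actually ends up in local memory depends on the compiler (the dynamic index is only there to make that more likely).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N_STREAMS  4
#define CHUNK      1024   // ints handled per stream (illustrative size)
#define PER_THREAD 8      // ints each thread copies into its own array

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Each thread loads PER_THREAD values from mapped host memory into a
// per-thread array, then writes their sum to device memory.
__global__ void sumFromMapped(const int *mapped, int *out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * PER_THREAD;
    if (base + PER_THREAD > n) return;

    int local[PER_THREAD];
    for (int i = 0; i < PER_THREAD; ++i)
        local[i] = mapped[base + i];

    int sum = 0;
    for (int i = 0; i < PER_THREAD; ++i)
        sum += local[(i + tid) % PER_THREAD];  // dynamic index; may keep the array in local memory
    out[tid] = sum;
}

int main()
{
    const int total            = N_STREAMS * CHUNK;
    const int threadsPerStream = CHUNK / PER_THREAD;
    const int nOut             = N_STREAMS * threadsPerStream;

    // Must be set before the context is created for mapped memory to work.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    int *hostBuf = NULL, *devView = NULL, *devOut = NULL;
    CHECK(cudaHostAlloc((void **)&hostBuf, total * sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&devView, hostBuf, 0));
    CHECK(cudaMalloc((void **)&devOut, nOut * sizeof(int)));

    for (int i = 0; i < total; ++i) hostBuf[i] = i % 13;

    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // One kernel per stream, each working on its own slice of the mapped buffer.
    for (int s = 0; s < N_STREAMS; ++s)
        sumFromMapped<<<1, threadsPerStream, 0, streams[s]>>>(
            devView + s * CHUNK, devOut + s * threadsPerStream, CHUNK);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    int *result = (int *)malloc(nOut * sizeof(int));
    CHECK(cudaMemcpy(result, devOut, nOut * sizeof(int), cudaMemcpyDeviceToHost));

    // Compare against sums computed on the host.
    int mismatches = 0;
    for (int t = 0; t < nOut; ++t) {
        int expected = 0;
        for (int i = 0; i < PER_THREAD; ++i) expected += hostBuf[t * PER_THREAD + i];
        if (result[t] != expected) ++mismatches;
    }
    printf("mismatches: %d of %d\n", mismatches, nOut);

    for (int s = 0; s < N_STREAMS; ++s) CHECK(cudaStreamDestroy(streams[s]));
    free(result);
    CHECK(cudaFree(devOut));
    CHECK(cudaFreeHost(hostBuf));
    return mismatches != 0;
}
```

To compare behaviour with L1 caching of global loads turned off, the same file can be built with `nvcc -Xptxas -dlcm=cg`, which asks ptxas to cache global loads in L2 only. That flag only makes a difference on architectures where global loads are cached in L1 by default.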