I have a CUDA kernel that reads data from an unsigned char device buffer in order to fill shared memory. This is an image processing app, and the width and height of the images, as well as the memory pitch (or linesize) of the buffers, change with the input image size. For certain image sizes, the computer crashes when this part of the code executes.
I know the exact size of the device buffers, since I allocated them myself, and I went so far as to add range-checking code on every access to the device buffer, to prevent any access from going out of range.
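For reference, the range check is along these lines. This is a minimal sketch rather than the actual code: the names `loadTile`, `inRange`, `TILE` and `bufBytes`, the 16x16 tile, and the output buffer sharing the input's pitch and size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

constexpr int TILE = 16;  // hypothetical tile edge; launch with a TILE x TILE block

// True only if pixel (x, y) lies inside the image AND its byte offset
// lies inside an allocation of bufBytes bytes.
__host__ __device__ inline bool inRange(int x, int y, int width, int height,
                                        size_t pitch, size_t bufBytes)
{
    if (x < 0 || y < 0 || x >= width || y >= height) return false;
    return static_cast<size_t>(y) * pitch + static_cast<size_t>(x) < bufBytes;
}

__global__ void loadTile(const unsigned char* src, unsigned char* dst,
                         int width, int height, size_t pitch, size_t bufBytes)
{
    __shared__ unsigned char tile[TILE][TILE];

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    const bool ok = inRange(x, y, width, height, pitch, bufBytes);

    // Out-of-range threads store a placeholder instead of touching global memory.
    tile[threadIdx.y][threadIdx.x] =
        ok ? src[static_cast<size_t>(y) * pitch + x] : 0;

    // Every thread reaches the barrier, including the out-of-range ones.
    __syncthreads();

    // Stand-in for the real processing: copy the tile back out.
    // Assumes dst has the same pitch and size as src.
    if (ok) dst[static_cast<size_t>(y) * pitch + x] = tile[threadIdx.y][threadIdx.x];
}
```

The check tests both the image bounds and the byte offset against the allocation size, so a pitch that is smaller than expected would also be caught.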
In spite of these precautions, it still crashes for certain height, width and pitch combinations.
I read that data is fetched in blocks of 32 words. If I am fetching unsigned chars, does that mean 32 chars, or 128 chars (i.e. the equivalent of 32 integers)?
Is there any way that these coalesced reads are causing the kernel to attempt to access some part of the input buffer that is actually out of range, causing the crash?
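To make the arithmetic behind these two questions concrete, here is a small sketch. It assumes, without my having confirmed it, that the hardware rounds a warp's request outward to aligned segments of some fixed size. `Span`, `touchedSpan` and the `segBytes` parameter are hypothetical names, and the function only does the rounding arithmetic; it does not query the GPU.

```cuda
#include <cstddef>

// Inclusive range of byte offsets, relative to the start of the buffer.
struct Span {
    size_t first;
    size_t last;
};

// Bytes covered when `threads` threads each read `elemBytes` contiguous bytes
// beginning at byte offset `start`, with the request rounded outward to
// segBytes-aligned segments (the segment size is an assumption, not a
// measured value).
__host__ __device__ inline Span touchedSpan(size_t start, size_t threads,
                                            size_t elemBytes, size_t segBytes)
{
    const size_t end = start + threads * elemBytes;  // one past the last requested byte
    Span s;
    s.first = (start / segBytes) * segBytes;                       // round down
    s.last  = ((end + segBytes - 1) / segBytes) * segBytes - 1;    // round up, inclusive
    return s;
}
```

With a start offset of 1000, 32 threads and 1-byte elements, `touchedSpan(1000, 32, 1, 32)` gives offsets 992 through 1055, and `touchedSpan(1000, 32, 1, 128)` gives 896 through 1151. If that kind of rounding really happens, what I want to know is whether the rounded-out bytes beyond the end of my allocation could cause a fault.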