Very strange behaviour ... accessing pitch linear memory as uchar4


I’ve encountered some very strange behaviour since upgrading to CUDA 2.2 (running Windows XP, 32-bit, driver 185.85).

Usually, when accessing memory allocated via cuMemAllocPitch as if it were a 2D array, each row is pitch bytes in length, so memory must be accessed something along the lines of: T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column; (copy/pasted from the docs)

And this has worked for me 100% of the time previously, and still works when using primitive types (unsigned char, unsigned int, unsigned long long, etc). HOWEVER - since CUDA 2.2, when accessing memory as a non-primitive type (eg: T = uchar4), I have to access it as if it were NOT pitch linear: the correct row stride is the actual width of the row, not the pitch… (Similarly, 2D textures bound to pitch linear memory (of all formats) for some reason have a width equal to their pitch, not the width specified in the texture descriptor - but that’s another problem for another day…)

For those who don’t like reading what I said, and prefer code… [codebox]// Note: 'data' for both of these functions is the same memory pointer

__device__ void foo(uchar4 *data, unsigned int width_bytes, unsigned int pitch_bytes)
{
    uint2 tid = make_uint2(threadIdx.x, threadIdx.y);

    // Note: data is allocated as pitch linear memory - but accessing it
    // as such as a uchar4 does not work as intended? why?
    // Note: this used to work previously... not sure why it's changed.
    uchar4 works  = data[(tid.y * (width_bytes / 4)) + tid.x]; // should NOT work (but it does, and reads the correct pixel at [x,y])
    uchar4 breaks = data[(tid.y * (pitch_bytes / 4)) + tid.x]; // SHOULD work, reading the intended pixel at [x,y] (but it doesn't)
}

__device__ void foo(unsigned int *data, unsigned int width_bytes, unsigned int pitch_bytes)
{
    uint2 tid = make_uint2(threadIdx.x, threadIdx.y);

    // Note: data is allocated as pitch linear memory - and accessing it
    // as such as an unsigned int DOES work as intended… as expected!
    unsigned int breaks = data[(tid.y * (width_bytes / 4)) + tid.x]; // should NOT work (and doesn't, yay)
    unsigned int works  = data[(tid.y * (pitch_bytes / 4)) + tid.x]; // SHOULD work (and does, yay again)
}[/codebox]
I’m hoping this is a driver bug, otherwise my understanding of CUDA has flipped upside down since 2.2 :\

(Also note: uchar4 is 4 bytes, as is an unsigned int, when compiling with nvcc/msvc/gcc/whatever - so my hard-coded 4s are valid in this case.)

Edit: I’m now less convinced this has to do with accessing the data as uchar4, as I’ve since been able to replicate the same issue with unsigned int… Somewhere between allocating the memory (cuMemAllocPitch, which returns a pitch not equal to the width in bytes) and my kernels, the memory has somehow lost its pitch - if that’s even possible…

Aha, I had the same problem a while ago when I tried texturing from pitch linear memory so it would be cached… strange behaviour indeed.


Well, I’ve managed to confirm this isn’t specific to just this kernel - I get the same issue in ANY of my kernels that use cuMemAllocPitch. The only reason I didn’t notice sooner is that most of the sizes I allocate are already properly aligned (320×240 bytes, for example).

I’ll test downgrading my drivers and see if that fixes it, in which case it’s certainly a driver bug. (Hoping I won’t have to recompile my codebase to run a 2.2-compiled app on older drivers, even though I’m not using any 2.2 features.)

Will report back with more details - but I’m seriously surprised no one else has run into this issue (besides Nico :P). (I can’t imagine many cases where a CUDA developer WOULDN’T use cuMemAllocPitch…) - which makes me think it can’t be a bug.

Update: Well, unfortunately (but logically), it doesn’t appear to be a driver-related bug - so my only remaining conclusion is that I’m somehow corrupting the internal state of CUDA… (I downgraded to the CUDA 2.1 toolkit and quite old (pre-2.0?) drivers, 182.50 - and the problem remains.)

Somehow, between calling cuMemAllocPitch and launching a kernel that uses the returned pitch as described above (first post), my memory seems to behave as if it had been re-allocated as linear memory instead of pitch linear memory - and I don’t understand how that’s remotely possible.

Do you have a small piece of compilable code that illustrates the problem, maybe I can have another look at it.