cudaMemset3D bug?

I’ve got a large 3D array of frame buffers. There are 10 * m->numtriangles buffers (5 frame buffers and 5 depth buffers per triangle). The elements of the buffers are unsigned shorts. I allocate them like this:

[codebox]
cudaPitchedPtr deviceBuffer;
cudaExtent bufferExtent = make_cudaExtent(sizeof(unsigned short) * FRAMEBUFFER_WIDTH,
                                          FRAMEBUFFER_HEIGHT, m->numtriangles * 10);

if ((ret = cudaMalloc3D(&deviceBuffer, bufferExtent)) != cudaSuccess) {
    printf("Couldn't allocate for buffer in device!\nMemory error: %s\n", cudaGetErrorString(ret));
    quit(1);
}
[/codebox]

FRAMEBUFFER_WIDTH and FRAMEBUFFER_HEIGHT are both 50. This means the logical width is 100 bytes (2 bytes per short * 50) and the pitch returned is 128 (28 unused bytes per row).
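For clarity, here’s a minimal sketch of how I understand the pitched layout to be addressed, using the names from the code above (this is the standard pattern from the programming guide, not code from my app):

[codebox]
// Sketch: addressing the linear pitched allocation returned by cudaMalloc3D.
// Each row occupies 'pitch' bytes; each slice is FRAMEBUFFER_HEIGHT rows.
// (This addressing would be used inside a kernel, or on a host copy of the buffer.)
char*  base       = (char*)deviceBuffer.ptr;
size_t pitch      = deviceBuffer.pitch;            // 128 bytes here
size_t slicePitch = pitch * FRAMEBUFFER_HEIGHT;    // bytes per slice

size_t x = 0, y = 0, z = 0;                        // example indices
unsigned short* row = (unsigned short*)(base + z * slicePitch + y * pitch);
unsigned short  val = row[x];                      // element (x, y) of slice z
[/codebox]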

Now I need to clear them all to the 0xFFFF unsigned short value (equivalently, every byte will need to be set to 0xFF):

[codebox]
if ((ret = cudaMemset3D(deviceBuffer, 0xFF, bufferExtent)) != cudaSuccess) {
    printf("Couldn't set buffer in device!\nMemory error: %s\n", cudaGetErrorString(ret));
    quit(1);
}
[/codebox]

When I do this, a portion of the buffer memory is not cleared! The last few buffers are not overwritten with 0xFF. I tried setting the extent so that

[codebox]
bufferExtent.depth = 1;
[/codebox]

in order to try clearing just one slice. Here are the results (the white is the previous buffer contents, the gray is the value I’m passing to cudaMemset3D (changed from white so it’s easier to see), and the first slice is the leftmost square).

(sorry the pictures are kind of small)

Notice that the entire buffer is not overwritten. By measuring the white and gray areas, roughly a fifth of the buffer is not overwritten. Sound familiar? (128 - 100) / 128 ~ 22%. The fraction of the area that is not written matches the fraction of each row that is unused padding (added to align rows to 128 bytes)! It looks like the values are being written with no padding taken into account, so each row is written as logical width (100) bytes instead of pitch (128) bytes. Now, if I set the height to 49 instead of 50:

Note that the proper number of lines is filled. In fact, I only see problems when the extent height is the same as the height of the 3D array.

If I increase the depth to 2:

It does appear that cudaMemset3D is ‘forgetting’ about the padding in the pitch. The last part of buffer 1 is now filled (with values that should be going into buffer 2), and buffer 2 shows twice as much empty space. I can verify that the unwritten portion is again roughly 20%, the amount of unused space in the pitch, as we calculated earlier. Again, if I set the height to 49, we see correct results:

I can fill all the buffers by doing it in two passes if I want: the first pass fills the top half of the buffers, and the second pass fills the bottom half. This works, since the height of half a buffer is obviously not the same as the true extent/buffer height, but it seems much slower. In fact, the reference manual mentions that a memset may be faster when the extent.height and extent.width parameters match the pitchedPtr.xsize and pitchedPtr.ysize values; unfortunately, this is exactly the case where I get incorrect results!
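For reference, the two-pass version is roughly this (just a sketch, assuming FRAMEBUFFER_HEIGHT is even; my real code is a bit different):

[codebox]
// Sketch of the two-pass workaround: each pass uses an extent height smaller
// than ysize, so the (correct) non-optimized path is taken.
cudaExtent halfExtent = make_cudaExtent(sizeof(unsigned short) * FRAMEBUFFER_WIDTH,
                                        FRAMEBUFFER_HEIGHT / 2, m->numtriangles * 10);

// Pass 1: top half of every slice.
cudaMemset3D(deviceBuffer, 0xFF, halfExtent);

// Pass 2: bottom half of every slice. Same pitch and ysize, so the slice
// stride (pitch * ysize) is unchanged; only the base pointer moves down.
cudaPitchedPtr lowerHalf = deviceBuffer;
lowerHalf.ptr = (char*)deviceBuffer.ptr + (FRAMEBUFFER_HEIGHT / 2) * deviceBuffer.pitch;
cudaMemset3D(lowerHalf, 0xFF, halfExtent);
[/codebox]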

So, the bottom line: is this a bug in cudaMemset3D (and if so, what is a workaround), or am I just doing something incorrectly? Thanks for your help!

Mark

PS Can I use cudaMemset2D to do the same task, as long as I adjust the height to compensate for the number of slices?
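Something like this is what I have in mind (untested sketch; it relies on the slices being laid out contiguously, one after another):

[codebox]
// Untested sketch: treat all depth * height rows as one tall 2D region.
cudaMemset2D(deviceBuffer.ptr, deviceBuffer.pitch, 0xFF,
             sizeof(unsigned short) * FRAMEBUFFER_WIDTH,
             FRAMEBUFFER_HEIGHT * m->numtriangles * 10);
[/codebox]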

I believe that cudaExtent is defined in terms of array elements, not byte size.

N.

Hmm… the reference manual documentation for struct cudaExtent says each field (width, height and depth) is in bytes. Even if it were in array elements, that still wouldn’t explain the results I’m seeing. Also, the other 3D functions, such as cudaMemcpy3D, work properly for me with extents defined in bytes.

That’s possible, page 39 of the reference manual states:

So it makes sense if you’re not using arrays, kind of confusing though :)

N.

Hmm… well, that is a bit confusing alongside the part of the manual I was looking at. The passage you quote is from the documentation for cudaMemcpy3D, while I am using cudaMemSET3D. However, I have used cudaMemcpy3D successfully, and that call (for me) does use bytes, since I am using linear memory allocated with cudaMalloc3D, not array memory allocated with cudaMalloc3DArray. You’ll notice that the page you refer to says that when an array is not part of the copy, extents are defined in terms of elements of unsigned char, which are always 1 byte; hence extents are effectively defined in bytes.
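For what it’s worth, here is roughly how I have been calling cudaMemcpy3D with linear memory (hostBuffer is a hypothetical tightly packed host array, shown only for illustration):

[codebox]
// Sketch: copying a tightly packed host array into the pitched device buffer.
// The extent width is given in bytes, since no cudaArray is involved.
cudaMemcpy3DParms p = {0};
p.srcPtr = make_cudaPitchedPtr(hostBuffer,
                               sizeof(unsigned short) * FRAMEBUFFER_WIDTH,  // host pitch (no padding)
                               sizeof(unsigned short) * FRAMEBUFFER_WIDTH,  // xsize in bytes
                               FRAMEBUFFER_HEIGHT);
p.dstPtr = deviceBuffer;
p.extent = make_cudaExtent(sizeof(unsigned short) * FRAMEBUFFER_WIDTH,
                           FRAMEBUFFER_HEIGHT, m->numtriangles * 10);
p.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);
[/codebox]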

However, this is rather a moot point since I’m using cudaMemset3D, and I can clearly see that byte-based fill behaviour is being used: when I use an extent height that is smaller than the full buffer height, the region is filled properly.

Thanks for your help, and looking into this issue with me!

Mark

Hi again,

If you look at page 50 of the reference manual:

So my guess is that this could be an optimization bug.

N.

Yes, that was my guess too.

I guess I didn’t say specifically that I thought the bug was in the optimized part of the call. Who should I report this to? I have a workaround in place in my code, but maybe a developer could look into this and let me know whether it really is a bug in cudaMemset3D.

Mark

Have you tried visualizing the padded slice instead of the 50x50 slice to get a better idea of what’s going on?

N.

All right, it took a little hacking, because it wasn’t supported in my original code, but here’s a visualization of the fully padded/pitched memory:

This is how the buffer should look after processing. As you can see, the 14 columns on the right are unfilled (the buffer is initially filled with white). The buffer width is 50 elements; the logical byte width is 100 since the elements are unsigned shorts, and the pitch is 128. That leaves 28 bytes of padding per row, giving 14 extra 2-byte columns in the visualization.
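In case it helps anyone reproduce this, the readback for the visualization is roughly the following (a sketch; hostSlice is just a throwaway host buffer):

[codebox]
// Sketch: read back one full pitched slice, padding included, for display.
size_t sliceBytes = deviceBuffer.pitch * FRAMEBUFFER_HEIGHT;
unsigned short* hostSlice = (unsigned short*)malloc(sliceBytes);
cudaMemcpy(hostSlice, deviceBuffer.ptr, sliceBytes, cudaMemcpyDeviceToHost);
// Each row now holds pitch / 2 = 64 shorts: 50 real columns plus 14 padding columns.
[/codebox]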

Here is the buffer being written over with a gray color by cudaMemset3D. The extent is set to the full width and height, and 1 slice:

As you can see, the padding is being filled, and the true buffer portion is not completely wiped.

Here is the buffer being filled with the extent set to the true width, the height minus 1, and 1 slice:

No bug in this one.

I’m absolutely convinced now that there is a bug in the optimized memset path that occurs when the extents match the size of the buffer in width and height.

Mark

I’m convinced as well, here’s my test code:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define FRAMEBUFFER_WIDTH  50
#define FRAMEBUFFER_HEIGHT 50
#define FRAMEBUFFER_DEPTH  2

int main() {
    cudaSetDevice(0);

    cudaError_t ret;
    cudaPitchedPtr deviceBuffer;
    cudaExtent bufferExtent = make_cudaExtent(sizeof(unsigned short) * FRAMEBUFFER_WIDTH,
                                              FRAMEBUFFER_HEIGHT, FRAMEBUFFER_DEPTH);

    if ((ret = cudaMalloc3D(&deviceBuffer, bufferExtent)) != cudaSuccess) {
        printf("Couldn't allocate for buffer in device!\nMemory error: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    // Pick a random byte value and the corresponding short value to check against.
    srand(time(0));
    unsigned char byteval = rand() & 0xFF;
    unsigned short shortval = byteval | (byteval << 8);
    const size_t bytesize = deviceBuffer.pitch * FRAMEBUFFER_HEIGHT * FRAMEBUFFER_DEPTH;

    fprintf(stderr, "deviceBuffer:\npitch \t= %zu\nptr \t= %p\nxsize \t= %zu\nysize \t= %zu\n",
            deviceBuffer.pitch, deviceBuffer.ptr, deviceBuffer.xsize, deviceBuffer.ysize);

    //! This one works
    cudaMemset2D(deviceBuffer.ptr, deviceBuffer.pitch, byteval,
                 FRAMEBUFFER_WIDTH * sizeof(unsigned short), FRAMEBUFFER_DEPTH * FRAMEBUFFER_HEIGHT);

    //! This one doesn't
    //cudaMemset3D(deviceBuffer, byteval, bufferExtent);

    // Copy the whole pitched allocation back and check every element of the
    // logical (unpadded) region against the expected fill value.
    unsigned short* hdata;
    cudaMallocHost((void**)&hdata, bytesize);
    cudaMemcpy(hdata, deviceBuffer.ptr, bytesize, cudaMemcpyDeviceToHost);

    unsigned int shortpitch = (deviceBuffer.pitch >> 1);  // pitch in unsigned shorts
    for (unsigned int z = 0; z < FRAMEBUFFER_DEPTH; ++z)
        for (unsigned int y = 0; y < FRAMEBUFFER_HEIGHT; ++y)
            for (unsigned int x = 0; x < FRAMEBUFFER_WIDTH; ++x) {
                unsigned short tmp = hdata[z * FRAMEBUFFER_HEIGHT * shortpitch + y * shortpitch + x];
                if (tmp != shortval) {
                    fprintf(stderr, "Error: [%4u,%4u,%4u] -> %08x %08x\n", x, y, z, tmp, shortval);
                    break;
                }
            }

    cudaFreeHost(hdata);
    cudaFree(deviceBuffer.ptr);
}
[/codebox]

N.

Thanks for the nice test case to prove it. A quick question… if you set your depth to, say, 10000, did you notice that cudaMemset2D over the entire 3D space is much slower than cudaMemset3D? In my app it’s so slow that it’s quicker for me to fill a temporary buffer of, say, WIDTH * HEIGHT * 10 in host memory and copy it over repeatedly until I’ve overwritten enough device memory to simulate the memset. cudaMemset3D, however, was quick (and buggy, of course :P ).
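Roughly what I’m doing instead (a simplified sketch; numSlices and CHUNK_SLICES are just illustrative names):

[codebox]
// Sketch: fill a host chunk of a few slices with 0xFF once, then copy it
// repeatedly until the whole device allocation has been overwritten.
const size_t CHUNK_SLICES = 10;                          // illustrative chunk size
const size_t numSlices    = m->numtriangles * 10;
const size_t sliceBytes   = deviceBuffer.pitch * FRAMEBUFFER_HEIGHT;

unsigned char* hostChunk = (unsigned char*)malloc(sliceBytes * CHUNK_SLICES);
memset(hostChunk, 0xFF, sliceBytes * CHUNK_SLICES);

for (size_t z = 0; z < numSlices; z += CHUNK_SLICES) {
    size_t slices = (z + CHUNK_SLICES <= numSlices) ? CHUNK_SLICES : numSlices - z;
    cudaMemcpy((char*)deviceBuffer.ptr + z * sliceBytes, hostChunk,
               slices * sliceBytes, cudaMemcpyHostToDevice);
}
free(hostChunk);
[/codebox]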

Mark

Since you’re not using the padded values anyway, wouldn’t it be faster to use a 1D cudaMemset which includes the padding?

Can’t test it myself right now…
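Something like this is what I have in mind (untested), reusing the names from my test code above:

[codebox]
// Untested sketch: one 1D memset over the whole allocation, padding included.
size_t totalBytes = deviceBuffer.pitch * FRAMEBUFFER_HEIGHT * FRAMEBUFFER_DEPTH;
cudaMemset(deviceBuffer.ptr, 0xFF, totalBytes);
[/codebox]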

N.

I guess I could try it. I just assumed that since cudaMemset2D was so slow, a 1D version might be slow as well. After all, why would there be a cudaMemset3D if there weren’t some benefit to a dedicated method for 3D memory? Just a thought; maybe I’ll try it out later today if time permits, since I have lots of other stuff to do first.

Mark