I’ve got a large 3d array of frame buffers. There are (10 * m->numtriangles) buffers (5 framebuffers, 5 depthbuffers per triangle). The elements of the buffers are unsigned shorts. I allocate them like this:
[codebox]
cudaPitchedPtr deviceBuffer;
cudaExtent bufferExtent = make_cudaExtent(sizeof(unsigned short) * FRAMEBUFFER_WIDTH, FRAMEBUFFER_HEIGHT, m->numtriangles * 10);
if ((ret = cudaMalloc3D(&deviceBuffer, bufferExtent)) != cudaSuccess) {
printf("Couldn't allocate for buffer in device!\nMemory error: %s\n", cudaGetErrorString(ret));
quit(1);
}
[/codebox]
FRAMEBUFFER_WIDTH and FRAMEBUFFER_HEIGHT are both 50. This means the logical width is 100 (2 bytes per short * 50) and the pitch returned is 128 (28 unused bytes).
Now I need to clear them all to the 0xFFFF unsigned short value (equivalently, every byte will need to be set to 0xFF):
[codebox]
if ((ret = cudaMemset3D(deviceBuffer, 0xFF, bufferExtent)) != cudaSuccess) {
printf("Couldn't set buffer in device!\nMemory error: %s\n", cudaGetErrorString(ret));
quit(1);
}
[/codebox]
When I do this, I have a problem that a portion of the buffer memory is not cleared! The last few buffers will not be overwritten with 0xFF. I tried setting the extents so that
[codebox]
bufferExtent.depth = 1;
[/codebox]
in order to try clearing just one slice. Here are the results (the white is the previous buffer contents, the gray is value I’m passing to cudaMemset3D (changed from white so it’s easier to see), and the first slice is the leftmost square).
(sorry the pictures are kind of small)
Notice that the entire buffer is not overwritten. By measuring the size of the white and gray areas, almost exactly 20% is not overwritten. Sound familiar? (128 - 100) / 128 ~ 20%. The ratio of area that is not written is the same as the ratio of memory that is unused (to align with 128 byte rows)! It looks like the values are being written but no padding is being used, so every row is written into logical (100) bytes, instead of pitch (128) bytes. Now, if I set the height to 49 instead of 50:
Note that the proper number of lines is filled. In fact, I only see problems when the extent height is the same as the height of the 3D array.
If I increase the depth to 2:
It does appear that cudaMemset3D is ‘forgetting’ about the padding in the pitch. The last part of buffer 1 is now filled (with values that should be going into buffer 2) and buffer 2 is showing twice as much empty space. I can verify that the number of unwritten buffers is approximately 20% (the ammount of unused space in the pitch, as we calculated earlier). Again, if I set the height to 49 we see correct results:
I can fill all the buffers by doing them in 2 passes if I want; the first passes fills the top half of the buffers, and the second pass fills the bottom half. This works, since the height of half the buffer is obviously not the same as the true extent/buffer height, but this seems much much slower. In fact, the reference manual mentions that it may be faster to perform a memset if the extent.height and extent.width parameters match up with the pitchedPtr.xsize and pitchedPtr.ysize values; unfortunately, this is exactly the case where I get incorrect results!!
So, the bottom line is, is this a bug in cudaMemset3D (and if so, what is a workaround)? or am I just doing something incorrectly? Thanks for your help!!
Mark
PS Can I used cudaMemset2D to do the same task, as long as I adjust the height to compensate for the number of slices?