Strange behavior of cudaMemset

Hello,

I want to use the cudaMemset function to initialize an image with zeros. The individual pixel values are stored as chars (gray-scale only). The image has a specific width and height, of course. For example, if I have an 800x600 image, I allocate memory and then call cudaMemset(d_ptr_image, 0, width * height * sizeof(char)) to set all pixel values to zero… this works perfectly.

Another example: now I have an 802x602 image. I do the same as above, but now the last 1198 pixels are not set to zero. For 804x604 images it works again. I think this has something to do with the global memory layout, because there are no problems whenever width and height are multiples of four. Maybe CUDA combines four chars into one 32-bit word or something like that.
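For reference, here is a minimal sketch of the pattern I'm using (error checking omitted). cudaMemset takes a plain byte count, so with a linear cudaMalloc allocation it should touch exactly width * height bytes, independent of the image geometry:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int width = 802, height = 602;

    /* Linear allocation: exactly width*height bytes, no row padding. */
    unsigned char *d_ptr_image;
    cudaMalloc((void **)&d_ptr_image, width * height);

    /* cudaMemset is purely byte-count based; this zeroes every one of
       the width*height bytes, regardless of width and height. */
    cudaMemset(d_ptr_image, 0, width * height * sizeof(char));

    cudaFree(d_ptr_image);
    return 0;
}
```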

My question is whether this problem is known to NVIDIA and whether my guess is right. It would be nice if someone knew a work-around.
And if this really is a bug in the CUDA API, will there be a fix in the next version?

I’m looking forward to receiving replies.

Best regards,
chrizh

I have solved my problem… a silly mistake on my part. First, when dealing with images it is recommended to use cudaMallocPitch() for allocation and cudaMemcpy2D() for memory transfers, because of memory alignment. I used OpenCV to load an image and then copied it to the device. The OpenCV data structure is called IplImage, and it also pads each row for performance reasons. So when copying from device to host or vice versa, you have to use the correct pitches: for the image on the GPU, the pitch returned by cudaMallocPitch(), and in the case of IplImage, the widthStep attribute.
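A minimal sketch of that workflow, using the plain CUDA runtime API (error checking omitted; hostPitch stands in for IplImage's widthStep, and cudaMemset2D is the pitch-aware counterpart of the cudaMemset call from my first post):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int width  = 802;   /* bytes of pixel data per row (1 channel, 1 byte/pixel) */
    const int height = 602;

    /* Host image with its own row alignment, like OpenCV's IplImage:
       widthStep may be larger than width.  Here we pad rows to a
       multiple of 4 bytes purely for illustration. */
    size_t hostPitch = (width + 3) & ~(size_t)3;   /* stands in for IplImage::widthStep */
    unsigned char *h_img = (unsigned char *)malloc(hostPitch * height);

    /* Pitched device allocation: devPitch is chosen by the runtime and
       is usually larger than width. */
    unsigned char *d_img;
    size_t devPitch;
    cudaMallocPitch((void **)&d_img, &devPitch, width, height);

    /* Zero every row correctly -- cudaMemset2D knows about the pitch. */
    cudaMemset2D(d_img, devPitch, 0, width, height);

    /* Copy host -> device, giving each side its own pitch. */
    cudaMemcpy2D(d_img, devPitch, h_img, hostPitch,
                 width, height, cudaMemcpyHostToDevice);

    /* ... kernels would index rows as d_img + y * devPitch ... */

    cudaFree(d_img);
    free(h_img);
    return 0;
}
```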

In conclusion, it was a memory alignment mistake on my part :)

Interesting. So if I’m thinking about this correctly, in the case of RGB images you wouldn’t be able to use cudaMallocPitch() like this, because each (i,j) pixel value is a vector rather than a scalar as in your gray-scale example.

You can use cudaMallocPitch(); in fact it is even more convenient for multi-channel images like RGB. For instance, if you have an RGB image 802 pixels wide and 802 pixels high (3 channels), the number of bytes per row is not simply width * channels * sizeof(char); each row is padded to an aligned size. The image is stored linearly in memory, like R G B R G B R G B R … until the end of the row. There the row is padded, so that single elements, and in particular the start of the next row, can be accessed more easily and quickly. OpenCV's IplImage pads each row to a multiple of 4 bytes (that is its widthStep), and cudaMallocPitch() pads the device rows as well, usually to a much larger alignment, and returns the resulting pitch.
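To make the padding concrete, here is a sketch of the host-side arithmetic (the round-up-to-4 rule matches IplImage's default widthStep alignment; on the device, you would use whatever pitch cudaMallocPitch() actually returns instead of computing it yourself):

```cuda
#include <stdio.h>

int main(void)
{
    int width = 802, channels = 3;         /* 3 channels, 1 byte per channel      */
    int rowBytes  = width * channels;      /* 2406 bytes of actual pixel data     */
    int widthStep = (rowBytes + 3) & ~3;   /* rounded up to a multiple of 4: 2408 */

    printf("data bytes per row: %d, padded row size: %d\n", rowBytes, widthStep);
    return 0;
}
```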

As you said, each RGB pixel is a vector… it might not be obvious how the values are laid out in memory, but normally they are stored linearly (interleaved), so you can access them by simply iterating through the image array. But be aware of the memory alignment on host and device, as I described above. With cudaMemcpy2D() you tell CUDA both pitches, the actual number of bytes per row, and the height of the image you want to copy to or from the device. When you do this, everything is fine.
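Iterating through such a pitched, interleaved image from a kernel might look like this (a sketch; the kernel name and operation are made up for illustration, the important lines are the two pointer computations):

```cuda
/* Kernel sketch: touching pixel (x, y) of a 3-channel, 1-byte-per-channel
   image whose rows are spaced `pitch` bytes apart (pitch >= width * 3). */
__global__ void invert(unsigned char *img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char *row = img + y * pitch;   /* step by pitch, not width * 3 */
    unsigned char *px  = row + 3 * x;       /* interleaved R G B            */
    px[0] = 255 - px[0];
    px[1] = 255 - px[1];
    px[2] = 255 - px[2];
}
```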

I hope this clarifies my answer.
