Performance of conversion byte->float vs float->byte

Based on the Parallel Forall blog post http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/ about an efficient matrix transpose, I set up my own copy kernel to use as a benchmark when measuring the performance of other kernels.

The first two kernels I measured convert between single-channel byte (unsigned char) and float images. I was surprised to see that the kernel for converting from float to byte (Convert_32f_8u) achieves a bandwidth of around 70% of the copy kernel, whereas the kernel for converting from byte to float (Convert_8u_32f) was more like 40%. Can anyone spot anything obvious in the kernels below that would suggest why this should be? The operations involved are nearly identical.

#define TILE_DIM 32
#define BLOCK_ROWS 4

struct CUDAImageBase
{
	UINT nWidth;
	UINT nHeight;
	UINT nPitch;
	void* pData;

};

///////////////////////////////////////////////////////////////////////////////////////////////
__global__ void CopyTest(const CUDAImageBase srcImage, CUDAImageBase destImage)
{
	// Work out the location of the pixel being accessed by this thread
	int x = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;

	// Ensure x is within image bounds
	if (x < srcImage.nWidth)
	{
		const float* pSrcData = (float*)((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x;
		float* pDestData = (float*)((unsigned char *)destImage.pData + y * destImage.nPitch) + x;
		int nSrcOffset = srcImage.nPitch * (BLOCK_ROWS / sizeof(float));
		int nDestOffset = destImage.nPitch * (BLOCK_ROWS / sizeof(float));

		// Loop for each row
		int iMax = min((int)TILE_DIM, (int)srcImage.nHeight - y);
		for (int i = 0; i < iMax; i += BLOCK_ROWS)
		{
			// Copy data
			pDestData[0] = pSrcData[0];

			// Move on to next row
			pSrcData += nSrcOffset;
			pDestData += nDestOffset;
		}
	}

} // CopyTest



///////////////////////////////////////////////////////////////////////////////////////////////
__global__ void Convert_8u_32f(const CUDAImageBase srcImage,
							   CUDAImageBase destImage,
							   const float fScale,
							   const float fOffset)
{
	// Work out the location of the pixel being accessed by this thread
	int x = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;

	// Ensure x is within image bounds
	if (x < srcImage.nWidth)
	{
		const unsigned char* pSrcData = ((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x;
		float* pDestData = (float*)((unsigned char *)destImage.pData + y * destImage.nPitch) + x;
		int nSrcOffset = srcImage.nPitch * BLOCK_ROWS;
		int nDestOffset = destImage.nPitch * (BLOCK_ROWS / sizeof(float));

		// Loop for each row
		int iMax = min((int)TILE_DIM, (int)srcImage.nHeight - y);
		for (int i = 0; i < iMax; i += BLOCK_ROWS)
		{
			// Convert, scale and offset
			pDestData[0] = __uint2float_rn(pSrcData[0]) * fScale + fOffset;

			// Move on to next row
			pSrcData += nSrcOffset;
			pDestData += nDestOffset;
		}
	}

} // Convert_8u_32f



///////////////////////////////////////////////////////////////////////////////////////////////
__global__ void Convert_32f_8u(const CUDAImageBase srcImage,
							   CUDAImageBase destImage,
							   const float fScale,
							   const float fOffset)
{
	// Work out the location of the pixel being accessed by this thread
	int x = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;

	// Ensure x is within image bounds
	if (x < srcImage.nWidth)
	{
		const float* pSrcData = (float*)((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x;
		unsigned char* pDestData = ((unsigned char *)destImage.pData + y * destImage.nPitch) + x;
		int nSrcOffset = srcImage.nPitch * (BLOCK_ROWS / sizeof(float));
		int nDestOffset = destImage.nPitch * BLOCK_ROWS;

		// Loop for each row
		int iMax = min((int)TILE_DIM, (int)srcImage.nHeight - y);
		for (int i = 0; i < iMax; i += BLOCK_ROWS)
		{
			// Convert, scale and clamp to the byte range
			pDestData[0] = min(255, __float2uint_rn(pSrcData[0] * fScale + fOffset));

			// Move on to next row
			pSrcData += nSrcOffset;
			pDestData += nDestOffset;
		}
	}

} // Convert_32f_8u

I’m testing on SM 3.0 cards, so my understanding is that alignment shouldn’t be a big deal. All the kernels are fast (and appear to function correctly), but the discrepancy in speeds is confusing me, so any suggestions would be appreciated.

Thanks,
Ewan

Outstanding loads are tracked inside the GPU by hardware queues of finite length. When individual accesses are small, the total memory traffic attributed to the queued loads is not sufficient to saturate the read bandwidth. On some GPUs bandwidth saturation can be achieved with 32-bit accesses, on others 64-bit accesses are needed. If possible (i.e. where alignment requirements allow it), access byte-sized data in groups of four as uchar4, which makes each individual access a 32-bit access.

For streaming applications like this, on most processors, whether CPUs or GPUs, optimizing the read traffic is usually crucial for performance, whereas stores are typically much less important (they are pretty much “fire and forget” on the GPU). However, it is probably a good idea to access byte-sized data via uchar4 for both reads and writes to maximize performance.
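For example, inside your Convert_8u_32f the read could be grouped along these lines (illustration only; it assumes the byte rows are 4-byte aligned and that x counts groups of four pixels rather than single pixels):

	// Illustration only: one 32-bit load per thread instead of one 8-bit load.
	// Assumes srcImage.nPitch is a multiple of 4 and x indexes groups of 4 pixels.
	const uchar4* pSrcData = (const uchar4*)((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x;
	uchar4 v = pSrcData[0];   // bytes for pixels 4*x .. 4*x+3 in a single access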

OK, thanks. So for the case of converting from byte to float, this would suggest one of two approaches:

  1. Have each thread read a single uchar4 and output 4 floats
  2. Have a block of n threads read n/4 uchar4 values (using a subset of the threads) into shared memory and then have each thread write out a single float.

It seems pretty clear that the first option should perform better, so I’ll try that and see what difference it makes.
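For reference, this is roughly what I have in mind for option 1 (untested sketch; it assumes the image width and both pitches are multiples of 4, that the float rows are 16-byte aligned so the float4 store is legal, and that the grid's x dimension is reduced by a factor of 4):

///////////////////////////////////////////////////////////////////////////////////////////////
__global__ void Convert_8u_32f_v4(const CUDAImageBase srcImage,
								  CUDAImageBase destImage,
								  const float fScale,
								  const float fOffset)
{
	// x4 indexes a group of 4 pixels; the grid's x dimension covers nWidth / 4
	int x4 = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;

	// Ensure the pixel group is within image bounds
	if (x4 < srcImage.nWidth / 4)
	{
		const uchar4* pSrcData = (const uchar4*)((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x4;
		float4* pDestData = (float4*)((unsigned char *)destImage.pData + y * destImage.nPitch) + x4;
		int nSrcOffset = srcImage.nPitch * BLOCK_ROWS / sizeof(uchar4);
		int nDestOffset = destImage.nPitch * BLOCK_ROWS / sizeof(float4);

		// Loop for each row, exactly as in the scalar kernel
		int iMax = min((int)TILE_DIM, (int)srcImage.nHeight - y);
		for (int i = 0; i < iMax; i += BLOCK_ROWS)
		{
			// One 32-bit load, four conversions, one 128-bit store
			uchar4 in = pSrcData[0];
			float4 out;
			out.x = __uint2float_rn(in.x) * fScale + fOffset;
			out.y = __uint2float_rn(in.y) * fScale + fOffset;
			out.z = __uint2float_rn(in.z) * fScale + fOffset;
			out.w = __uint2float_rn(in.w) * fScale + fOffset;
			pDestData[0] = out;

			// Move on to next row
			pSrcData += nSrcOffset;
			pDestData += nDestOffset;
		}
	}

} // Convert_8u_32f_v4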

Thanks,
Ewan

So to follow up, I modified the kernels to operate on uchar4/float4 whenever the image width is a multiple of 4. Converting byte to float now runs 2x faster than before, and there was a smaller speedup for the float to byte conversion, as predicted. Both kernels now run at approximately the same speed, which is about 80% of the speed of the idealised copy kernel.
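In case it is useful to anyone finding this later, the float to byte direction looks roughly like this (sketch only; same assumptions as above about the width and both pitches being multiples of 4 and the float rows being 16-byte aligned):

///////////////////////////////////////////////////////////////////////////////////////////////
__global__ void Convert_32f_8u_v4(const CUDAImageBase srcImage,
								  CUDAImageBase destImage,
								  const float fScale,
								  const float fOffset)
{
	// x4 indexes a group of 4 pixels; the grid's x dimension covers nWidth / 4
	int x4 = blockIdx.x * TILE_DIM + threadIdx.x;
	int y = blockIdx.y * TILE_DIM + threadIdx.y;

	// Ensure the pixel group is within image bounds
	if (x4 < srcImage.nWidth / 4)
	{
		const float4* pSrcData = (const float4*)((unsigned char *)srcImage.pData + y * srcImage.nPitch) + x4;
		uchar4* pDestData = (uchar4*)((unsigned char *)destImage.pData + y * destImage.nPitch) + x4;
		int nSrcOffset = srcImage.nPitch * BLOCK_ROWS / sizeof(float4);
		int nDestOffset = destImage.nPitch * BLOCK_ROWS / sizeof(uchar4);

		// Loop for each row, exactly as in the scalar kernel
		int iMax = min((int)TILE_DIM, (int)srcImage.nHeight - y);
		for (int i = 0; i < iMax; i += BLOCK_ROWS)
		{
			// One 128-bit load, four conversions with clamping, one 32-bit store
			float4 in = pSrcData[0];
			uchar4 out;
			out.x = (unsigned char)min(255u, __float2uint_rn(in.x * fScale + fOffset));
			out.y = (unsigned char)min(255u, __float2uint_rn(in.y * fScale + fOffset));
			out.z = (unsigned char)min(255u, __float2uint_rn(in.z * fScale + fOffset));
			out.w = (unsigned char)min(255u, __float2uint_rn(in.w * fScale + fOffset));
			pDestData[0] = out;

			// Move on to next row
			pSrcData += nSrcOffset;
			pDestData += nDestOffset;
		}
	}

} // Convert_32f_8u_v4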

Thanks,
Ewan