Kernel for pixel format conversions

dstankewitz · October 14, 2020, 10:24am

I’m currently investigating CUDA for enhancing the performance of my application regarding image conversion.

I get a frame buffer (VUYA 4:4:4:4 packed) as float* and would like to convert it to something like ayuv 4:4:4:4 16 bit packed.

Are there any existing kernels performing better then my first try below?

// VUYA 4444 32f -> ayuv64le
__global__ void convert_kernel(const float4* d_src, ushort4* d_dst, const int width, const int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;

    const int p = y * width + x;
    const int p2 = (height - 1 - y) * width + x;

    d_dst[p2].x = (unsigned short)(d_src[p].w * 65535.0f); // A
    d_dst[p2].y = (unsigned short)(((d_src[p].z + 0.07306f) / 1.16438f) * 65535.0f); // Y
    d_dst[p2].z = (unsigned short)(((d_src[p].y + 0.57143f) / 1.14286f) * 65535.0f); // U
    d_dst[p2].w = (unsigned short)(((d_src[p].x + 0.57143f) / 1.14286f) * 65535.0f); // V
}

njuffa · October 14, 2020, 6:50pm

I would suggest running this code through the CUDA profiler. It looks bound by memory throughput to me, so make sure to maximize effective bandwidth.

Mathematically, you can simplify ((X + C0) / C1) * C2 into a single fmaf (X, C2/C1, C0*C2/C1), but that is unlikely to provide benefits in memory-throughput bound code.

The code seems to flip the image vertically in addition to simply converting the format.

dstankewitz · October 14, 2020, 8:19pm

Thanks for the suggestions. I changed it a bit and removed the divisions:

d_dst[p2].x = (unsigned short)(d_src[p].w * 65535.0f); // A
d_dst[p2].y = (unsigned short)((d_src[p].z + 0.07306f) * 56294.565f); // Y
d_dst[p2].z = (unsigned short)((d_src[p].y + 0.57143f) * 57343.125f); // U
d_dst[p2].w = (unsigned short)((d_src[p].x + 0.57143f) * 57343.125f); // V

I ran it in the profiler and it seems the kernel itself is actually not that much of an issue and it’s already 3 times faster than using AVX2. It’s rather the cudaMemCopy() to the device and back to the host that causes a delay of ~10ms. Is there a chance or technique to speed this up a bit?

What’s also wondering me is that my Quadro P2000 is way faster than an RTX 2080 TI.

njuffa · October 14, 2020, 8:30pm

When transforming the math why stop stop halfway? The CUDA compiler is conservative (which is a Good Thing™ for numerical code) and won’t re-associate floating-point computation other than applying fmul/fadd contraction into fma.

fmaf (d_src[p].z, 65535.0f/1.16438f, 0.07306f*65535.0f/1.16438f);

BTW, I assume those magic numbers in your code are actually fractions of some sort, e.g. 1.16438 = 298/256? If so, it would be clearer to write them accordingly so someone who reads the code can more easily see what is going on. The compiler is going to evaluate compile-time constant expressions just fine, leaving a single constant.

(1) Make sure the PCIe interface is actually operating at full PCIe gen3 x16 speeds. In many systems, only one or two of the PCIe slots are configured as x16. (2) Copy the data in as huge chunks as possible, to minimize overhead (3) Pin the host side memory you are transferring to/from. As a general strategy, minimize data movement between host and device. If this kernel is the only thing you are doing with the data, performance will be limited by anemic PCIe throughput.

Let the CUDA profiler guide you. Are both executables compiled with full optimization and for the correct GPU target architecture? Are the memory access patterns of the kernel optimal? Does anything change when you try restricted pointers (I doubt this makes a difference here but good practice):

__global__ void convert_kernel (const float4* __restrict__ d_src, ushort4* __restrict__ d_dst, const int width, const int height)

njuffa · October 14, 2020, 9:50pm

As far as I can determine from rummaging around the internet, the magic numbers in the code are from YCbCr standard color space conversion:

Y’ = 16 + 219 * Y ==> 16/219 ~= 0.07306, 255/219 ~= 1.16438
C_B = 128 + 224 * P_B ==> 128/224 ~= 0.57143, 256/224 ~= 1.14286
C_R = 128 + 224 * P_R ==> 128/224 ~= 0.57143, 256/224 ~= 1.14286

dstankewitz · October 16, 2020, 4:58pm

Yes, the magic numbers are also adjusting the video levels (limited range).

I have changed the code like this:

#pragma once

#include "cuda_runtime.h"

class CudaConversionService
{
public:
	CudaConversionService(const int width, const int height);
	virtual ~CudaConversionService();
	bool ConvertToAVUY64(const float* src, char* dst);

private:
	const int width;
	const int height;
	const dim3 threadsPerBlock;
	const dim3 numBlocks;
	size_t sizeSrc;
	size_t sizeDst;
};

#include <math.h>

#include "convert.h"

#define BLOCK_SIZE 32

#define Y_FOOT (16.0f / 219.0f)
#define Y_HEAD (255.0f / 219.0f)
#define CBCR_FOOT (128.0f / 224.0f)
#define CBCR_HEAD (256.0f / 224.0f)
#define F_USHRT_MAX 65535.0f

__device__ float4* d_src;
__device__ ushort4* d_dst;

__global__ void ayuv64_kernel(const float4* __restrict__ d_src, ushort4* __restrict__ d_dst, const int width, const int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;

    const int p = y * width + x;
    const int p2 = (height - 1 - y) * width + x;

    // Swap channels and apply video levels
    d_dst[p2] = make_ushort4(
        (unsigned short)(d_src[p].w * F_USHRT_MAX),
        (unsigned short)fmaf(d_src[p].z, F_USHRT_MAX / Y_HEAD, Y_FOOT * F_USHRT_MAX / Y_HEAD),
        (unsigned short)fmaf(d_src[p].y, F_USHRT_MAX / CBCR_HEAD, CBCR_FOOT * F_USHRT_MAX / CBCR_HEAD),
        (unsigned short)fmaf(d_src[p].x, F_USHRT_MAX / CBCR_HEAD, CBCR_FOOT * F_USHRT_MAX / CBCR_HEAD)
    );
}

CudaConversionService::CudaConversionService(const int width, const int height):
    width(width), 
    height(height), 
    threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE),
    numBlocks(
        ceil((float)width / threadsPerBlock.x), 
        ceil((float)height / threadsPerBlock.y))
{}

CudaConversionService::~CudaConversionService()
{
    if (d_src)
        cudaFree(d_src);

    if (d_dst)
        cudaFree(d_dst);
}

bool CudaConversionService::Init()
{
    // Define buffer sizes
    sizeSrc = width * height * sizeof(float4);
    sizeDst = width * height * sizeof(ushort4);

    // Reserve memory on CPU and GPU
    if (cudaMalloc((void**)&d_src, sizeSrc) == cudaSuccess)
    {
        if (cudaMalloc((void**)&d_dst, sizeDst) == cudaSuccess)
            return true;
        
        cudaFree(d_src);
    }

    return true;
}

bool CudaConversionService::ConvertToAVUY64(const float* src, char* dst)
{
    // Copy framebuffer to device
    if (cudaMemcpy(d_src, src, sizeSrc, cudaMemcpyHostToDevice) == cudaSuccess)
    {
        ayuv64_kernel << < numBlocks, threadsPerBlock >> > (d_src, d_dst, width, height);

        if (cudaMemcpy(dst, d_dst, sizeDst, cudaMemcpyDeviceToHost) == cudaSuccess)
            return true;
    }

    return false;
}

I guess to optimize it for for speed it is neccessary to process multiple frames at once (with one cudaMemCopy) to reduce overhead.

All my GPU are using PCIe x16 3.0 (tested it with GPU-Z).

About the value for “Code Generation”: Currently it is set to “compute_52,sm_52” which provides most backward compatibility. But I’d also like to use the advantages of more modern GPUs. How can I configure this in Visual Studio?

njuffa · October 16, 2020, 7:16pm

Standard practice for CUDA-accelerated applications is to build fat binaries that include machine code (SASS) for all GPU architectures one needs to support, plus PTX for the latest GPU architecture. I do not use the MSVS IDE so cannot tell you where to dial in the necessary information. When you use nvcc from the command line, you would use the -gencode switch.

Topic		Replies	Views
Re_arranging Cuda array CUDA Programming and Performance	8	44	September 23, 2024
Optimizing memory coalescence doesn't make my program faster CUDA Programming and Performance	12	488	August 6, 2021
CUDA image processing Accelaration tips anyone? CUDA Programming and Performance	20	6062	November 16, 2010
Performance of conversion byte->float vs float->byte CUDA Programming and Performance	3	3984	April 29, 2014
Cuda demosaicing running slowly CUDA Programming and Performance	7	5686	November 16, 2011
Malvar-He-Cutler Demosaicing Code and optimization advice? CUDA Programming and Performance	11	6062	January 23, 2011
Converting a kernel from floats and ints to halfs is 6x slower CUDA Programming and Performance cuda	14	1016	October 16, 2023
Can this be optimized more? CUDA Programming and Performance	5	1046	January 8, 2015
Kernel Convolution with streams provides no benefit CUDA Programming and Performance	4	35	January 20, 2025
CUDA performance How to improve my code? CUDA Programming and Performance	16	13574	June 29, 2010

Kernel for pixel format conversions

Related topics