Kernel for pixel format conversions

I’m currently investigating CUDA for enhancing the performance of my application regarding image conversion.

I get a frame buffer (VUYA 4:4:4:4 packed) as float* and would like to convert it to something like ayuv 4:4:4:4 16 bit packed.

Are there any existing kernels performing better then my first try below?

// VUYA 4444 32f -> ayuv64le
__global__ void convert_kernel(const float4* d_src, ushort4* d_dst, const int width, const int height)
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)

    const int p = y * width + x;
    const int p2 = (height - 1 - y) * width + x;

    d_dst[p2].x = (unsigned short)(d_src[p].w * 65535.0f); // A
    d_dst[p2].y = (unsigned short)(((d_src[p].z + 0.07306f) / 1.16438f) * 65535.0f); // Y
    d_dst[p2].z = (unsigned short)(((d_src[p].y + 0.57143f) / 1.14286f) * 65535.0f); // U
    d_dst[p2].w = (unsigned short)(((d_src[p].x + 0.57143f) / 1.14286f) * 65535.0f); // V

I would suggest running this code through the CUDA profiler. It looks bound by memory throughput to me, so make sure to maximize effective bandwidth.

Mathematically, you can simplify ((X + C0) / C1) * C2 into a single fmaf (X, C2/C1, C0*C2/C1), but that is unlikely to provide benefits in memory-throughput bound code.

The code seems to flip the image vertically in addition to simply converting the format.

Thanks for the suggestions. I changed it a bit and removed the divisions:

d_dst[p2].x = (unsigned short)(d_src[p].w * 65535.0f); // A
d_dst[p2].y = (unsigned short)((d_src[p].z + 0.07306f) * 56294.565f); // Y
d_dst[p2].z = (unsigned short)((d_src[p].y + 0.57143f) * 57343.125f); // U
d_dst[p2].w = (unsigned short)((d_src[p].x + 0.57143f) * 57343.125f); // V

I ran it in the profiler and it seems the kernel itself is actually not that much of an issue and it’s already 3 times faster than using AVX2. It’s rather the cudaMemCopy() to the device and back to the host that causes a delay of ~10ms. Is there a chance or technique to speed this up a bit?

What’s also wondering me is that my Quadro P2000 is way faster than an RTX 2080 TI.

When transforming the math why stop stop halfway? The CUDA compiler is conservative (which is a Good Thing™ for numerical code) and won’t re-associate floating-point computation other than applying fmul/fadd contraction into fma.

fmaf (d_src[p].z, 65535.0f/1.16438f, 0.07306f*65535.0f/1.16438f);

BTW, I assume those magic numbers in your code are actually fractions of some sort, e.g. 1.16438 = 298/256? If so, it would be clearer to write them accordingly so someone who reads the code can more easily see what is going on. The compiler is going to evaluate compile-time constant expressions just fine, leaving a single constant.

(1) Make sure the PCIe interface is actually operating at full PCIe gen3 x16 speeds. In many systems, only one or two of the PCIe slots are configured as x16. (2) Copy the data in as huge chunks as possible, to minimize overhead (3) Pin the host side memory you are transferring to/from. As a general strategy, minimize data movement between host and device. If this kernel is the only thing you are doing with the data, performance will be limited by anemic PCIe throughput.

Let the CUDA profiler guide you. Are both executables compiled with full optimization and for the correct GPU target architecture? Are the memory access patterns of the kernel optimal? Does anything change when you try restricted pointers (I doubt this makes a difference here but good practice):

__global__ void convert_kernel (const float4* __restrict__ d_src, ushort4* __restrict__ d_dst, const int width, const int height)

As far as I can determine from rummaging around the internet, the magic numbers in the code are from YCbCr standard color space conversion:

Y’ = 16 + 219 * Y ==> 16/219 ~= 0.07306, 255/219 ~= 1.16438
CB = 128 + 224 * PB ==> 128/224 ~= 0.57143, 256/224 ~= 1.14286
CR = 128 + 224 * PR ==> 128/224 ~= 0.57143, 256/224 ~= 1.14286

Yes, the magic numbers are also adjusting the video levels (limited range).

I have changed the code like this:

#pragma once

#include "cuda_runtime.h"

class CudaConversionService
	CudaConversionService(const int width, const int height);
	virtual ~CudaConversionService();
	bool ConvertToAVUY64(const float* src, char* dst);

	const int width;
	const int height;
	const dim3 threadsPerBlock;
	const dim3 numBlocks;
	size_t sizeSrc;
	size_t sizeDst;

#include <math.h>

#include "convert.h"

#define BLOCK_SIZE 32

#define Y_FOOT (16.0f / 219.0f)
#define Y_HEAD (255.0f / 219.0f)
#define CBCR_FOOT (128.0f / 224.0f)
#define CBCR_HEAD (256.0f / 224.0f)
#define F_USHRT_MAX 65535.0f

__device__ float4* d_src;
__device__ ushort4* d_dst;

__global__ void ayuv64_kernel(const float4* __restrict__ d_src, ushort4* __restrict__ d_dst, const int width, const int height)
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)

    const int p = y * width + x;
    const int p2 = (height - 1 - y) * width + x;

    // Swap channels and apply video levels
    d_dst[p2] = make_ushort4(
        (unsigned short)(d_src[p].w * F_USHRT_MAX),
        (unsigned short)fmaf(d_src[p].z, F_USHRT_MAX / Y_HEAD, Y_FOOT * F_USHRT_MAX / Y_HEAD),
        (unsigned short)fmaf(d_src[p].y, F_USHRT_MAX / CBCR_HEAD, CBCR_FOOT * F_USHRT_MAX / CBCR_HEAD),
        (unsigned short)fmaf(d_src[p].x, F_USHRT_MAX / CBCR_HEAD, CBCR_FOOT * F_USHRT_MAX / CBCR_HEAD)

CudaConversionService::CudaConversionService(const int width, const int height):
    threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE),
        ceil((float)width / threadsPerBlock.x), 
        ceil((float)height / threadsPerBlock.y))

    if (d_src)

    if (d_dst)

bool CudaConversionService::Init()
    // Define buffer sizes
    sizeSrc = width * height * sizeof(float4);
    sizeDst = width * height * sizeof(ushort4);

    // Reserve memory on CPU and GPU
    if (cudaMalloc((void**)&d_src, sizeSrc) == cudaSuccess)
        if (cudaMalloc((void**)&d_dst, sizeDst) == cudaSuccess)
            return true;

    return true;

bool CudaConversionService::ConvertToAVUY64(const float* src, char* dst)
    // Copy framebuffer to device
    if (cudaMemcpy(d_src, src, sizeSrc, cudaMemcpyHostToDevice) == cudaSuccess)
        ayuv64_kernel << < numBlocks, threadsPerBlock >> > (d_src, d_dst, width, height);

        if (cudaMemcpy(dst, d_dst, sizeDst, cudaMemcpyDeviceToHost) == cudaSuccess)
            return true;

    return false;

I guess to optimize it for for speed it is neccessary to process multiple frames at once (with one cudaMemCopy) to reduce overhead.

All my GPU are using PCIe x16 3.0 (tested it with GPU-Z).

About the value for “Code Generation”: Currently it is set to “compute_52,sm_52” which provides most backward compatibility. But I’d also like to use the advantages of more modern GPUs. How can I configure this in Visual Studio?

Standard practice for CUDA-accelerated applications is to build fat binaries that include machine code (SASS) for all GPU architectures one needs to support, plus PTX for the latest GPU architecture. I do not use the MSVS IDE so cannot tell you where to dial in the necessary information. When you use nvcc from the command line, you would use the -gencode switch.

1 Like