NVIDIA Docker decoder and CUDA function performance issue on multiple cards

Dear Guys,

We have hit a decoder and CUDA kernel performance issue with nvidia-docker on multiple cards. We tested with two Tesla T4 cards in one server. At first we ran nvidia-docker on CentOS 7 and found that when the 2 cards work at the same time, the performance drops almost to half. We then used VMware to create 2 virtual machines on the same server with GPU passthrough, each VM bound to one card, and in that setup each card reached its best performance. The test cases are described below:

The program uses the NVIDIA Video_Codec_SDK_11.1.5 to decode, and then our own CUDA code converts NV12 to BGR:
bool CudaSyncData::DevDataSyncDev(void *dst, int dst_width, int dst_height, void *src, int src_width, int src_height)
{
    cuCtxPushCurrent(context);
    void *bgr_ptr = nullptr;
    if (src_width == dst_width && src_height == dst_height)
    {
        bgr_ptr = dst;
    }
    else
    {
        AI_Log("CudaSyncData", KLERROR, "width or height conversion is not supported! Please input the same width and height");
        cuCtxPopCurrent(NULL); // pop the pushed context before the early return
        return false;
    }
    Nv12ToBGR(bgr_ptr, src, src_width, src_height);

    cuCtxPopCurrent(NULL);
    return true;
}
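For completeness: the snippet above does not include how the context member of CudaSyncData is created. One common way to set it up per card with the CUDA driver API looks roughly like the sketch below; the function name and the ordinal handling are only illustrative, not our exact code.

#include <cuda.h>

// Illustrative only: retain the primary context of one device ordinal.
// Inside a container started with NVIDIA_VISIBLE_DEVICES=0, ordinal 0 is
// the single card that the container can see.
CUcontext CreateContextForCard(int device_ordinal)
{
    CUdevice device = 0;
    CUcontext ctx = nullptr;
    cuInit(0);                              // driver API init, once per process
    cuDeviceGet(&device, device_ordinal);
    cuDevicePrimaryCtxRetain(&ctx, device); // shared with the runtime API on the same device
    return ctx;                             // would be stored as CudaSyncData::context
}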

cuframe_convert.cu
#include <cuda_runtime_api.h>
#include "cuframe_convert.h"

const auto THREADS_PER_BLOCK_1D = 16u;
const auto CUDA_NUM_THREADS = 512u;

unsigned int GetNumberCudaBlocks(const unsigned int totalRequired,
const unsigned int numberCudaThreads = CUDA_NUM_THREADS)
{
return (totalRequired + numberCudaThreads - 1) / numberCudaThreads;
}

__global__ void YCrCb2RBG(unsigned char* pYdata, unsigned char* pUVdata, int stepY, int stepUV, unsigned char* pImgData, int width, int height, int channels)
{
const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
const int tidy = blockIdx.y * blockDim.y + threadIdx.y;

if(tidx < width && tidy < height)
{
	int indexY,indexU,indexV;
	unsigned char Y,U,V;
	int R,G,B;
	indexY = tidy * stepY + tidx;
	Y = pYdata[indexY];
	
	if(tidx % 2 == 0){
		indexU = tidy / 2 * stepUV + tidx;
		indexV = tidy / 2 * stepUV + tidx + 1;
		U = pUVdata[indexU];
		V = pUVdata[indexV];
	}else{
		indexV = tidy / 2 * stepUV + tidx;
		indexU = tidy / 2 * stepUV + tidx - 1;
		U = pUVdata[indexU];
		V = pUVdata[indexV];
	}

	R = 1.164 * (Y - 16) + 1.596 * (V - 128);
	G = 1.164 * (Y - 16) - 0.183 * (U-128) - 0.392 * ( V - 128);
	B = 1.164 * (Y - 16) + 2.017 * (U-128);

	// R = Y ;
	// G = Y ;
	// B = Y ;
	

	if(R > 255)
		R = 255;
	if(R < 0)
		R = 0;
	if(G > 255)
		G = 255;
	if(G < 0)
		G = 0;
	if(B > 255)
		B = 255;
	if(B < 0)
		B = 0;
	
	pImgData[(tidy*width + tidx) * channels + 2] = (unsigned char)R;
	pImgData[(tidy*width + tidx) * channels + 1] = (unsigned char)G;
	pImgData[(tidy*width + tidx) * channels + 0] = (unsigned char)B;

}

}

void Nv12ToBGR(uint8_t* bgr, uint8_t* nv12, int width, int height)
{

int channels = 3;
int yu_offset = width * height;  // offset of the interleaved UV plane in the NV12 buffer
int sourceWidth = width;
int sourceHeight = height;

const dim3 dimBlock(THREADS_PER_BLOCK_1D, THREADS_PER_BLOCK_1D, 1);
const dim3 dimGrid{GetNumberCudaBlocks(sourceWidth, dimBlock.x), GetNumberCudaBlocks(sourceHeight, dimBlock.y), 1};

YCrCb2RBG<<<dimGrid, dimBlock, 0>>>(nv12, nv12 + yu_offset, sourceWidth, sourceWidth, bgr, sourceWidth, sourceHeight, channels);

}
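For reference, this is roughly how Nv12ToBGR can be driven and timed on a single card. The frame size, buffer setup, cudaMemset and event timing below are added only for illustration; they are not part of our real pipeline, where the NV12 pointer comes from the Video_Codec_SDK decoder.

#include <cuda_runtime_api.h>
#include <cstdint>
#include <cstdio>

void Nv12ToBGR(uint8_t* bgr, uint8_t* nv12, int width, int height);

int main()
{
    const int width = 1920, height = 1080;
    uint8_t *nv12 = nullptr, *bgr = nullptr;

    // NV12 is 1.5 bytes per pixel, packed BGR is 3 bytes per pixel.
    cudaMalloc((void**)&nv12, (size_t)width * height * 3 / 2);
    cudaMalloc((void**)&bgr,  (size_t)width * height * 3);
    cudaMemset(nv12, 0, (size_t)width * height * 3 / 2);   // dummy frame, stands in for decoder output

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    Nv12ToBGR(bgr, nv12, width, height);                    // launches YCrCb2RBG on the default stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Nv12ToBGR kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(nv12);
    cudaFree(bgr);
    return 0;
}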

  1. We ran one docker container on one GPU card and fed it 30 channels of RTSP video. The performance was good: we got valid pictures from the program and the FPS stayed above 25.

  2. Then we ran one docker container with 2 GPU cards and fed it 60 channels of RTSP video. The result dropped to about half.

  3. After that we ran two docker containers on the 2 GPU cards, each container bound to one card, and fed in 60 channels of RTSP video, making sure each container handled 30 channels. The result was the same as with one docker container on 2 cards.

  4. So we suspected the containers could not isolate the GPU cards well and switched to VMware. We created 2 virtual machines and passed one card through to each machine, then ran one container in each VM. The performance was good: each card behaved the same as in the first case. (A small check for which physical card each container sees is sketched right after this list.)
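To double check which physical card each container really sees, a small query like the one below can be run inside every container. This check is only a suggestion for reproducing the issue; it is not part of our program.

#include <cuda_runtime_api.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("visible devices in this container: %d\n", count);

    for (int i = 0; i < count; ++i) {
        char bus_id[64] = {0};
        cudaDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), i);   // physical PCI address of ordinal i
        printf("device %d -> PCI bus id %s\n", i, bus_id);
    }
    return 0;
}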

As a result, we do not know what is wrong with nvidia-docker in this setup.

The environment:
OS: CentOS 7
NVIDIA driver: 460.73.1
nvidia-docker: 2.11.0
docker: 20.10.21
nvidia-container-runtime: 1.11.0

docker run command:
docker run --runtime=nvidia -p 25000:8087 --name=$CONTAINER_NAME -d --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  -e NVIDIA_DRIVER_CAPABILITIES=video,compute,utility \
  -e EurekaClientEnable=false \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e EurekaInstancePort=8087 \
  -e EurekaInstanceHeartRate=30 \
  -v /etc/localtime:/etc/localtime:ro \
  -v /data/syl/serial/:/root/koala/osmagic/serial \
  $IMAGE_NAME
docker logs -f $CONTAINER_NAME