NVIDIA Docker decoder and CUDA function performance issue on multiple cards

Dear Guys,

We have hit a decoder and CUDA kernel performance issue with nvidia-docker on multiple cards. We tested with two Tesla T4 cards in one server. At first we ran nvidia-docker on CentOS 7 and found that when the 2 cards work at the same time, the performance drops almost to half. We then used VMware to create 2 virtual machines on the same server with GPU passthrough, each VM bound to one card, and in that setup each card reached its best performance. The test cases are described below:

The program uses the NVIDIA Video_Codec_SDK_11.1.5 to decode, and then our own CUDA code converts NV12 to BGR:
bool CudaSyncData::DevDataSyncDev(void *dst, int dst_width, int dst_height, void *src, int src_width, int src_height)
{
    cuCtxPushCurrent(context);
    void *bgr_ptr = nullptr;
    if (src_width == dst_width && src_height == dst_height)
    {
        bgr_ptr = dst;
    }
    else
    {
        AI_Log("CudaSyncData", KLERROR, "width or height conversion is not supported! Please input the same width and height");
        cuCtxPopCurrent(NULL); // pop the pushed context before the early return
        return false;
    }
    Nv12ToBGR(bgr_ptr, src, src_width, src_height);

    cuCtxPopCurrent(NULL);
    return true;
}
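For completeness: the snippet above does not include how the context member of CudaSyncData is created. One common way to set it up per card with the CUDA driver API looks roughly like the sketch below; the function name and the ordinal handling are only illustrative, not our exact code.

#include <cuda.h>

// Illustrative only: retain the primary context of one device ordinal.
// Inside a container started with NVIDIA_VISIBLE_DEVICES=0, ordinal 0 is
// the single card that the container can see.
CUcontext CreateContextForCard(int device_ordinal)
{
    CUdevice device = 0;
    CUcontext ctx = nullptr;
    cuInit(0);                              // driver API init, once per process
    cuDeviceGet(&device, device_ordinal);
    cuDevicePrimaryCtxRetain(&ctx, device); // shared with the runtime API on the same device
    return ctx;                             // would be stored as CudaSyncData::context
}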

cuframe_convert.cu
#include <cuda_runtime_api.h>
#include "cuframe_convert.h"

const auto THREADS_PER_BLOCK_1D = 16u;
const auto CUDA_NUM_THREADS = 512u;

unsigned int GetNumberCudaBlocks(const unsigned int totalRequired,
const unsigned int numberCudaThreads = CUDA_NUM_THREADS)
{
return (totalRequired + numberCudaThreads - 1) / numberCudaThreads;
}

__global__ void YCrCb2RBG(unsigned char* pYdata, unsigned char* pUVdata, int stepY, int stepUV, unsigned char* pImgData, int width, int height, int channels)
{
const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
const int tidy = blockIdx.y * blockDim.y + threadIdx.y;

if(tidx < width && tidy < height)
{
	int indexY,indexU,indexV;
	unsigned char Y,U,V;
	int R,G,B;
	indexY = tidy * stepY + tidx;
	Y = pYdata[indexY];
	
	if(tidx % 2 == 0){
		indexU = tidy / 2 * stepUV + tidx;
		indexV = tidy / 2 * stepUV + tidx + 1;
		U = pUVdata[indexU];
		V = pUVdata[indexV];
	}else{
		indexV = tidy / 2 * stepUV + tidx;
		indexU = tidy / 2 * stepUV + tidx - 1;
		U = pUVdata[indexU];
		V = pUVdata[indexV];
	}

	R = 1.164 * (Y - 16) + 1.596 * (V - 128);
	G = 1.164 * (Y - 16) - 0.183 * (U-128) - 0.392 * ( V - 128);
	B = 1.164 * (Y - 16) + 2.017 * (U-128);

	// R = Y ;
	// G = Y ;
	// B = Y ;
	

	if(R > 255)
		R = 255;
	if(R < 0)
		R = 0;
	if(G > 255)
		G = 255;
	if(G < 0)
		G = 0;
	if(B > 255)
		B = 255;
	if(B < 0)
		B = 0;
	
	pImgData[(tidy*width + tidx) * channels + 2] = (unsigned char)R;
	pImgData[(tidy*width + tidx) * channels + 1] = (unsigned char)G;
	pImgData[(tidy*width + tidx) * channels + 0] = (unsigned char)B;

}

}

void Nv12ToBGR(uint8_t* bgr, uint8_t* nv12, int width, int height)
{

int channels = 3;
int yu_offset = width * height;  // offset of the interleaved UV plane in the NV12 buffer
int sourceWidth = width;
int sourceHeight = height;

const dim3 dimBlock(THREADS_PER_BLOCK_1D, THREADS_PER_BLOCK_1D, 1);
const dim3 dimGrid{GetNumberCudaBlocks(sourceWidth, dimBlock.x), GetNumberCudaBlocks(sourceHeight, dimBlock.y), 1};

YCrCb2RBG<<<dimGrid, dimBlock, 0>>>(nv12, nv12 + yu_offset, sourceWidth, sourceWidth, bgr, sourceWidth, sourceHeight, channels);

}
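For reference, this is roughly how Nv12ToBGR can be driven and timed on a single card. The frame size, buffer setup, cudaMemset and event timing below are added only for illustration; they are not part of our real pipeline, where the NV12 pointer comes from the Video_Codec_SDK decoder.

#include <cuda_runtime_api.h>
#include <cstdint>
#include <cstdio>

void Nv12ToBGR(uint8_t* bgr, uint8_t* nv12, int width, int height);

int main()
{
    const int width = 1920, height = 1080;
    uint8_t *nv12 = nullptr, *bgr = nullptr;

    // NV12 is 1.5 bytes per pixel, packed BGR is 3 bytes per pixel.
    cudaMalloc((void**)&nv12, (size_t)width * height * 3 / 2);
    cudaMalloc((void**)&bgr,  (size_t)width * height * 3);
    cudaMemset(nv12, 0, (size_t)width * height * 3 / 2);   // dummy frame, stands in for decoder output

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    Nv12ToBGR(bgr, nv12, width, height);                    // launches YCrCb2RBG on the default stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Nv12ToBGR kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(nv12);
    cudaFree(bgr);
    return 0;
}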

  1. We ran one docker container on one GPU card and fed it 30 channels of RTSP video. The performance was good: we got valid pictures from the program and the FPS stayed above 25.

  2. Then we ran one docker container with 2 GPU cards and fed it 60 channels of RTSP video. The result dropped to about half.

  3. After that we ran two docker containers on the 2 GPU cards, each container bound to one card, and fed in 60 channels of RTSP video, making sure each container handled 30 channels. The result was the same as with one docker container on 2 cards.

  4. So we suspected the containers could not isolate the GPU cards well and switched to VMware. We created 2 virtual machines and passed one card through to each machine, then ran one container in each VM. The performance was good: each card behaved the same as in the first case. (A small check for which physical card each container sees is sketched right after this list.)
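To double check which physical card each container really sees, a small query like the one below can be run inside every container. This check is only a suggestion for reproducing the issue; it is not part of our program.

#include <cuda_runtime_api.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("visible devices in this container: %d\n", count);

    for (int i = 0; i < count; ++i) {
        char bus_id[64] = {0};
        cudaDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), i);   // physical PCI address of ordinal i
        printf("device %d -> PCI bus id %s\n", i, bus_id);
    }
    return 0;
}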

As a result, we do not know what is wrong with nvidia-docker in this setup.

The environment:
OS: CentOS 7
NVIDIA driver: 460.73.1
nvidia-docker: 2.11.0
docker: 20.10.21
nvidia-container-runtime: 1.11.0

docker run command:
docker run --runtime=nvidia -p 25000:8087 --name=$CONTAINER_NAME -d --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  -e NVIDIA_DRIVER_CAPABILITIES=video,compute,utility \
  -e EurekaClientEnable=false \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e EurekaInstancePort=8087 \
  -e EurekaInstanceHeartRate=30 \
  -v /etc/localtime:/etc/localtime:ro \
  -v /data/syl/serial/:/root/koala/osmagic/serial \
  $IMAGE_NAME
docker logs -f $CONTAINER_NAME