Hi:
We’ve developed a video product. Four input signals are connected to the TX2, and each one is in YUV 4:2:2 8-bit packed UYVY format. I use CUDA to do the colorspace conversion. However, CPU access to memory allocated with cudaMallocHost is very slow, so I can’t fetch the result in time and the frame rate can’t reach 60 FPS.
So I tested the performance of both kinds of memory with the code below.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Baseline: CPU memcpy between two ordinary heap buffers.
void stdm_test(size_t cx, size_t cy, int times)
{
    const size_t sz = cx * cy * 4;  // 4 bytes per pixel
    char* src = new char[sz];
    char* dst = new char[sz];
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    // Throughput in MiB per second (printed as "M/s").
    printf("stdm takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);
    delete[] src;
    delete[] dst;
}
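// Same CPU memcpy, but between two buffers allocated with cudaHostAlloc.
void cuda_test(size_t cx, size_t cy, int times)
{
    char* src = nullptr;
    char* dst = nullptr;
    const size_t sz = cx * cy * 4;  // 4 bytes per pixel
    if (cudaHostAlloc((void**)&src, sz, cudaHostAllocMapped) != cudaSuccess) {
        printf("cudaHostAlloc(src) failed\n");
        return;
    }
    if (cudaHostAlloc((void**)&dst, sz, cudaHostAllocMapped) != cudaSuccess) {
        printf("cudaHostAlloc(dst) failed\n");
        cudaFreeHost(src);
        return;
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    // Throughput in MiB per second (printed as "M/s").
    printf("cuda takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);
    cudaFreeHost(src);
    cudaFreeHost(dst);
}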
On the TX2, the results are:
stdm: 3783 M/s
cuda: 582 M/s
I ran the same test on my PC with a Quadro P620, and the results are:
stdm: 4556 M/s
cuda: 4601 M/s
Why is the CUDA-allocated memory so much slower on the TX2? Could you give me some suggestions to improve my program? Let me briefly describe how it works.
I create a thread that polls the video devices (/dev/video0~N). When a frame is ready, the thread copies the frame data into a buffer allocated with cudaMallocHost, then launches the CUDA kernel to do the colorspace conversion (to RGBA). Finally, the thread broadcasts the result to its subscribers.
I created a sample subscriber for our customers to demonstrate how to use our product. When the subscriber receives a frame, it clones the frame and shares the clone with the renderer thread. The clone is just a single cudaMemcpy call, but it takes more than 16 milliseconds, which is too slow to sustain 60 FPS.
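To make the colorspace step concrete, here is a rough CPU-side sketch of the UYVY to RGBA mapping. It is not my actual CUDA kernel, only an illustration of the per-pixel math. The BT.601 limited-range coefficients, the fixed alpha of 255, and the names `uyvy_to_rgba` and `to_byte` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Clamp to [0, 255] and round to the nearest integer.
static std::uint8_t to_byte(float v)
{
    return static_cast<std::uint8_t>(std::lround(std::min(255.0f, std::max(0.0f, v))));
}

// Convert packed UYVY (bytes U0 Y0 V0 Y1 per pixel pair) to RGBA.
// Assumption: BT.601 limited-range coefficients; alpha is fixed at 255.
void uyvy_to_rgba(const std::uint8_t* src, std::uint8_t* dst,
                  std::size_t width, std::size_t height)
{
    assert(width % 2 == 0);  // UYVY stores pixels in pairs
    const std::size_t pairs = width * height / 2;
    for (std::size_t i = 0; i < pairs; ++i) {
        // Both pixels of a pair share the same chroma samples.
        const float u = src[4 * i + 0] - 128.0f;
        const float v = src[4 * i + 2] - 128.0f;
        const std::uint8_t ys[2] = { src[4 * i + 1], src[4 * i + 3] };
        for (int k = 0; k < 2; ++k) {
            const float y = 1.164f * (ys[k] - 16.0f);
            std::uint8_t* px = dst + 8 * i + 4 * k;
            px[0] = to_byte(y + 1.596f * v);               // R
            px[1] = to_byte(y - 0.392f * u - 0.813f * v);  // G
            px[2] = to_byte(y + 2.017f * u);               // B
            px[3] = 255;                                   // A
        }
    }
}
```

Each 4-byte UYVY pair expands to 8 bytes of RGBA, so the output buffer is twice the size of the input frame.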
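As a rough sanity check, the measured bandwidth can be compared against the per-frame time budget. The sketch below is only back-of-the-envelope arithmetic: the helper names `copy_ms` and `frame_budget_ms` are made up for this example, and it assumes that the clone runs at roughly the memcpy bandwidth measured above, which may not hold for cudaMemcpy.

```cpp
#include <cassert>
#include <cstddef>

// Time in milliseconds to copy one cx*cy frame at 4 bytes per pixel,
// given a bandwidth in MiB/s (the same unit the benchmark prints as "M/s").
double copy_ms(std::size_t cx, std::size_t cy, double mib_per_s)
{
    assert(mib_per_s > 0.0);
    const double mib = static_cast<double>(cx * cy * 4) / (1024.0 * 1024.0);
    return mib / mib_per_s * 1000.0;
}

// Per-frame time budget in milliseconds at a given frame rate.
double frame_budget_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

For a hypothetical 1920x1080 RGBA frame (the resolution is an assumption, chosen only for illustration), `copy_ms(1920, 1080, 582.0)` comes to about 13.6 ms, while `frame_budget_ms(60.0)` is about 16.7 ms. At 3783 M/s the same copy would take about 2.1 ms.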