The performance of nvhost-vic

Hi,

We’ve developed a video capture device. The format of input video is YUV422 8bits 1920x1080@60Hz, and we use CUDA to implement colorspace conversation. When we process 2 input signals simultaneously, it works well, the FPS is 60. But it will drop to 40 for 3 input signals, 30 for 4 input signals. Our application will do a lot of rendering works also, so we use nvhost-vic to complete colorspace conversation(UYVY->ABGR32) to save GPU. The FPS does not improve.

So, who constraints the performance of nvhost-vic? How can I improve the FPS for multiple input signals ?

1 Like

Hi,
For information, please share your release version( $ head -1 /etc/nv_tegra_release ).

#R32 (release) , REVISION: 1.0, GCID: 14531094, BOARD: t186ref, EABI:aarch64, DATE: Wed Mar 13 07:41:08 UTC 2019

We use CUDA to complete colorspace conversation before. AastaLLL suggests using unified memory, it’s a good idea, we can process 2 input signals in time. But we should save GPU for our customers, so we move our implementation to nvhost-vic. Unfortunately, 2 input signals is a challenge already, not to mention 3, 4.

So, we wonder some solution to improve the performance of colorspace conversation. I noticed that the performance would dropped a little as the temperature increased. We install 8 TX2i to a closed box(about 400x150x200), should we adopt some special thermal solution?

Hi,
Please share the pipeline for reference. Do you use gstreamer or jetson_multimedia_api?

I use jetson_multimedia_api. The pipeline is like below:

/dev/videoN → [v4l2 dqbuffer] ->[copy data to UYVY buffer]->[output_plane dqBuffer, copy UYVY to NvBuffer, qBuffer]->[v4l2 qbuffer].

I count the FPS in the callback function.

Hi,
Looks like you can refer to 12_camera_v4l2_cuda. It captures YUV422 frames through v4l2 and call NvBufferTransform() to convert to YUV420. You can replace YUV420 with RGBA for your usecase.

Hi,
Thanks very much. I’ll try it. Will update here if any progress.

Hi,

I have tried 3 methods to implement colorspace conversation.

  • with cuda
  • with nvhost-vic
  • with nvtransform

Unfortunately, none of them could process 4 input signals (1280x720) simultaneously in time. Here is a brief report of them.

  • cuda
    I used unified memroy to store the UYVY and the result(ABGR32).
Input signal Output format FPS
1280x720 UYVY 1280x720 ABGR32 50+
1280x720 UYVY 720x576 ABGR32 60
1920x1080 UYVY 1920x1080 ABGR32 30+
1920x1080 UYVY 720x576 ABR32 50+
  • hostvc
Input signal Output format FPS
1280x720 UYVY 1280x720 ABGR32 30+
1280x720 UYVY 720x576 ABGR32 50+
1920x1080 UYVY 1920x1080 ABGR32 15+, mostly
1920x1080 UYVY 720x576 ABR32 30+
hostvc is not that good. Because the resolution of the input may be changed, so I create/destroy the NvVideoConverter dynamically, sometimes it will result in deadlock, even if I set the flag to O_NONBLOCK.
  • nvtransform
Input signal Output format FPS
1280x720 UYVY 1280x720 ABGR32 30+
1280x720 UYVY 720x576 ABGR32 60
1920x1080 UYVY 1920x1080 ABGR32 15+, mostly
1920x1080 UYVY 720x576 ABR32 40+

Here is my code.

  • cuda version
///
/// (C) 2018-2020 CHIV
///
#include "BoogooSource.h"
#include "BoogooInput.h"
#include "domp/Mutex.h"
#include "cuda/cudaYUV.h"
#include <npp.h>
// STL
#include <string.h>
#include <atomic>
#include <list>
#include <string>

namespace input {

class BoogooSource::Resource {
public:
    Resource()
    {
        mState = 0;
        mGpuBuffer = nullptr;
        mGpuBufferSize = 0;
        mUseClipper = false;
        // create cuda stream
        int device_id;
        cudaGetDevice(&device_id);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device_id);
        if (!prop.deviceOverlap) {
            printf("\033[31mWarning:cuda-device does not support overlap\033[0m\n");
        }
        cudaStreamCreate(&mCudaStream);
    }
    ~Resource()
    {
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            delete *it;
        }
        if (mGpuBuffer) { cudaFree(mGpuBuffer); }
        cudaStreamDestroy(mCudaStream);
    }

    std::atomic<int> mState;
    //
    std::list<FrameSubscriber*> mSubs;
    //
    domp::Mutex mLock;
    //
    FrameSubscriber* AddSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            if (*it == sub) return sub;
        }
        mSubs.push_back(sub);
        //
        return sub;
    }
    FrameSubscriber* TakeSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        FrameSubscriber* ret = nullptr;
        for (auto it = mSubs.begin(); it != mSubs.end();) {
            if (*it == sub) {
                ret = *it;
                it = mSubs.erase(it);
                break;
            } else ++it;
        }
        return ret;
    }
    //
    void Broadcast(const boogoo::RGBA8Frame& frame, int error = FrameSubscriber::eOk)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            (*it)->Handle(frame, error);
        }
    }
    //
    template <typename T>
    inline bool CheckAndAlloc(T*& gpu_buffer, int& buffer_size, const int expsz)
    {
        if (expsz > buffer_size) {
            // need to reallocate
            if (gpu_buffer) {
                cudaFree(gpu_buffer);
                gpu_buffer = nullptr;
            }
            auto er = cudaMalloc((void**)&gpu_buffer, expsz);
            if (er != cudaSuccess) {
                return false;
            }
            buffer_size = expsz;
        }
        return true;
    }
    // The input format must be YUV 422 8bits with UYVY order.
    void Update(unsigned char* frame, int cx, int cy)
    {
        // check, if we should allocate memory, do it.
	    if (!CheckAndAlloc(mGpuBuffer, mGpuBufferSize, 4*cx*cy)) {
            Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eCUDAError);
            return;
        }
        // clip.
        if (mUseClipper && cx != 720 && cy != 576) {
            if (mPal.Cx() != 720 || mPal.Cy() != 576) {
                mPal = boogoo::RGBA8Frame(720, 576);
            }
            double affine[2][3] = {
                {((double)mPal.Cx())/1160, 0, -40},
                {0, ((double)mPal.Cy())/645,  -39}
            };
            //cudaStreamAttachMemAsync(mCudaStream, frame, 0, cudaMemAttachGlobal);
            //cudaStreamAttachMemAsync(mCudaStream, mPal.Data(), 0, cudaMemAttachGlobal);
            cudaUYVYToRGBA((uchar2*)frame, (uchar4*)mGpuBuffer, cx, cy);
            nppiWarpAffine_8u_C4R((Npp8u*)mGpuBuffer, NppiSize{cx, cy}, cx*4, NppiRect{64, 44, 1160, 645},
                (Npp8u*)mPal.Data(), mPal.Cx()*4, NppiRect{0,0,mPal.Cx(), mPal.Cy()}, 
               affine, NPPI_INTER_LINEAR);
            //cudaStreamAttachMemAsync(mCudaStream, frame, 0, cudaMemAttachHost);
            //cudaStreamAttachMemAsync(mCudaStream, mPal.Data(), 0, cudaMemAttachHost);
            //cudaStreamSynchronize(mCudaStream);
            Broadcast(mPal, FrameSubscriber::eOk);
        } else {
            if (mFrame.Cx() != cx || mFrame.Cy() != cy) {
                mFrame = boogoo::RGBA8Frame(cx, cy);
            }
            cudaUYVYToRGBA((uchar2*)frame, (uchar4*)mFrame.Data(), cx, cy);
            cudaStreamSynchronize(mCudaStream);
            Broadcast(mFrame, FrameSubscriber::eOk);
        }
    }
    //
    bool mUseClipper;
    unsigned char* mGpuBuffer;
    int mGpuBufferSize;
    boogoo::RGBA8Frame mFrame;
    boogoo::RGBA8Frame mPal; // 720x576
    cudaStream_t mCudaStream;
};

BoogooSource::BoogooSource(const char* devname)
    : mFrame(0)
    , mFrameSize(0)
{
    mResource = new Resource();
    mInput = new BoogooInput(devname);
}
BoogooSource::~BoogooSource()
{
    // stop thread.
    Stop();
    // delete frame buffer
    if (mFrame) { cudaFree(mFrame); }
    //
    delete mInput;
    delete mResource;
}

FrameSubscriber* BoogooSource::AddSubscriber(FrameSubscriber* sub)
{
    return mResource->AddSubscriber(sub);
}
FrameSubscriber* BoogooSource::TakeSubscriber(FrameSubscriber* sub)
{
    return mResource->TakeSubscriber(sub);
}

void BoogooSource::SetUseClipper(bool b)
{
    mResource->mUseClipper = b;
}

bool BoogooSource::GetUseClipper() const
{
    return mResource->mUseClipper;
}

void BoogooSource::SetResolution(int cx, int cy)
{
    if (mResource->mState) {
        mInput->SetResolution(cx, cy);
    }
}

void BoogooSource::GetResolution(int &cx, int &cy) const
{
    if (mResource->mState) {
        mInput->GetResolution(cx, cy);
    }
    else {
        cx = 1920;
        cy = 1080;
    }
}

bool BoogooSource::Grab(void* buffer, int sz)
{
    if (mResource->mState) {
        memcpy(buffer, mFrame, sz > mFrameSize ? mFrameSize : sz);
        return true;
    }
    else {
        memset(buffer, 0, sz);
        return false;
    }
}

bool BoogooSource::Start()
{
    if (mInput->Init()) {
        if (mInput->Start()) {
            return SimpleThread::Start(); // start the thread.
        }
    }
    return false;
}
#define ARRAYSZ(a)  (sizeof(a)/sizeof(a[0]))
int BoogooSource::Run()
{
    static const struct _fmtlist {
        int cx, cy;
    } fmtlist[] = {
        {1920, 1080},
        {1280, 720},
        {720, 576}
    };
    const int kTimeout = 100;
    int sfmt = 0; // always stands for the next fmt to try
    while (!ShouldExit()) {
        if (mFrameSize < mInput->GetFrameSize()) {
            if (mFrame) {
                cudaFree(mFrame);
            }
            if (cudaMallocManaged((void**)&mFrame, mInput->GetFrameSize(), cudaMemAttachHost) != cudaSuccess) {
                mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eCUDAError);
                break;
            } else {
                mFrameSize = mInput->GetFrameSize();
            }
        }
        if (!mInput->IsFrameAvailable(kTimeout)) {
	        // if failed, try another resolution.
            mResource->mState = 0;
            mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
            if (!mInput->Reset()) {
                mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                break;
            }
            sfmt = (sfmt+1) % ARRAYSZ(fmtlist); // prepare next fmt
        }
        else {
            int ret = mInput->Retrieve(mFrame, mFrameSize, 0);
            if (ret < 0) {
                mResource->mState = 0; 
                mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
                if (!mInput->Reset()) {
                    mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                    break;
                }
                sfmt = (sfmt+1) % ARRAYSZ(fmtlist); // prepare next fmt
                mResource->mState = 0; // update state
            }
            else {
                mResource->mState = 1;
		        int cx, cy;
		        mInput->GetResolution(cx, cy);
                // update
                mResource->Update(mFrame, cx, cy);
            }
        }
    }
    mResource->mState = 0;

    return 0;
}

}
  • nvhost version
///
/// (C) 2020 CHIV
///
#include "BoogooInput.h"
#include "BoogooSource.h"
#include "domp/Mutex.h"
#include "hostvc/NvVideoConverter.h"
#include "hostvc/nvbuf_utils.h"
#include <fcntl.h>
#include <uuid/uuid.h>
// STL
#include <atomic>
#include <list>
#include <string.h>
#include <string>

namespace input {

class BoogooSource::Resource {
public:
    Resource()
    {
        mState = 0;
        mUseClipper = false;
        mVc = nullptr;
    }
    ~Resource()
    {
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            delete *it;
        }
        if (mVc) {
            mVc->capture_plane.setStreamStatus(false);
            mVc->output_plane.setStreamStatus(false);
            mVc->capture_plane.stopDQThread();
            mVc->output_plane.stopDQThread();
            delete mVc;
        }
        delete mVc;
    }
    //
    std::atomic<int> mState;
    /// subscribers
    std::list<FrameSubscriber*> mSubs;
    /// protect \c mSubs
    domp::Mutex mLock;
    //
    FrameSubscriber* AddSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            if (*it == sub)
                return sub;
        }
        mSubs.push_back(sub);
        //
        return sub;
    }

    FrameSubscriber* TakeSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        FrameSubscriber* ret = nullptr;
        for (auto it = mSubs.begin(); it != mSubs.end();) {
            if (*it == sub) {
                ret = *it;
                it = mSubs.erase(it);
                break;
            } else
                ++it;
        }
        return ret;
    }
    //
    void Broadcast(const boogoo::RGBA8Frame& frame, int error = FrameSubscriber::eOk)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            (*it)->Handle(frame, error);
        }
    }
    //
    static inline void _NvBufferToFrame(NvBuffer* buffer, boogoo::RGBA8Frame& frame)
    {
        char* data = nullptr;
        size_t dptr = 0;
        //
        NvBuffer::NvBufferPlane& plane = buffer->planes[0];
        size_t linesz = plane.fmt.bytesperpixel * plane.fmt.width;
        data = (char*)plane.data;
        if (plane.fmt.stride == frame.BytesOfLine()) {
            memcpy(frame.Data(), data, frame.NumberOfBytes());
        } else {
            for (uint32_t j = 0; j < plane.fmt.height; j++) {
                memcpy(frame.Data() + dptr, data, linesz);
                dptr += linesz;
                data += plane.fmt.stride;
            }
        }
    }
    // Hint: dq thread
    static bool HostVcDQCallback(struct v4l2_buffer* v4l2_buf,
        NvBuffer* buffer, NvBuffer* shared_buffer,
        void* arg)
    {
        (void)shared_buffer;
        auto rc = static_cast<BoogooSource::Resource*>(arg);
        //
        if (!v4l2_buf) { return true; }
        if (v4l2_buf->m.planes[0].bytesused >= rc->mFrame.NumberOfBytes()) {
            _NvBufferToFrame(buffer, rc->mFrame);
            rc->Broadcast(rc->mFrame, FrameSubscriber::eOk);
        }
        if (rc->mVc->capture_plane.qBuffer(*v4l2_buf, buffer) < 0) {
            printf("qBuffer failed, %s:%d\n", __FILE__, __LINE__);
            abort();
            return false;
        }
        return true;
    }
    /// YUV422 8bits UYVY order
    bool CheckAndCreate(int cx, int cy)
    {
        bool rebuild = false;
        // 2 conditions result in rebuilding.
        if (mUseClipper && (mFrame.Cx() != 720 || mFrame.Cy() != 576)) {
            rebuild = true;
        } else if (!mUseClipper) {
            if (mFrame.Cx() != cx || mFrame.Cy() != cy) {
                rebuild = true;
            }
        }
        // rebuild the hostvc if needed.
        if (rebuild || !mVc) {
            printf("[boogoo] rebuild hostvc\n");
            if (mVc) { delete mVc; }
            if (mUseClipper) {
                if (mFrame.Cx() != 720 || mFrame.Cy() != 576) {
                    mFrame = boogoo::RGBA8Frame(720, 576);
                }
            } else {
                if (mFrame.Cx() != cx || mFrame.Cy() != cy) {
                    mFrame = boogoo::RGBA8Frame(cx, cy);
                }
            }
            //
            uuid_t uuid_; char uuid_str_[64];
            uuid_generate_random(uuid_);
            uuid_unparse(uuid_, uuid_str_);
            mVc = NvVideoConverter::createVideoConverter(uuid_str_, O_NONBLOCK);
            if (mVc->isInError()) {
                printf("can't create hostvc\n");
                delete mVc; mVc = nullptr;
                return false;
            }
            int ret = 0;
            ret = mVc->setOutputPlaneFormat(V4L2_PIX_FMT_UYVY, cx, cy, V4L2_NV_BUFFER_LAYOUT_PITCH);
            if (mUseClipper) {
                ret = mVc->setCropRect(64, 44, 1160, 645);
                ret = mVc->setCapturePlaneFormat(V4L2_PIX_FMT_ABGR32,
                    720, 576, V4L2_NV_BUFFER_LAYOUT_PITCH);
            } else {
                ret = mVc->setCapturePlaneFormat(V4L2_PIX_FMT_ABGR32,
                    cx, cy, V4L2_NV_BUFFER_LAYOUT_PITCH);
            }
            // init queue
            mVc->output_plane.setupPlane(V4L2_MEMORY_USERPTR, 2, false, true);
            mVc->output_plane.setStreamStatus(true);
            mVc->capture_plane.setupPlane(V4L2_MEMORY_USERPTR, 2, false, true);
            mVc->capture_plane.setStreamStatus(true);
            mVc->capture_plane.setDQThreadCallback(HostVcDQCallback);
            mVc->capture_plane.startDQThread(this);
            //
            for (uint32_t i = 0; i < mVc->capture_plane.getNumBuffers(); i++) {
                struct v4l2_buffer v4l2_buf;
                struct v4l2_plane planes[MAX_PLANES];
                memset(&v4l2_buf, 0, sizeof(v4l2_buf));
                memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

                v4l2_buf.index = i;
                v4l2_buf.m.planes = planes;
                int ret = mVc->capture_plane.qBuffer(v4l2_buf, NULL);
                if (ret < 0) {
                    printf("qBuffer failed, %s:%d\n", __FILE__, __LINE__);
                    abort();
                }
            }
            //
            for (uint32_t i = 0; i < mVc->output_plane.getNumBuffers(); i++) {
                struct v4l2_buffer v4l2_buf;
                struct v4l2_plane planes[MAX_PLANES];
                memset(&v4l2_buf, 0, sizeof(v4l2_buf));
                memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

                v4l2_buf.index = i;
                v4l2_buf.m.planes = planes;
                int ret = mVc->output_plane.qBuffer(v4l2_buf, NULL);
                if (ret < 0) {
                    printf("qBuffer failed, %s:%d\n", __FILE__, __LINE__);
                    abort();
                }
            }
        }
        return true;
    }
    // The input format must be YUV 422 8bits with UYVY order.
    int Update(BoogooInput* input, int cx, int cy)
    {
        int ret = -1;
        // Hint: grabber thread
        if (CheckAndCreate(cx, cy)) {
            struct v4l2_buffer v4l2_buf;
            struct v4l2_plane planes[MAX_PLANES];
            NvBuffer *buffer = nullptr;
            memset(&v4l2_buf, 0, sizeof(v4l2_buf));
            memset(planes, 0, sizeof(planes));
            v4l2_buf.m.planes = planes;
            uint32_t dptr = 0;
            if (mVc->output_plane.dqBuffer(v4l2_buf, &buffer, NULL, 1) == 0) {
                for (int i = 0; i < MAX_PLANES; ++i) {
                    buffer->planes[i].bytesused = 0;
                }
                // The input is UYVY absolutely, so we could memcpy, or
                // you should copy UYVY frame line by line.
                NvBuffer::NvBufferPlane& plane = buffer->planes[0];
#if 0
                uint32_t linesz = plane.fmt.bytesperpixel * plane.fmt.width;
                char* data = (char*)plane.data;
                for (uint32_t j = 0; j < plane.fmt.height; j++) {
                    memcpy(data, frame + dptr, linesz);
                    data += plane.fmt.stride;
                    dptr += linesz;
                }
#endif
                plane.bytesused = plane.fmt.stride * plane.fmt.height;
                ret = input->Retrieve(plane.data, (int)plane.bytesused, 0);
                //
                mVc->output_plane.qBuffer(v4l2_buf, NULL);
            } else {
                // if hostvc is busy, discard this frame
                ret = 0;
            }
        } else {
            Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
        }
        return ret;
    }
    //
    bool mUseClipper;
    boogoo::RGBA8Frame mFrame;
    NvVideoConverter* mVc;
};

BoogooSource::BoogooSource(const char* devname)
    : mFrame(0) // Not used in hostvc
    , mFrameSize(0)
{
    mResource = new Resource();
    mInput = new BoogooInput(devname);
}
BoogooSource::~BoogooSource()
{
    // stop thread.
    Stop();
    // delete input and source.
    delete mResource;
    delete mInput;
    // delete UYVY frame buffer
    if (mFrame) {
        delete[] mFrame;
    }
}

FrameSubscriber* BoogooSource::AddSubscriber(FrameSubscriber* sub)
{
    return mResource->AddSubscriber(sub);
}
FrameSubscriber* BoogooSource::TakeSubscriber(FrameSubscriber* sub)
{
    return mResource->TakeSubscriber(sub);
}

void BoogooSource::SetUseClipper(bool b)
{
    // Hint: main thread
    mResource->mUseClipper = b;
}

bool BoogooSource::GetUseClipper() const
{
    return mResource->mUseClipper;
}

void BoogooSource::SetResolution(int cx, int cy)
{
    if (mResource->mState) {
        mInput->SetResolution(cx, cy);
    }
}

void BoogooSource::GetResolution(int& cx, int& cy) const
{
    if (mResource->mState) {
        mInput->GetResolution(cx, cy);
    } else {
        cx = 1920;
        cy = 1080;
    }
}

bool BoogooSource::Grab(void* buffer, int sz)
{
    if (mResource->mState) {
        memcpy(buffer, mFrame, sz > mFrameSize ? mFrameSize : sz);
        return true;
    } else {
        memset(buffer, 0, sz);
        return false;
    }
}

bool BoogooSource::Start()
{
    if (mInput->Init()) {
        if (mInput->Start()) {
            return SimpleThread::Start(); // start the thread.
        }
    }
    return false;
}
#define ARRAYSZ(a) (sizeof(a) / sizeof(a[0]))
int BoogooSource::Run()
{
    static const struct _fmtlist {
        int cx, cy;
    } fmtlist[] = {
        { 1920, 1080 },
        { 1280, 720 },
        { 720, 576 }
    };
    const int kTimeout = 100;
    int sfmt = 0; // always stands for the next fmt to try
    while (!ShouldExit()) {
        if (!mInput->IsFrameAvailable(kTimeout)) {
            // if failed, try another resolution.
            mResource->mState = 0;
            mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
            if (!mInput->Reset()) {
                mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                break;
            }
            sfmt = (sfmt + 1) % ARRAYSZ(fmtlist); // prepare next fmt
        } else {
            int cx = 1280, cy = 720;
            mInput->GetResolution(cx, cy);
            int ret = mResource->Update(mInput, cx, cy);
            if (ret < 0) {
                mResource->mState = 0;
                mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
                if (!mInput->Reset()) {
                    mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                    break;
                }
                sfmt = (sfmt + 1) % ARRAYSZ(fmtlist); // prepare next fmt
                mResource->mState = 0; // update state
            } else {
                mResource->mState = 1;
            }
        }
    }
    mResource->mState = 0;

    return 0;
}
} // namespace input
  • nvtransform version
///
/// (C) 2020 CHIV
///
#include "BoogooInput.h"
#include "BoogooSource.h"
#include "domp/Mutex.h"
#include "hostvc/nvbuf_utils.h"
// STL
#include <atomic>
#include <list>
#include <string.h>
#include <string>

namespace input {

class BoogooSource::Resource {
public:
    Resource()
    {
        mState = 0;
        mUseClipper = false;
        mBufferABGR = -1;
        mBufferUYVY = -1;
        mBufferUYVYp = nullptr;
        mSession = NvBufferSessionCreate();
    }
    ~Resource()
    {
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            delete *it;
        }
        if (mBufferUYVY >= 0) { NvBufferDestroy(mBufferUYVY); }
        if (mBufferABGR >= 0) { NvBufferDestroy(mBufferABGR); }
        NvBufferSessionDestroy(mSession);
    }
    //
    std::atomic<int> mState;
    /// subscribers
    std::list<FrameSubscriber*> mSubs;
    /// protect \c mSubs
    domp::Mutex mLock;
    //
    FrameSubscriber* AddSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            if (*it == sub)
                return sub;
        }
        mSubs.push_back(sub);
        //
        return sub;
    }

    FrameSubscriber* TakeSubscriber(FrameSubscriber* sub)
    {
        domp::AutoLocker _(mLock);
        FrameSubscriber* ret = nullptr;
        for (auto it = mSubs.begin(); it != mSubs.end();) {
            if (*it == sub) {
                ret = *it;
                it = mSubs.erase(it);
                break;
            } else
                ++it;
        }
        return ret;
    }
    //
    void Broadcast(const boogoo::RGBA8Frame& frame, int error = FrameSubscriber::eOk)
    {
        domp::AutoLocker _(mLock);
        for (auto it = mSubs.begin(); it != mSubs.end(); ++it) {
            (*it)->Handle(frame, error);
        }
    }
    /// YUV422 8bits UYVY order
    bool CheckAndCreate(int cx, int cy)
    {
        bool rebuild = false;
        // 2 conditions result in rebuilding.
        if (mUseClipper && (mFrame.Cx() != 720 || mFrame.Cy() != 576)) {
            rebuild = true;
        } else if (!mUseClipper) {
            if (mFrame.Cx() != cx || mFrame.Cy() != cy) {
                rebuild = true;
            }
        }
        // rebuild the hostvc if needed.
        if (rebuild) {
            if (mBufferUYVY >= 0) { NvBufferMemUnMap(mBufferUYVY, 0, &mBufferUYVYp); NvBufferDestroy(mBufferUYVY); }
            if (mBufferABGR >= 0) { NvBufferDestroy(mBufferABGR); }
            int ret = 0;
            ret = NvBufferCreate(&mBufferUYVY, cx, cy, NvBufferLayout_Pitch, NvBufferColorFormat_UYVY);
            if (ret != 0) { mBufferUYVY = -1; return false; }
            ret = NvBufferCreate(&mBufferABGR, cx, cy, NvBufferLayout_Pitch, NvBufferColorFormat_ABGR32);
            if (ret != 0) { mBufferABGR = -1; return false; } // don't worry, the dtor will release dma buffer
            mFrame = mUseClipper ? boogoo::RGBA8Frame(720, 576) : boogoo::RGBA8Frame(cx, cy);
            //
            NvBufferMemMap(mBufferUYVY, 0, NvBufferMem_Write, &mBufferUYVYp);
        }
        return true;
    }
    // The input format must be YUV 422 8bits with UYVY order.
    int Update(BoogooInput* input, int cx, int cy)
    {
        int ret = -1;
        // Hint: grabber thread
        if (CheckAndCreate(cx, cy)) {
            // load to buffer.
            ret = input->Retrieve(mBufferUYVYp, 2*cx*cy, 0);
            // convert
            NvBufferTransformParams tps;
            tps.dst_rect = NvBufferRect{0, 0, (unsigned int)mFrame.Cx(), (unsigned int)mFrame.Cy()};
            tps.src_rect = mUseClipper ? NvBufferRect{44, 64, 1160, 645} : NvBufferRect{0, 0, (unsigned int)cx, (unsigned int)cy};
            tps.session = mSession;
            tps.transform_filter = NvBufferTransform_Filter_Bilinear;
            tps.transform_flag = NVBUFFER_TRANSFORM_CROP_SRC | NVBUFFER_TRANSFORM_CROP_DST; // crop src only
            tps.transform_flip = NvBufferTransform_None;
            ret = NvBufferTransform(mBufferUYVY, mBufferABGR, &tps);
            if (ret == 0) {
                NvBuffer2Raw(mBufferABGR, 0, mFrame.Cx(), mFrame.Cy(), mFrame.Data());
                Broadcast(mFrame);
            }
            //
        } else {
            Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
        }
        return ret;
    }
    //
    bool mUseClipper;
    boogoo::RGBA8Frame mFrame;
    int mBufferUYVY; // input
    int mBufferABGR; // output
    void* mBufferUYVYp;
    NvBufferSession mSession;
};

BoogooSource::BoogooSource(const char* devname)
    : mFrame(0) // Not used in hostvc
    , mFrameSize(0)
{
    mResource = new Resource();
    mInput = new BoogooInput(devname);
}
BoogooSource::~BoogooSource()
{
    // stop thread.
    Stop();
    // delete input and source.
    delete mResource;
    delete mInput;
    // delete UYVY frame buffer
    if (mFrame) { delete[] mFrame; }
}

FrameSubscriber* BoogooSource::AddSubscriber(FrameSubscriber* sub)
{
    return mResource->AddSubscriber(sub);
}
FrameSubscriber* BoogooSource::TakeSubscriber(FrameSubscriber* sub)
{
    return mResource->TakeSubscriber(sub);
}

void BoogooSource::SetUseClipper(bool b)
{
    // Hint: main thread
    mResource->mUseClipper = b;
}

bool BoogooSource::GetUseClipper() const
{
    return mResource->mUseClipper;
}

void BoogooSource::SetResolution(int cx, int cy)
{
    if (mResource->mState) {
        mInput->SetResolution(cx, cy);
    }
}

void BoogooSource::GetResolution(int& cx, int& cy) const
{
    if (mResource->mState) {
        mInput->GetResolution(cx, cy);
    } else {
        cx = 1920;
        cy = 1080;
    }
}

bool BoogooSource::Grab(void* buffer, int sz)
{
    if (mResource->mState) {
        memcpy(buffer, mFrame, sz > mFrameSize ? mFrameSize : sz);
        return true;
    } else {
        memset(buffer, 0, sz);
        return false;
    }
}

bool BoogooSource::Start()
{
    if (mInput->Init()) {
        if (mInput->Start()) {
            return SimpleThread::Start(); // start the thread.
        }
    }
    return false;
}
#define ARRAYSZ(a) (sizeof(a) / sizeof(a[0]))
int BoogooSource::Run()
{
    static const struct _fmtlist {
        int cx, cy;
    } fmtlist[] = {
        { 1920, 1080 },
        { 1280, 720 },
        { 720, 576 }
    };
    const int kTimeout = 100;
    int sfmt = 0; // always stands for the next fmt to try
    while (!ShouldExit()) {
        if (!mInput->IsFrameAvailable(kTimeout)) {
            // if failed, try another resolution.
            mResource->mState = 0;
            mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
            if (!mInput->Reset()) {
                mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                break;
            }
            sfmt = (sfmt + 1) % ARRAYSZ(fmtlist); // prepare next fmt
        } else {
            int cx = 1280, cy = 720;
            mInput->GetResolution(cx, cy);
            int ret = mResource->Update(mInput, cx, cy);
            if (ret < 0) {
                mResource->mState = 0;
                mInput->SetResolution(fmtlist[sfmt].cx, fmtlist[sfmt].cy);
                if (!mInput->Reset()) {
                    mResource->Broadcast(boogoo::RGBA8Frame(), FrameSubscriber::eSystemFault);
                    break;
                }
                sfmt = (sfmt + 1) % ARRAYSZ(fmtlist); // prepare next fmt
                mResource->mState = 0; // update state
            } else {
                mResource->mState = 1;
            }
        }
    }
    mResource->mState = 0;

    return 0;
}

} // namespace input

Hi,
Since NvBufferTransfrom() is a pure hardware operation, it should not be the bottleneck. The performance looks ot be capped by

ret = input->Retrieve(mBufferUYVYp, 2*cx*cy, 0);

or

NvBuffer2Raw(mBufferABGR, 0, mFrame.Cx(), mFrame.Cy(), mFrame.Data());

Please run sudo nvpmodel -m 0 and sudo jetson_clocks to get max performance. You can check the system loading through sudo tegrastats.

Hi,

Yes, you are right. NvBufferTransform is fast enough to complete our task without memory movement. But how should we transfer our UYVY buffer to it ? I tried Raw2NvBuffer, nothing improved.

And, I am confused heavily. After sudo nvpmodel -m 0 and sudo jetson_clocks, I got a lower FPS with cuda.

As contrast, I tested my video capturer. In the capture thread, I read the video data to a frame buffer(allocated by new operator), without colorspace conversation, the final FPS is 57 for 4 1920x1080 input signals.

Before developing this product, I tested the performance of cuda, it could handle about 2000 1920x1080 frames per second. As demo, I read 6 input signals, and transform it to ABGR32 with cuda, then move them to OpenGL context with cuda API, it worked well. Now, we move the result into hosted memory, because our customers want to process it themselves. They really really love BGR.

Thanks for your patience.

1 Like

I confirmed that. sudo nvpmodel -m 0 results in lower performance. It affected the NvBufferTransform, CUDA.

Hi,
Not sure why you see worse performance in mode 0. It it the mode with highest clocks in all hardware components. Suppose to bring max performance. All modes are listed in developer guide.

BGR format is not supported by hardware converter engine VIC, so it may not bring better performance by using VIC. It supports RGBA and you would still need to re-sample to BGR after the conversion. The best performance of this usecase should be using only CUDA.

Hi,
I wonder why too. But it’s the truth. I get highest performance at mode 1, 3, others result in worse performance. I guess that the limitation is I/O bandwidth. The largest input is 1920x1080x2x60x4 = 995,328,000 bytes/seconds, the output is 1920x1080x4x60x4=‭1,990,656,000‬ bytes/seconds. I tested the access speed of memory, it’s about 2~3 GB/s.

Hi,

We have seen the issue in MAXN mode and have certain change on r32.4.3. Please take a look at

1 Like