Import block linear NvSciBufObj with pitch to CUDA

AdamBalazsVay · March 13, 2024, 1:01pm

Software Version
DRIVE OS Linux 6.0.8.1 and DriveWorks 5.10

Hardware Platform
DRIVE AGX Orin

Hi,

We try to port our codebase from DriveOS 5.2.6 to DriveOS 6.0.8.1.

We want to access our block linear NvSciBufObj images from CUDA, so we import them as cudaMipmappedArray and cudaArray as it is shown in the following CUDA example:

github.com

NVIDIA/cuda-samples/blob/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/4_CUDA_Libraries/cudaNvSciNvMedia/cuda_consumer.cu#L140-L239


      
          static void cudaImportNvSciImage(cudaExternalResInterop &cudaExtResObj,

                                           NvSciBufObj &inputBufObj) {

            NvSciBufModule module = NULL;

            NvSciBufAttrList attrlist = NULL;

            NvSciBufAttrKeyValuePair pairArrayOut[10];

          

            checkNvSciErrors(NvSciBufModuleOpen(&module));

            checkNvSciErrors(NvSciBufAttrListCreate(module, &attrlist));

            checkNvSciErrors(NvSciBufObjGetAttrList(inputBufObj, &attrlist));

          

            memset(pairArrayOut, 0, sizeof(NvSciBufAttrKeyValuePair) * 10);

          

            int numAttrs = 0;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_Size;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_PlaneChannelCount;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_PlaneCount;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_PlaneWidth;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_PlaneHeight;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_Layout;

            pairArrayOut[numAttrs++].key = NvSciBufImageAttrKey_PlaneBitsPerPixel;

This file has been truncated. show original

Under DriveOS 5.2.6 it worked well, but under DriveOS 6.0.8.1 our CUDA based algorithms produce broken images.

The root cause is that under DriveOS 6 during the image allocation we get a different plane pitch than the plane width even for the block linear images. Under DriveOS 5.2.6 they are the same. Plane pitch is an output only parameter.

As far as I know using an aligned pitch instead of the width only makes sense when the image is pitch linear. For block linear images there is no point to use different pitch than the width, this is why CUDA doesn’t handle pitch, just width when importing mipmapped arrays (make_cudaExtent).

Could you please share some sample code how to import a block linear NvSciBufObj image that has a different plane pitch than plane width? The official CUDA sample app doesn’t handle this case, it uses plane width and ignores plane pitch.

Thank you, kind regards,
Adam

AdamBalazsVay · March 13, 2024, 1:54pm

As I see there is a bug opened on the CUDA sample app:

github.com/NVIDIA/cuda-samples

A bug of 4_CUDA_Libraries/cudaNvSciNvMedia/cuda_consumer.cu

opened 06:45AM - 03 Feb 23 UTC

Wenlong-Zhu

Get the wrong result when referring to the sample for setting the cudaExtent.wid…th. Because of the change in NvSciBuf's padding size, on Drive OS 6.0.5.0 Linux. the cudaExtent.width should be set to the value of NvSciBufImageAttrKey_PlanePitch/(NvSciBufImageAttrKey_PlaneBitsPerPixel/8) ; // https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaExtent.html#structcudaExtent // Width in elements when referring to array memory, in bytes when referring to linear memory // NvSciBufImageAttrKey_PlanePitch: Outputs the pitch (aka width in bytes) for every plane. // You can find the information on NvOnline Bug 3880762

Using the pitch instead of the width here would not be a proper solution, as a CUDA kernel would process the part between width and pitch as that part would be considered to be part of the image (the cudaArray or the surface object created on top of the cudaArrays).

AdamBalazsVay · March 20, 2024, 2:24pm

Dear NVIDIA support,

Do you have any update on this?

VickNV · March 20, 2024, 6:56pm

According to the documentation, PlaneWidth and PlanePitch are expected to have different values. Why are the same as you observed in DRIVE OS 5.2.6? Could you elaborate on what changes in the two attributes from DRIVE 6 have caused the issue you’re facing?

VickNV · March 20, 2024, 9:12pm

Please refer to the following code snippet for passing the aligned values computed by NvSciBuf through the cudaExtent struct:

width = planePitch[i] / (planeBitsPerPixel[i] / 8);
height = PlaneAlignedHeight[i];
depth = 0;

AdamBalazsVay · March 21, 2024, 7:44am

Hi @VickNV,

Thank you very much for your reply. Yes, we know that in theory plane pitch and plane width can be different. It makes sense if NvSciBufImageAttrKey_Layout is NvSciBufImage_PitchLinearType, then we can have aligned lines. But for block linear images having a different pitch than width doesn’t make any sense for me as it is a cache-optimized layout anyway so line length doesn’t affect performance (could you explain why it is beneficial?), I think this is why CUDA API doesn’t handle different pitch and width in cudaArrays. But it is not limited by the DriveOS API, you are right.

We haven’t modified the CUDA import code when ported our codebase from DriveOS 5.2.6 to DriveOS 6.0.8.1, but the images with the DriveOS 6 version are broken. We traced down that the actual image widths and image pitches we got for block linear images were the same in DriveOS 5.2.6, but they are different in DriveOS 6.0.8.1. Based on the DriveOS API it is not a bug, but we were surprised as for block linear images we didn’t expect different pitch and width, and even the official CUDA sample application doesn’t handle that case.

width = planePitch[i] / (planeBitsPerPixel[i] / 8);
height = PlaneAlignedHeight[i];
depth = 0;

If we use the pitch as the width of the cudaArray then the padded area (that is not part of the image, the area at the end of each line between width … pitch) becomes part of the cudaArray and the cudaSurface. That part should not be part of the image.

In my opinion either pitch should be equal to width in the case of block linear images or the API that imports the memory to cudaArray should handle the case when pitch != width (but CUDA API doesn’t support it).

VickNV · March 21, 2024, 5:36pm

The pixel data layout within a block varies across NVIDIA SoC architectures, and consequently, the plane pitch and plane width of the block-linear format may differ. For Orin, specifically, the block width is 64 bytes, resulting in a plane pitch aligned to 64 bytes (not considering the requirements of other engines).

AdamBalazsVay · March 21, 2024, 5:46pm

I see, thank you.

But then to have proper CUDA ↔ NvSciBufObj interoperability we need a way to import a cudaArray with a pitch != width. All we can specify when importing the cudaArray is the width. If we use the pitch instead of the width the cudaArray will contain the padding elements between width … pitch in the end of every line. Those pixels are not part of the image, but still, the cudaArray would contain them.

VickNV · March 21, 2024, 5:59pm

The code snippet I provided addressed the image broken problem in the internal application as mentioned in the GitHub issue you shared. Could you confirm if this solution does not resolve your current issue?

AdamBalazsVay · March 21, 2024, 7:25pm

I can confirm it works around the issue and produces the correct output. But you see the problem with this approach, right, as with the modification the cudaArray maps the padded area as it was a meaningful part part of the image, but it is not.

AdamBalazsVay · March 26, 2024, 3:31pm

Dear @VickNV,

Seemingly the issue is more complex than I thought. I created a test application that compiles on both DriveOS 5.6.2 and 6.0.8.1.
When I create the NvSciBufObj I specify image width 2912.

On DriveOS 5 the raw image is 2912 pixels wide, but the pitch is 2944, so seemingly it is 128 bytes aligned. So the whole Y plane is 2944 * 1892 (instead of 2912 * 1892).
On DriveOs 6 the raw image is 2912 pixels wide, but the pitch is 3072, so it aligns to 256 bytes. So the whole Y plane is 3072 * 1892 (instead of 2912 * 1892).

So in both versions width != pitch.
Even if I use the code from the official CUDA sample, using width instead of pitch:

cudaExtent const extent{make_cudaExtent(planeWidth[planeIndex], planeHeight[planeIndex], 0)};

On DriveOS5 I get a perfect image, while on DriveOS6 the image is broken. Pitch != width in the case of the DriveOS 5 version as well, so the cudaArray is not aware that the underlying memory has a different pitch than the cudaArray’s width. But somehow, magically it uses the real pitch when copying. Under DriveOS 6 it is not the case, the output image is broken.

The full source code of my sample application can be found here, you can compile with the commands in the first two lines: CUDA <-> NvMedia - Pastebin.com

In line 429 if you use the commented out code instead of width the code starts to work under DriveOS 6 as well.

cudaExtent const extent{make_cudaExtent(planePitch[planeIndex] / channelCountPerPlane[planeIndex], planeHeight[planeIndex], 0)};

So there must be some difference under the hood that made the DriveOS 5 version using the width as a cudaExtent’s width work even when the pitch != width. But that mechanism got broken on DriveOS 6.

VickNV · March 26, 2024, 8:41pm

The link is currently unavailable.

AdamBalazsVay · March 26, 2024, 8:49pm

Oh, here it’s still available. Anyway I paste it here:

// gcc -o cuda_nvmedia_test ./samples/cuda_nvmedia_test.cpp -I ./nvmedia_include/usr/include/ -I /usr/local/cuda-11.4/targets/aarch64-linux/include -std=c++17 -lnvmedia_dla -lnvscibuf -lnvscisync -lnvmedia2d -lnvmedia_iep_sci -lnvmedia_ide_sci -lnvmedialdc -lpthread -lstdc++ -L /usr/local/cuda-11.4/targets/aarch64-linux/lib/ -lcudart -DDRIVEOS_6

// gcc -o cuda_nvmedia_test ./samples/cuda_nvmedia_test.cpp -I /usr/local/cuda-10.2/targets/aarch64-linux/include -std=c++17 -lnvmedia_dla -lnvscibuf -lnvscisync -lnvmedia_2d -lnvmedia_core -lnvmedia -lpthread -lstdc++ -L /usr/local/cuda-10.2/targets/aarch64-linux/lib/ -lcudart -DDRIVEOS_5_COMPAT

#include <nvscibuf.h>
#include <nvscisync.h>
#ifdef DRIVEOS_5_COMPAT
    #include <nvmedia_surface.h>
    #include <nvmedia_ldc.h>
    #include <nvmedia_2d.h>
    #include <nvmedia_image.h>
    #include <nvmedia_image_nvscibuf.h>
    #include <nvmedia_2d_nvscisync.h>
    #include <nvmedia_ldc_nvscisync.h>
#else
    #include <nvmedia_6x/nvmedia_ldc.h>
    #include <nvmedia_6x/nvmedia_2d.h>
    #include <nvmedia_6x/nvmedia_2d_sci.h>
    #include <nvmedia_6x/nvmedia_ldc_sci.h>
    #include <nvmedia_6x/nvmedia_iep.h>
    #include <nvmedia_6x/nvmedia_ldc_util.h>
#endif
#include <cuda_runtime.h>

#include <stddef.h>
#include <cstdint>
#include <vector>
#include <memory>
#include <cstdio>
#include <type_traits>
#include <chrono>
#include <thread>
#include <fstream>
#include <iostream>

#define LOG_ERR(...) printf(__VA_ARGS__); printf("\n")

static constexpr size_t imgW = 2912;
static constexpr size_t imgH = 1080;

std::array<uint8_t, imgW * imgH> inImgCudaY;
std::array<uint8_t, (imgW) * (imgH / 2)> inImgCudaUV;

std::array<uint8_t, imgW * imgH> outImgCpuY;
std::array<uint8_t, (imgW / 2) * (imgH / 2)> outImgCpuU;
std::array<uint8_t, (imgW / 2) * (imgH / 2)> outImgCpuV;
std::array<void*, 3> outImgPtrs{outImgCpuY.data(), outImgCpuU.data(), outImgCpuV.data()};

using ScopedSciBufObj = std::unique_ptr<NvSciBufObjRefRec, std::integral_constant<decltype(NvSciBufObjFree)*, NvSciBufObjFree>>;
using ScopedCpuWaitContext = std::unique_ptr<NvSciSyncCpuWaitContextRec, std::integral_constant<decltype(NvSciSyncCpuWaitContextFree)*, NvSciSyncCpuWaitContextFree>>;
using ScopedNvMedia2D = std::unique_ptr<NvMedia2D, std::integral_constant<decltype(NvMedia2DDestroy)*, NvMedia2DDestroy>>;
using ScopedNvSciSyncAttrList = std::unique_ptr<NvSciSyncAttrListRec, std::integral_constant<decltype(NvSciSyncAttrListFree)*, NvSciSyncAttrListFree>>;
using ScopedNvSciBufAttrList = std::unique_ptr<NvSciBufAttrListRec, std::integral_constant<decltype(NvSciBufAttrListFree)*, NvSciBufAttrListFree>>;
using ScopedSciSyncObj = std::unique_ptr<NvSciSyncObjRec, std::integral_constant<decltype(NvSciSyncObjFree)*, NvSciSyncObjFree>>;

#ifdef DRIVEOS_5_COMPAT
    static void initNvMediaImageNvSciBuf()
    {
        static constexpr auto initLambda{[]() -> void* {
            
            if(auto const err{NvMediaImageNvSciBufInit()}; err != NVMEDIA_STATUS_OK) {
                LOG_ERR("Could not init NvMediaImageNvSciBuf: %d", err);
                return nullptr;
            }
            return reinterpret_cast<void*>(1); // important to return something non-null (unique_ptr won't call the deleter with nullptr)
        }};

        static constexpr auto dummyDeleterFunctionPtr{
            +[](void*) {
                #ifdef DRIVEOS_5_COMPAT
                    NvMediaImageNvSciBufDeinit();
                #endif
            }
        };

        static std::unique_ptr<void, void(*)(void*)> const nvMediaSciBufInitDeinit{
            initLambda(),
            dummyDeleterFunctionPtr
        };
    }

    static NvMediaDevice* getNvMediaDevice()
    {
        static constexpr auto initLambda{[]() {
            initNvMediaImageNvSciBuf();
            auto const nvMediaDevice{NvMediaDeviceCreate()};
            if(nvMediaDevice == nullptr) {
                LOG_ERR("Could not create NvMediaDevice");
            }
            return nvMediaDevice;
        }};

        static std::unique_ptr<NvMediaDevice, std::integral_constant<decltype(NvMediaDeviceDestroy)*, NvMediaDeviceDestroy>> const nvMediaDevice{initLambda()};
        return nvMediaDevice.get();
    }
#endif

static NvSciBufModule getNvSciBufModule()
{
    static constexpr auto initLambda{[]() -> NvSciBufModule {
        #ifdef DRIVEOS_5_COMPAT
            initNvMediaImageNvSciBuf();
        #endif
        NvSciBufModule rawNvSciBufModule{};
        if(auto const error{NvSciBufModuleOpen(&rawNvSciBufModule)}; error != NvSciError_Success)
        {
            LOG_ERR("NvSciBufModuleOpen failed: %d", error);
            return nullptr;
        }
        return rawNvSciBufModule;
    }};

    static std::unique_ptr<NvSciBufModuleRec, std::integral_constant<decltype(NvSciBufModuleClose)*, NvSciBufModuleClose>> const nvSciBufModule{initLambda()};
    return nvSciBufModule.get();
}

static NvSciSyncModule getNvSciSyncModule()
{
    static constexpr auto initLambda{[]() -> NvSciSyncModule {
        #ifdef DRIVEOS_5_COMPAT
            initNvMediaImageNvSciBuf();
        #endif
        NvSciSyncModule rawNvSciSyncModule{};
        if(auto const error{NvSciSyncModuleOpen(&rawNvSciSyncModule)}; error != NvSciError_Success)
        {
            LOG_ERR("NvSciSyncModuleOpen failed: %d", error);
            return nullptr;
        }
        return rawNvSciSyncModule;
    }};

    static std::unique_ptr<NvSciSyncModuleRec, std::integral_constant<decltype(NvSciSyncModuleClose)*, NvSciSyncModuleClose>> const nvSciSyncModule{initLambda()};
    return nvSciSyncModule.get();
}

#ifdef DRIVEOS_5_COMPAT
    NvSciBufAttrList nvMediaSurfAttrsToNvSciBufAttrList(NvMediaSurfaceType surfType, NvMediaSurfAllocAttr const *surfAttrs, uint32_t numSurfAttrs, NvSciBufAttrValAccessPerm accessPerm)
    {  
        // NvMedia
        ScopedNvSciBufAttrList nvmediaAttr{nullptr};
        {
            NvSciBufAttrList rawNvMediaAttr{nullptr};
            if (auto const err{NvSciBufAttrListCreate(getNvSciBufModule(), &rawNvMediaAttr)}; NvSciError_Success != err) {
                LOG_ERR("Couldn't create NvSciBufAttr");
                return nullptr;
            }
            nvmediaAttr.reset(rawNvMediaAttr);
        }

        if (auto const err{NvMediaImageFillNvSciBufAttrs(getNvMediaDevice(), surfType, surfAttrs, numSurfAttrs, 0, nvmediaAttr.get())}; NVMEDIA_STATUS_OK != err) {
            LOG_ERR("Couldn't fill NvSciBufAttrs");
            return nullptr;
        }


        // CUDA + NPPI
        ScopedNvSciBufAttrList cudaAttr{nullptr};
        {
            static constexpr NvSciBufType bufType = NvSciBufType_Image;

            {
                NvSciBufAttrList rawCudaAttr{nullptr};
                if (auto const err{NvSciBufAttrListCreate(getNvSciBufModule(), &rawCudaAttr)}; NvSciError_Success != err) {
                    LOG_ERR("Couldn't create NvSciBufAttr");
                    return nullptr;
                }
                cudaAttr.reset(rawCudaAttr);
            }

            {
                NvSciBufAttrKeyValuePair setAttrs[] = {                              
                    { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
                    { NvSciBufGeneralAttrKey_RequiredPerm, &accessPerm, sizeof(accessPerm) },
                };

                if (auto const err{NvSciBufAttrListSetAttrs(cudaAttr.get(), setAttrs, sizeof(setAttrs) / sizeof(NvSciBufAttrKeyValuePair))}; NvSciError_Success != err) {
                    LOG_ERR("Couldn't fill NvSciBufAttrs");
                    return nullptr;
                }
            }
        }

        /* Reconcile the NvSciBufAttrs and then allocate an NvSciBufObj. */
        NvSciBufAttrList attrList[2] = {nvmediaAttr.get(), cudaAttr.get()};
        NvSciBufAttrList reconciled{nullptr};
        NvSciBufAttrList conflicts{nullptr};
        if (auto const err{NvSciBufAttrListReconcile(attrList, 2, &reconciled, &conflicts)}; NvSciError_Success != err) {
            LOG_ERR("Couldn't reconcile NvSciBufAttr and allocate NvSciBufObj");
            if(conflicts != nullptr)
            {
                NvSciBufAttrListFree(conflicts);
            }
        }
        return reconciled;
    }
#else
    static const std::vector<NvSciRmGpuId>& getGpuIds()
    {
        static const auto initLambda{[]() -> std::vector<NvSciRmGpuId> {
            int cudaDeviceCount{0};
            if (auto const err{cudaGetDeviceCount(&cudaDeviceCount)}; err != cudaSuccess)
            {
                LOG_ERR("Couldn't get cuda device count: %d", err);
                return {};
            }

            std::vector<NvSciRmGpuId> result{};
            result.reserve(cudaDeviceCount);

            for(int currDevice{0}; currDevice < cudaDeviceCount; ++currDevice)
            {
                cudaDeviceProp currDeviceProp{};
                if (auto const err{cudaGetDeviceProperties(&currDeviceProp, currDevice)}; err != cudaSuccess)
                {
                    LOG_ERR("Couldn't get cuda device properties: %d", err);
                    return {};
                }
                result.emplace_back(NvSciRmGpuId{});
                memcpy(result.back().bytes, currDeviceProp.uuid.bytes, 16 * sizeof(uint8_t));
            }
            return result;
        }};

        static const std::vector<NvSciRmGpuId> gpuIds{initLambda()};
        return gpuIds;
    }
#endif

ScopedSciBufObj allocateImage(uint32_t width, uint32_t height)
{
    ScopedNvSciBufAttrList imageAttrReconciled{nullptr};

    #ifdef DRIVEOS_5_COMPAT
        {
            NvMediaSurfAllocAttr surfAllocAttr[] = {
                { NVM_SURF_ATTR_WIDTH, width },
                { NVM_SURF_ATTR_HEIGHT, height },
                { NVM_SURF_ATTR_EMB_LINES_TOP, 0 },
                { NVM_SURF_ATTR_EMB_LINES_BOTTOM, 0 },
                { NVM_SURF_ATTR_CPU_ACCESS, NVM_SURF_ATTR_CPU_ACCESS_CACHED },
                { NVM_SURF_ATTR_ALLOC_TYPE, NVM_SURF_ATTR_ALLOC_ISOCHRONOUS },
                { NVM_SURF_ATTR_PEER_VM_ID, 0 },
                { NVM_SURF_ATTR_SCAN_TYPE, NVM_SURF_ATTR_SCAN_PROGRESSIVE },
                { NVM_SURF_ATTR_COLOR_STD_TYPE, NVM_SURF_ATTR_COLOR_STD_REC709_ER }
            };

            NvMediaSurfFormatAttr surfFormatAttr[NVM_SURF_FMT_ATTR_MAX]{};
            NVM_SURF_FMT_SET_ATTR_YUV(surfFormatAttr, YUV, 420, SEMI_PLANAR, UINT, 8, BL);

            NvMediaSurfaceType const surfType = NvMediaSurfaceFormatGetType(surfFormatAttr, NVM_SURF_FMT_ATTR_MAX);

            NvSciBufAttrList const rawImageAttrReconciled{nvMediaSurfAttrsToNvSciBufAttrList(surfType, surfAllocAttr, sizeof(surfAllocAttr) / sizeof(NvMediaSurfAllocAttr), NvSciBufAccessPerm_ReadWrite)};
            imageAttrReconciled.reset(rawImageAttrReconciled);
        }
    #else
        static constexpr NvSciBufAttrValImageLayoutType layout{NvSciBufImage_BlockLinearType};
        static constexpr NvSciBufAttrValAccessPerm accessPerm{NvSciBufAccessPerm_ReadWrite};
        static constexpr NvSciBufType bufType{NvSciBufType_Image};
        static constexpr bool needCPUAccess{true};
        static constexpr bool enableCpuCache{true};
        static constexpr uint32_t planeCount{2};
        static constexpr uint64_t padding[]{0, 0};
        static constexpr NvSciBufAttrValColorFmt colorFormat[]{NvSciColor_Y8, NvSciColor_U8_V8};
        static constexpr NvSciBufAttrValColorStd colorStd[]{NvSciColorStd_REC709_ER, NvSciColorStd_REC709_ER}; 
        static constexpr int32_t planeBaseAddrAlign[]{1024, 1024};
        uint32_t const planeWidth[]{width, width / 2};
        uint32_t const planeHeight[]{height, height / 2};
        static constexpr NvSciBufAttrValImageScanType scanType{NvSciBufScan_ProgressiveType};
        static constexpr bool vpr{false};
        static constexpr uint64_t imageCount{1};
        static std::vector<NvSciRmGpuId> const& gpuIds{getGpuIds()};
        static std::vector<NvSciBufAttrValGpuCache> const& gpuCache{[](){
            std::vector<NvSciBufAttrValGpuCache> result{};
            result.reserve(gpuIds.size());
            for(auto const& currGpu: gpuIds) {
                result.emplace_back(NvSciBufAttrValGpuCache{currGpu, false});
            }
            return result;
        }()};

        if(gpuIds.empty())
        {
            LOG_ERR("Couldn't get GPU vector.");
            return nullptr;
        }

        ScopedNvSciBufAttrList imageAttr{nullptr};
        {
            {
                NvSciBufAttrList rawImageAttr{nullptr};
                if (auto const err{NvSciBufAttrListCreate(getNvSciBufModule(), &rawImageAttr)}; NvSciError_Success != err) {
                    LOG_ERR("Couldn't create NvSciBufAttr: %d", err);
                    return nullptr;
                }
                imageAttr.reset(rawImageAttr);
            }

            {
                NvSciBufAttrKeyValuePair setAttrs[] = {                              
                    { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
                    { NvSciBufGeneralAttrKey_RequiredPerm, &accessPerm, sizeof(accessPerm) },
                    { NvSciBufGeneralAttrKey_NeedCpuAccess, &needCPUAccess, sizeof(needCPUAccess) },
                    { NvSciBufGeneralAttrKey_EnableCpuCache, &enableCpuCache, sizeof(enableCpuCache) },
                    { NvSciBufGeneralAttrKey_GpuId, gpuIds.data(), sizeof(NvSciRmGpuId) * gpuIds.size() },
                    { NvSciBufGeneralAttrKey_EnableGpuCache, gpuCache.data(), sizeof(NvSciBufAttrValGpuCache) * gpuCache.size() },
                    { NvSciBufImageAttrKey_Layout, &layout, sizeof(layout) },
                    { NvSciBufImageAttrKey_PlaneCount, &planeCount, sizeof(planeCount) },
                    { NvSciBufImageAttrKey_TopPadding, &padding, sizeof(padding) },
                    { NvSciBufImageAttrKey_BottomPadding, &padding, sizeof(padding) },
                    { NvSciBufImageAttrKey_LeftPadding, &padding, sizeof(padding) },
                    { NvSciBufImageAttrKey_RightPadding, &padding, sizeof(padding) },
                    { NvSciBufImageAttrKey_PlaneColorFormat, &colorFormat, sizeof(colorFormat) },
                    { NvSciBufImageAttrKey_PlaneColorStd, &colorStd, sizeof(colorStd) },
                    { NvSciBufImageAttrKey_PlaneBaseAddrAlign, &planeBaseAddrAlign, sizeof(planeBaseAddrAlign) },
                    { NvSciBufImageAttrKey_PlaneWidth, &planeWidth, sizeof(planeWidth) },
                    { NvSciBufImageAttrKey_PlaneHeight, &planeHeight, sizeof(planeHeight) },
                    { NvSciBufImageAttrKey_ScanType, &scanType, sizeof(scanType) },
                    { NvSciBufImageAttrKey_VprFlag, &vpr, sizeof(vpr) },
                    { NvSciBufImageAttrKey_ImageCount, &imageCount, sizeof(imageCount) },
                };

                if (auto const err{NvSciBufAttrListSetAttrs(imageAttr.get(), setAttrs, sizeof(setAttrs) / sizeof(NvSciBufAttrKeyValuePair))}; NvSciError_Success != err) {
                    LOG_ERR("Couldn't fill NvSciBufAttrs: %d", err);
                    return nullptr;
                }
            }

            // We need to call this to be able to use the buffer from HW engines
            if (auto const err{NvMedia2DFillNvSciBufAttrList(nullptr, imageAttr.get())}; NVMEDIA_STATUS_OK != err) {
                LOG_ERR("NvMedia2DFillNvSciBufAttrList failed: %d", err);
                return nullptr;
            }
        }

        {
            NvSciBufAttrList attrList[]{imageAttr.get()};
            NvSciBufAttrList reconciled{nullptr};
            NvSciBufAttrList conflicts{nullptr};
            if (auto const err{NvSciBufAttrListReconcile(attrList, sizeof(attrList) / sizeof(NvSciBufAttrList), &reconciled, &conflicts)}; NvSciError_Success != err)
            {
                LOG_ERR("Couldn't reconcile NvSciBufAttr and allocate NvSciBufObj: %d", err);
                if(conflicts != nullptr)
                {
                    void* dump{};
                    size_t len{};
                    if (auto const err{NvSciBufAttrListDebugDump(conflicts, &dump, &len)}; NvSciError_Success != err) {
                        LOG_ERR("NvSciBufAttrListDebugDump failed: %d", err);
                    } else {
                        LOG_ERR("Conflicted args: %-*s", static_cast<int>(len), static_cast<char*>(dump));
                    }
                    NvSciBufAttrListFree(conflicts);
                } else {
                    LOG_ERR("No conflict list");
                }
            }
            imageAttrReconciled.reset(reconciled);
        }
    #endif

    NvSciBufObj result{nullptr};
    if (auto const err{NvSciBufObjAlloc(imageAttrReconciled.get(), &result)}; err != NvSciError_Success) {
        LOG_ERR("NvSciBufObjAlloc failed: %d", err);
        return nullptr;
    }
    return ScopedSciBufObj{result};
}

int main()
{
    // Allocate images
    auto srcImage{allocateImage(imgW, imgH)};
    uint64_t fullBufferSize{};

    struct CUexternalMemory_st *externalMemory;
    struct cudaMipmappedArray *mipmapArray[2];
    struct cudaArray *imagePlanes[2];
    unsigned long long cudaSurfaceNvmediaBuf[2];

    {
        // Get cuda maps
        NvSciBufAttrList sciBufAttr{nullptr};
        // Don't need to free the returned attribute list
        if (auto const err{NvSciBufObjGetAttrList(srcImage.get(), &sciBufAttr)}; NvSciError_Success != err) {
            LOG_ERR("Couldn't get sciBufObj attribute list: %d", err);
        }

        NvSciBufAttrKeyValuePair queriedAttrs[]{
            { NvSciBufImageAttrKey_Size, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneChannelCount, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneOffset, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneWidth, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneHeight, nullptr, 0 },
            { NvSciBufImageAttrKey_PlanePitch, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneBitsPerPixel, nullptr, 0 },
            { NvSciBufImageAttrKey_PlaneCount, nullptr, 0 },
        };

        // Query args returned filled by NvMedia
        if (auto const err{NvSciBufAttrListGetAttrs(sciBufAttr, queriedAttrs, sizeof(queriedAttrs) / sizeof(NvSciBufAttrKeyValuePair))}; NvSciError_Success != err) {
            LOG_ERR("Couldn't get NvSciBufAttrs: %d", err);
        }

        fullBufferSize = *static_cast<uint64_t const*>(queriedAttrs[0].value);
        uint8_t const *const channelCountPerPlane{static_cast<uint8_t const*>(queriedAttrs[1].value)};
        uint64_t const *planeOffset{static_cast<uint64_t const*>(queriedAttrs[2].value)};
        uint32_t const *planeWidth{static_cast<uint32_t const*>(queriedAttrs[3].value)};
        uint32_t const *planeHeight{static_cast<uint32_t const*>(queriedAttrs[4].value)};
        uint32_t const *planePitch{static_cast<uint32_t const*>(queriedAttrs[5].value)};
        uint32_t const *planePixelBits{static_cast<uint32_t const*>(queriedAttrs[6].value)};
        uint32_t const planeCount{*static_cast<uint32_t const*>(queriedAttrs[7].value)};


        // Import external memory
        {
            cudaExternalMemoryHandleDesc memHandleDesc{};
            memHandleDesc.type = cudaExternalMemoryHandleTypeNvSciBuf;
            memHandleDesc.handle.nvSciBufObject = srcImage.get();
            memHandleDesc.size = fullBufferSize;
            if(auto const error{cudaImportExternalMemory(&externalMemory, &memHandleDesc)}; error != cudaSuccess)
            {
                LOG_ERR("Could not import external memory: %d", error);
            }
        }

        for (uint32_t planeIndex{0}; planeIndex < planeCount; ++planeIndex)
        {
            LOG_ERR("Plane %d W: %d H: %d P: %d", planeIndex, planeWidth[planeIndex], planeHeight[planeIndex], planePitch[planeIndex]);
            /* Create mipmapArray */
            cudaExtent const extent{make_cudaExtent(planeWidth[planeIndex], planeHeight[planeIndex], 0)}; // It works only on DriveOS 5.2.6, on DriveOS 6.0.8.1 the image breaks
            //cudaExtent const extent{make_cudaExtent(planePitch[planeIndex] / channelCountPerPlane[planeIndex], planeHeight[planeIndex], 0)}; // It makes both versions work
            cudaChannelFormatDesc desc{};
            uint32_t const bitsPerPixel{planePixelBits[planeIndex] / channelCountPerPlane[planeIndex]};
            switch (channelCountPerPlane[planeIndex]) {
            case 1:
                desc = cudaCreateChannelDesc(bitsPerPixel, 0, 0, 0, cudaChannelFormatKindUnsigned);
            break;
            case 2:
                desc = cudaCreateChannelDesc(bitsPerPixel, bitsPerPixel, 0, 0, cudaChannelFormatKindUnsigned);
            break;
            default: 
                LOG_ERR("Unsupported channel count");
            }

            cudaExternalMemoryMipmappedArrayDesc mipmapDesc{};
            mipmapDesc.offset = planeOffset[planeIndex];
            mipmapDesc.formatDesc = desc;
            mipmapDesc.extent = extent;
            mipmapDesc.flags = 0;
            mipmapDesc.numLevels = 1;

            if(auto const error{cudaExternalMemoryGetMappedMipmappedArray(&mipmapArray[planeIndex], externalMemory, &mipmapDesc)}; error != cudaSuccess)
            {
                LOG_ERR("Could not create mipmapped array: %d", error);
            }

            /* Create cuda arrays */

            if(auto const error{cudaGetMipmappedArrayLevel(&imagePlanes[planeIndex], mipmapArray[planeIndex], 0 /*mipmapLevel*/)}; error != cudaSuccess)
            {
                LOG_ERR("Could not create cuda array from mipmapArray: %d", error);
            }

            /* Create cuda surface */
            {
                cudaResourceDesc resourceDesc{};
                resourceDesc.resType = cudaResourceTypeArray;
                resourceDesc.res.array.array = imagePlanes[planeIndex];

                if(auto const error{cudaCreateSurfaceObject(&cudaSurfaceNvmediaBuf[planeIndex], &resourceDesc)}; error != cudaSuccess)
                {
                    LOG_ERR("Could not create cuda surface from array: %d", error);
                }
            }
        }
    }

    // init input image
    for(size_t y = 0; y < imgH; ++y)
    {
        for(size_t x = 0; x < imgW; ++x)
        {
            inImgCudaY[y * imgW + x] = (y * 128 / imgH) + (x * 128 / imgW);
            if(x % 2 == 0 && y % 2 == 0)
            {
                inImgCudaUV[(y / 2) * imgW + x] = (y * 128 / imgH) + (x * 128 / imgW);
                inImgCudaUV[(y / 2) * imgW + x + 1] = 255 - (y * 128 / imgH) + (x * 128 / imgW);
            }
        }
    }

    // Fill image from cuda
    {
        cudaError_t cudaError = cudaSuccess;
        cudaError = cudaMemcpy2DToArray(imagePlanes[0], 0, 0, inImgCudaY.data(), imgW, imgW, imgH, cudaMemcpyHostToDevice);
        if(cudaError != cudaSuccess)
        {
            LOG_ERR("Couldn't copy image to host memory, cudaMemcpy failed: %d", cudaError);
            return -1;
        }
        cudaError = cudaMemcpy2DToArray(imagePlanes[1], 0, 0, inImgCudaUV.data(), imgW, imgW, imgH / 2, cudaMemcpyHostToDevice);
        if(cudaError != cudaSuccess)
        {
            LOG_ERR("Couldn't copy image to host memory, cudaMemcpy failed: %d", cudaError);
            return -1;
        }
    }

    // Get output data
    {
        std::array<uint32_t, 3> outImgPitches{imgW, imgW / 2, imgW / 2};

        #ifdef DRIVEOS_5_COMPAT
            NvMediaImage* nvMediaImage{};
            if (auto const err{NvMediaImageCreateFromNvSciBuf(getNvMediaDevice(), srcImage.get(), &nvMediaImage)}; err != NVMEDIA_STATUS_OK) {
                LOG_ERR("NvMediaImageCreateFromNvSciBuf failed: %d", err);
                return -1;
            }
            NvMediaImageSurfaceMap surfaceMap{};
            if(auto const error{NvMediaImageLock(nvMediaImage, NVMEDIA_IMAGE_ACCESS_READ, &surfaceMap)}; error != NVMEDIA_STATUS_OK) 
            {
                LOG_ERR("NvMediaImageLock failed: %d", error);
            }
            if(auto const error{NvMediaImageGetBits(nvMediaImage, nullptr, const_cast<void**>(outImgPtrs.data()), outImgPitches.data())}; error != NVMEDIA_STATUS_OK) 
            {
                LOG_ERR("NvMediaImageGetBits failed: %d", error);
            }
            NvMediaImageUnlock(nvMediaImage);
        #else
            std::array<uint32_t, 3> outImgSizes{outImgCpuY.size(), outImgCpuU.size(), outImgCpuV.size()};
            if(auto const error{NvSciBufObjGetPixels(srcImage.get(), nullptr, outImgPtrs.data(), outImgSizes.data(), outImgPitches.data())}; error != NvSciError_Success) {
                LOG_ERR("NvSciBufObjGetPixels failed: %d", error);
                return -1;
            }
        #endif
    }

    // Save images
    {
        std::ofstream imgFs("img_cuda" + std::to_string(imgW) + "x" + std::to_string(imgH) + ".yuv", std::ios::out | std::ios::binary | std::ios::trunc);
        imgFs.write(reinterpret_cast<char*>(outImgCpuY.data()), outImgCpuY.size());
        imgFs.write(reinterpret_cast<char*>(outImgCpuU.data()), outImgCpuU.size());
        imgFs.write(reinterpret_cast<char*>(outImgCpuV.data()), outImgCpuV.size());
        imgFs.close();
    }
}

AdamBalazsVay · April 18, 2024, 6:13am

Dear @VickNV, could you reproduce the issue and see that DriveOS 5 and 6 behaves differently? Do you have an explanation what is going on under the hood?

VickNV · April 19, 2024, 5:04am

Sorry for the oversight. I’ll make sure to reproduce the issue next week and provide you with an update afterward.

VickNV · April 22, 2024, 11:41pm

Could you clarify why you use the expression “planePitch[planeIndex] / channelCountPerPlane[planeIndex]” instead of “planePitch[planeIndex] / (planeBitsPerPixel[planeIndex] / 8)”? Have you tried the latter expression, and does it work on both DRIVE OS 5 and 6?

AdamBalazsVay · April 23, 2024, 5:51am

Dear @VickNV

It was a bug indeed, but it didn’t cause any harm with NV12 images. For the UV plane channelCountPerPlane[planeIndex] is 2 exactly as planeBitsPerPixel[planeIndex] / 8. For Y plane both give 1.

But the interesting observation is that planePitch[planeIndex] / (planeBitsPerPixel[planeIndex] / 8) != planeWidth[planeIndex] in both cases (DriveOS 5 and 6), but even when we use planeWidth[planeIndex] the DriveOS 5 version works but the DriveOS 6 version produces a broken image. How can the DriveOS 5 version produce the right output, what kind of mechanism makes it work on DriveOS 5 that stopped working on DriveOS 6?

VickNV · April 23, 2024, 6:57pm

Why it works on DRIVE 5 can be attributed to the block-linear format’s structure, which is organized in Group of Bytes (GOB) with a 64-byte alignment. This means that even if a value such as 2912 is passed to a function, it is internally adjusted to a 64-byte aligned value. In this example, the value would be rounded up to 2944 to maintain the required alignment.

However, starting from DRIVE OS 6.0.5, the reconciling pitch alignment increased to 512 bytes from 64 bytes due to internal changes. So the value passed to cudaExternalMemoryGetMappedMipmappedArray must now be calculated based on planePitch.

AdamBalazsVay · April 24, 2024, 6:36am

Dear @VickNV,

Thank you very much, it was a super insightful explanation. It is super important for us to understand what is going on under the hood as there is a strange behavior under DriveOS 6.0.8.1 which we don’t understand yet.

We have a visual validation pipeline that loads a video, processes it and produces a quality score in the end. During the processing we use NvMedia and CUDA operations as well. This score is deterministic under DriveOS 5, every time we run the pipeline we get the very same score, all the images are pixel-perfect the same in the consecutive runs.

Under DriveOS 6 we get a deterministic score when we don’t use CUDA operations on the images. But whenever CUDA is involved the end score fluctuates slightly. The images looks good, there is nothing visibly broken, but the images from the consecutive runs are not pixel-perfectly the same.

Is it possible that the padded image are between width and pitch somehow affects CUDA operations, NvMedia operations or the encoder? We saw non-deterministic pipeline score under DriveOS 5 as well previously, it was caused by the memory junk if the height of the video was not dividable by 16.

We don’t have an explanation yet so we are brainstorming what can cause the slight end score fluctuation under DriveOS 6 when CUDA operations are involved, an unexpected side-effect of this changed pitch seems to be a good candidate. Do you have any idea what we should double-check? We already double and triple-checked the synchronization, it must not be the problem.

Thank you,
Adam

VickNV · April 24, 2024, 3:45pm

Could you create a new topic for this discussion and link it back to this topic for reference? Thanks.