How to run your own kernel in jetson nano

Hello.
I want to use jetsonnano to make my own kernel for opencv.
As a sample, I made a kernel that inverts negative and positive.
However, when I run it, I get the following error.

terminate called after throwing an instance of 'cv::Exception'
 what():  OpenCV(4.5.0) Kernel.cu:40: error: (-217:Gpu API call) unspecified launch failure in function 'cudaKernel'

When I pass the executable to cuda-memcheck, I get the following errors.

========= CUDA-MEMCHECK
========= Program hit cudaErrorDevicesUnavailable (error 46) due to "all CUDA-capable devices are busy or unavailable" on CUDA API call to cudaMalloc. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2fdb04]
=========     Host Frame:./gpu.exe [0x3ffac]
=========

How can I run my own kernel? Here is the source code and opencv build information.

Kernel.h

#pragma once 
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
void cudaKernel(cv::Mat &src, cv::Mat &dst);

Kernel.cu

#include <opencv2/cudev.hpp>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>


__global__ void kernel(unsigned char* src, unsigned char* dst, int cols, int rows) {
    
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    if (x < cols && y < rows){
        dst[y * cols + x] = UCHAR_MAX -  src[y * cols + x];
    }

}


void cudaKernel(cv::Mat &src, cv::Mat &dst)
{
    unsigned char* p_src;
    unsigned char* p_dst;
    unsigned char* p_cpudst;

    cudaMalloc((void **)&p_src, sizeof(unsigned char) * src.cols * src.rows * src.channels());
    cudaMalloc((void **)&p_dst, sizeof(unsigned char) * dst.cols * dst.rows * dst.channels());
    cudaMallocHost((void **)&p_cpudst,sizeof(unsigned char) * dst.cols * dst.rows * dst.channels());

    cudaMemcpy(p_src, src.data, sizeof(unsigned char) * src.cols * src.rows * src.channels(), cudaMemcpyHostToDevice);
    cudaMemcpy(p_dst, dst.data, sizeof(unsigned char) * dst.cols * dst.rows * dst.channels(), cudaMemcpyHostToDevice);
    
    dim3 dimBlock(8, 8);
    dim3 dimGrid(src.cols/8, src.rows/8);
    kernel<<<dimGrid, dimBlock>>>(p_src, p_dst, src.cols, src.rows);

    cudaMemcpy(p_cpudst, p_dst, sizeof(unsigned char) * dst.cols * dst.rows * dst.channels(), cudaMemcpyDeviceToHost);
    dst = cv::Mat(dst.rows, dst.cols, CV_8UC3, p_cpudst);

    cudaFree(p_src);
    cudaFree(p_dst);
    cudaFreeHost(p_cpudst);
    CV_CUDEV_SAFE_CALL(cudaGetLastError());
    CV_CUDEV_SAFE_CALL(cudaDeviceSynchronize());
}

sample.cpp

#include <opencv2/opencv.hpp>
#include "Kernel.h"
#include <opencv2/cudaimgproc.hpp>


int main() {
    cv::Mat frame;
    cv::Mat nega;
    frame = cv::imread("lena.jpg"); 
    
    cudaKernel(frame, nega);
    cv::imshow("win", nega);
    cv::waitKey(0);
    cv::destroyAllWindows();
    return 0;
} 

Opencv Build Information

General configuration for OpenCV 4.5.0 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            /home/jetson/workspace/opencv_contrib-4.5.0/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2021-07-09T05:18:31Z
    Host:                        Linux 4.9.201-tegra aarch64
    CMake:                       3.10.2
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               RELEASE

  CPU/HW features:
    Baseline:                    NEON FP16
      required:                  NEON
      disabled:                  VFPV3

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                /usr/bin/c++  (ver 7.5.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -W -Wall -Werror=return-type -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Werror=return-type -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed  
    Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed  
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          m pthread cudart_static dl rt nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot python2 python3 quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking video videoio videostab xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    world
    Disabled by dependency:      -
    Unavailable:                 cnn_3dobj cvv hdf java js julia matlab ovis sfm ts viz
    Applications:                apps
    Documentation:               NO
    Non-free algorithms:         NO

  GUI: 
    GTK+:                        YES (ver 2.24.32)
      GThread :                  YES (ver 2.56.4)
      GtkGlExt:                  NO
    VTK support:                 NO

  Media I/O: 
    ZLib:                        /usr/lib/aarch64-linux-gnu/libz.so (ver 1.2.11)
    JPEG:                        /usr/lib/aarch64-linux-gnu/libjpeg.so (ver 80)
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         /usr/lib/aarch64-linux-gnu/libpng.so (ver 1.6.34)
    TIFF:                        /usr/lib/aarch64-linux-gnu/libtiff.so (ver 42 / 4.0.9)
    JPEG 2000:                   build (ver 2.3.1)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      YES (2.2.5)
    FFMPEG:                      YES
      avcodec:                   YES (57.107.100)
      avformat:                  YES (57.83.100)
      avutil:                    YES (55.78.100)
      swscale:                   YES (4.8.100)
      avresample:                NO
    GStreamer:                   YES (1.14.5)
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            pthreads

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Lapack:                      NO
    Eigen:                       YES (ver 3.3.4)
    Custom HAL:                  YES (carotene (ver 0.0.1))
    Protobuf:                    build (3.5.1)

  NVIDIA CUDA:                   YES (ver 10.2, CUFFT CUBLAS)
    NVIDIA GPU arch:             53 62 72
    NVIDIA PTX archs:

  cuDNN:                         YES (ver 8.0.0)

  OpenCL:                        YES (no extra features)
    Include path:                /home/jetson/workspace/opencv-4.5.0/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python 2:
    Interpreter:                 /usr/bin/python2.7 (ver 2.7.17)
    Libraries:                   /usr/lib/aarch64-linux-gnu/libpython2.7.so (ver 2.7.17)
    numpy:                       /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.13.3)
    install path:                lib/python2.7/dist-packages/cv2/python-2.7

  Python 3:
    Interpreter:                 /usr/bin/python3 (ver 3.6.9)
    Libraries:                   /usr/lib/aarch64-linux-gnu/libpython3.6m.so (ver 3.6.9)
    numpy:                       /usr/lib/python3/dist-packages/numpy/core/include (ver 1.13.3)
    install path:                lib/python3.6/dist-packages/cv2/python-3.6

  Python (for build):            /usr/bin/python2.7

  Java:                          
    ant:                         NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    /usr/local
-----------------------------------------------------------------

Thank you.

Hi,

There are some issues in your implementation:

1.
The output buffer nega doesn’t allocate correctly.
So it causes an error when CUDA trying to access the output buffer.

Update to the following can fix the error:

cv::Mat frame = cv::imread("lena.jpg");
cv::Mat nega  = cv::Mat::zeros(frame.size(), CV_8UC3);

2.
A segmentation fault occurs since the p_cpudst is freed before display.
Please noted that OpenCV just wraps the buffer when creating cvmat with a pointer.

3.
The sample seems not complete yet.
It’s recommended to add a handler for dimGrid if the image size doesn’t equal to multiple of 8.
And the kernel only handles 1/3 of works so part of the output is still black.

Here is the output we got with the step1 & 2 modifications:

Attached our source for your reference as well.
Kernel.cu (1.5 KB)
sample.cpp (298 Bytes)

Thanks.

Hello.

Thanks to you I was able to create the perfect lena negative positive reverse image.
Thank you very much.