CUDA Error: an illegal instruction was encountered when using cudaHostAlloc

Sometimes the program starts without reporting any error. Other times it reports: CUDA Error: an illegal instruction was encountered. My code is as follows:

        if (bgr_cpu_raw_buffer_ == nullptr || bgr_gpu_raw_buffer_ == nullptr) {
            // Allocate pinned, mapped host memory for one YUV420P frame.
            cudaHostAlloc(reinterpret_cast<void **>(&bgr_cpu_raw_buffer_),
                          width_ * height_ * 3 / 2, cudaHostAllocMapped);
            // Get the device-side alias of the mapped host buffer.
            cudaHostGetDevicePointer(
                reinterpret_cast<void **>(&bgr_gpu_raw_buffer_),
                reinterpret_cast<void *>(bgr_cpu_raw_buffer_), 0);
        }
        // Copy the Y, U and V planes back to back into the mapped buffer.
        memcpy(bgr_cpu_raw_buffer_, av_frame_decode_->data[0], size0);
        memcpy(bgr_cpu_raw_buffer_ + size0, av_frame_decode_->data[1], size1);
        memcpy(bgr_cpu_raw_buffer_ + size0 + size1, av_frame_decode_->data[2],
               /* ... */);
        cuda_error = senseAD::adHal::cudaYUV420PToBGR(
                bgr_gpu_raw_buffer_, bgr_gpu_out, width_, height_,
                /* ... */);
        if (cuda_error != cudaSuccess) {
            /* ... */ << "CUDA Error: " << cudaGetErrorString(cuda_error);
        }

This error occurs randomly, but the probability of occurrence is quite high.
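
As an aside for readers, the snippet above copies three planes into a buffer of `width_ * height_ * 3 / 2` bytes, which is the size of a tightly packed YUV420P frame. The post does not show how `size0` and `size1` are computed, so the following is only a minimal sketch of one thing worth checking. FFmpeg's `AVFrame::linesize` can be larger than the visible width because of row padding, and copying `linesize * rows` bytes per plane would then write past the end of the allocation. The type `Yuv420pLayout`, the helper names, and the example strides are hypothetical and not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: byte sizes of the Y, U and V planes.
struct Yuv420pLayout {
    std::size_t size0;  // Y plane
    std::size_t size1;  // U plane
    std::size_t size2;  // V plane
    std::size_t total() const { return size0 + size1 + size2; }
};

// Tightly packed layout: luma rows are `width` bytes, chroma rows `width / 2`.
// Assumes even width and height.
inline Yuv420pLayout packedLayout(std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t chroma = (width / 2) * (height / 2);
    return {width * height, chroma, chroma};
}

// Layout obtained when each plane is copied as stride * rows bytes,
// where the strides come from a decoder (e.g. AVFrame::linesize).
inline Yuv420pLayout stridedLayout(const int linesize[3], std::size_t height) {
    return {static_cast<std::size_t>(linesize[0]) * height,
            static_cast<std::size_t>(linesize[1]) * (height / 2),
            static_cast<std::size_t>(linesize[2]) * (height / 2)};
}

// True when the three plane copies stay inside a width * height * 3 / 2 buffer.
inline bool fitsInAllocation(const Yuv420pLayout& layout,
                             std::size_t width, std::size_t height) {
    return layout.total() <= width * height * 3 / 2;
}
```

If the strides are padded, one option is to copy each plane row by row (`width` or `width / 2` bytes per row); another is to size the allocation from the strides instead of the width.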

cudaError_t cudaYUV420PToBGR(unsigned char *inputYUV,
                             unsigned char *output,
                             size_t width,
                             size_t height,
                             int linesize[3],
                             cudaStream_t stream) {
    if (!inputYUV || !output) {
        return cudaErrorInvalidDevicePointer;
    }
    if (width == 0 || height == 0) {
        return cudaErrorInvalidValue;
    }
    unsigned int blockSize = 1024;
    // Round up so that numBlocks * blockSize covers width / 2 items.
    unsigned int numBlocks = (width / 2 + blockSize - 1) / blockSize;
    // Note: the launch uses the default stream; `stream` is not passed here.
    YUV420PToBGR<<<numBlocks, blockSize>>>(
        inputYUV, output, width, height, linesize[0], linesize[1], linesize[2]);
    return cudaGetLastError();
}

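For readers following the launch arithmetic in the wrapper above, here is a small host-side sketch of the same ceiling division. The `YUV420PToBGR` kernel body is not shown in the thread, so the mapping of one thread to each pair of horizontally adjacent pixels is an assumption, and the helper names `blocksFor` and `launchedThreads` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Smallest block count such that blocks * blockSize >= items.
inline unsigned int blocksFor(std::size_t items, unsigned int blockSize) {
    assert(blockSize > 0);
    return static_cast<unsigned int>((items + blockSize - 1) / blockSize);
}

// Total threads launched by the 1D configuration in the wrapper,
// assuming width / 2 work items (an assumption; the kernel is not shown).
inline std::size_t launchedThreads(std::size_t width, unsigned int blockSize) {
    return static_cast<std::size_t>(blocksFor(width / 2, blockSize)) * blockSize;
}
```

For a 1920-pixel-wide frame this gives one block of 1024 threads for 960 work items, so the surplus threads would need a bounds check inside the kernel. Whether the kernel has one, and how it walks over the rows, cannot be determined from the thread.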

Would you mind sharing a runnable sample with us?
We want to reproduce this internally to gather more information first.

Also, which JetPack version are you using?


I can’t share runnable source code, but I can describe the situation. I have seven channels of video that need to be decoded through ffmpeg-nvidia. The problematic code copies the decoded data and converts it to BGR. When I allocate the buffer with cudaHostAlloc, this error occasionally appears while decoding seven channels of video, and I can guarantee that it does not appear when decoding six channels.
And my JetPack version is:

 - NVIDIA Jetson Xavier NX (Developer Kit Version)
   * Jetpack 4.4 [L4T 32.4.3]
   * NV Power Mode: MODE_15W_6CORE - Type: 2
   * jetson_stats.service: active
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN:
   * TensorRT:
   * Visionworks:
   * OpenCV: NOT_INSTALLED compiled CUDA: NO
   * VPI: 0.3.7
   * Vulkan: 1.2.70


Is it possible that there is more than one processor accessing the buffer?

I can confirm that only one processor accesses the buffer.

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.


We will need a reproducible sample to get more information about the failure.
Is it possible to write a simple program that hits the same error?

