What can lead to different numbers of out-of-bounds / misalignment errors?

I have an issue that I don't understand. I read pointcloud data from a RealSense depth camera and want to do some calculations on the pointclouds of two consecutive frames. In the code snippet below I store frame A in

std::vector<glm::vec3> frameA

and frame B equivalently. In the code snippet below I fill the two vectors with toy data and it works as expected. You can run it if libglm-dev is installed (header only). My CUDA version is 10.1.

However, when I use my RealSense data instead of the toy data, CUDA throws dozens or hundreds of errors (see the error output below the code snippet).

I know this is not the original code, and it may seem pointless to discuss this issue without it. Nevertheless, I hope someone can point me in the right direction.

  • Why does CUDA sometimes throw only 60 errors and sometimes 1034?
  • Why does my original code with the RealSense pointcloud data throw an error in that line?

#include <cuda.h>
#include <glm/glm.hpp>
#include <vector>
#include <iostream>

__global__ void kernTest(glm::vec3 *input1, glm::vec3 *input2)
{
    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (idx < 9216)
    {
        printf("%i", idx);
        glm::vec3 test = input1[idx];  //<---------------This line leads to problems in my implementation
        glm::vec3 test2 = input2[idx];
        printf("---\n");
        printf("%f %f %f\n", test[0], test[1], test[2]);
        printf("%f %f %f\n", test2[0], test2[1], test2[2]);
    }
}

int main()
{
    std::vector<glm::vec3> data1;
    for (int i = 0; i < 9216; i++)
    {
        data1.emplace_back(glm::vec3(3));
    }

    glm::vec3 *gpu_data1;
    cudaMalloc((void **)&gpu_data1, data1.size() * sizeof(glm::vec3));
    cudaMemcpy(gpu_data1, data1.data(), data1.size() * sizeof(glm::vec3), cudaMemcpyHostToDevice);

    std::vector<glm::vec3> data2;
    for (int i = 0; i < 9216; i++)
    {
        data2.emplace_back(glm::vec3(i));
    }

    glm::vec3 *gpu_data2;
    cudaMalloc((void **)&gpu_data2, data2.size() * sizeof(glm::vec3));
    cudaMemcpy(gpu_data2, data2.data(), data2.size() * sizeof(glm::vec3), cudaMemcpyHostToDevice);

    int blockSize = 1024;
    int size = (data1.size() + blockSize - 1) / blockSize;
    printf("size %i\n", size);
    dim3 fullBlocksPerGrid(10);

    printf("running kernel\n");
    kernTest<<<fullBlocksPerGrid, blockSize>>>(gpu_data1, gpu_data2);
    cudaDeviceSynchronize();
}

Top Error

========= Invalid __global__ read of size 4
=========     at 0x00000150 in kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*)
=========     by thread (575,0,0) in block (7,0,0)
=========     Address 0x7fd520416af4 is out of bounds
=========     Device Frame:kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*) (kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*) : 0x150)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x20d24a]
=========     Host Frame:Ada [0x397a9]
=========     Host Frame:Ada [0x39837]
=========     Host Frame:Ada [0x6fb85]
=========     Host Frame:Ada [0x22118]
=========     Host Frame:Ada [0x210be]
=========     Host Frame:Ada [0x2110d]
=========     Host Frame:Ada [0x1fefe]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000150 in kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*)
=========     by thread (574,0,0) in block (7,0,0)
=========     Address 0x7fd520416ae8 is out of bounds
=========     Device Frame:kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*) (kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*) : 0x150)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x20d24a]
=========     Host Frame:Ada [0x397a9]
=========     Host Frame:Ada [0x39837]
=========     Host Frame:Ada [0x6fb85]
=========     Host Frame:Ada [0x22118]
=========     Host Frame:Ada [0x210be]
=========     Host Frame:Ada [0x2110d]
=========     Host Frame:Ada [0x1fefe]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000150 in kernTest(glm::vec<int=3, float, glm::qualifier>*, glm::vec<int=3, float, glm::qualifier>*)


End of Error Message

========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3db593]
=========     Host Frame:Ada [0x57b46]
=========     Host Frame:Ada [0x1ff03]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaLaunchKernel. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3db593]
=========     Host Frame:Ada [0x6fbc5]
=========     Host Frame:Ada [0x22118]
=========     Host Frame:Ada [0x20b45]
=========     Host Frame:Ada [0x20b9f]
=========     Host Frame:Ada [0x1ff7d]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3db593]
=========     Host Frame:Ada [0x57b46]
=========     Host Frame:Ada [0x1ff82]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFuncGetAttributes. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3db593]
=========     Host Frame:Ada [0x66b29]
=========     Host Frame:Ada [0x25575]
=========     Host Frame:Ada [0x26edb]
=========     Host Frame:Ada [0x26ae5]
=========     Host Frame:Ada [0x268b5]
=========     Host Frame:Ada [0x266aa]
=========     Host Frame:Ada [0x257a1]
=========     Host Frame:Ada [0x20073]
=========     Host Frame:Ada [0x1546a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:Ada [0x14f8e]
=========
after reduction step 1: cudaErrorLaunchFailure: unspecified launch failure
========= ERROR SUMMARY: 68 errors

The errors you are showing here are out-of-bounds errors. Those should not be difficult to diagnose. It means that the computed index (idx), when added via pointer arithmetic to the base pointer (input1), results in a numerical address that does not fall within the valid allocated region for the allocation associated with input1 (for your sample code here, that would be the allocation made for gpu_data1).

Since we're talking about code you have not shown, one possible reason might be that your allocation size for gpu_data1 varies from one run to another and is inconsistent with the size test (or grid size). In that case, differing numbers of threads would be "out of bounds." But we need not even propose that. This is a massively thread-parallel machine, running under an instrumented tool (cuda-memcheck). It's a bit like starting a large race with many runners, having someone at the finish line holding a starter's pistol, telling that person, "when you see someone cross the finish line, fire your gun," and then telling another person, "when you hear the gun, count how many people are across the finish line." There's no reason to expect this experiment to produce the same, or even predictable, results from one run to the next.
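One way to guard against that kind of mismatch is to stop hard-coding the element count (the 9216 in the kernel) and instead pass it as a kernel argument derived from the same value used for cudaMalloc, while checking the return status of every CUDA call. The sketch below illustrates the idea only; the `CUDA_CHECK` macro and the `n` parameter are illustrative names, not anything from your code:

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative helper: abort with a message if a CUDA API call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// The bound is a kernel argument instead of a hard-coded 9216,
// so it always matches the size used for the allocation.
__global__ void kernTest(const glm::vec3 *input1, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        glm::vec3 v = input1[idx];
        printf("%f %f %f\n", v[0], v[1], v[2]);
    }
}

int main()
{
    // n would come from the actual frame, which may differ per run.
    std::vector<glm::vec3> data1(9216, glm::vec3(3));
    int n = static_cast<int>(data1.size());

    glm::vec3 *gpu_data1;
    CUDA_CHECK(cudaMalloc(&gpu_data1, n * sizeof(glm::vec3)));
    CUDA_CHECK(cudaMemcpy(gpu_data1, data1.data(),
                          n * sizeof(glm::vec3), cudaMemcpyHostToDevice));

    int blockSize = 1024;
    int gridSize = (n + blockSize - 1) / blockSize;  // derived, not hard-coded

    kernTest<<<gridSize, blockSize>>>(gpu_data1, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors

    CUDA_CHECK(cudaFree(gpu_data1));
}
```

With every call checked, a size mismatch surfaces as a clear error at the first failing call rather than as a variable number of out-of-bounds reports from whichever threads happen to run before the fault propagates.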