VisionWorks+CUDA Segmentation Fault

Hi,

I am trying to use VisionWorks in combination with CUDA on the TX2, but it crashes on the very first VisionWorks function call after creating the context. Here is a small code sample that already does not work. When I remove the CUDA code, it does not crash.

main.cpp:

#include <cstdlib>
#include <iostream>
#include <VX/vx.h>

int main(int argc, char **argv)
{
    vx_context context = vxCreateContext();
    if (vxDirective(reinterpret_cast<vx_reference>(context), VX_DIRECTIVE_ENABLE_PERFORMANCE) != VX_SUCCESS)
        std::cerr << "Could not enable performance measurement!" << std::endl;

    return EXIT_SUCCESS;
}

kernel.cu:

#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *result, size_t size)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // guard against threads beyond the end of the arrays
    if (idx < size)
        result[idx] = a[idx] + b[idx];
}
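
For reference, the build boils down to something like this (a simplified sketch; include and library paths are the CUDA 8.0 defaults and may differ slightly from the actual project setup):

nvcc -c kernel.cu -o kernel.o
g++ -c main.cpp -o main.o -I/usr/local/cuda/include
g++ main.o kernel.o -o playground -L/usr/local/cuda/lib64 -lcudart -lvisionworks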

The segmentation fault occurs already during the vxDirective call. The backtrace is not meaningful:

#0  0x0000007fb76b1f90 in ?? () from /usr/lib/libvisionworks.so.1.6
#1  0x0000007fb76b2120 in ?? () from /usr/lib/libvisionworks.so.1.6
#2  0x0000007fb768ecd8 in vxDirective () from /usr/lib/libvisionworks.so.1.6
#3  0x0000005555556390 in main (argc=1, argv=0x7fffffef08)
    at /home/nvidia/Development/osp/src/apps/playground/src/main.cpp:7

The application is linked only against CUDA and VisionWorks, so there should not be any incompatibilities with other libraries. See the output of ldd:

linux-vdso.so.1 =>  (0x0000007fb31b3000)
	libcudart.so.8.0 => /usr/local/cuda/lib64/libcudart.so.8.0 (0x0000007fb3139000)
	libvisionworks.so.1.6 => /usr/lib/libvisionworks.so.1.6 (0x0000007fb278d000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007fb25fe000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007fb24b7000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007fb24a3000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007fb2477000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007fb2460000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007fb23b2000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007fb2391000)
	/lib/ld-linux-aarch64.so.1 (0x0000005559929000)

So, do you have any suggestions on how to solve this issue? Thanks!

Hi,

For a VisionWorks + CUDA sample, please check our CUDA Layer Object Tracker sample app.
You can find it at ‘/usr/share/visionworks/sources/samples/object_tracker_nvxcu/’.

To give further suggestions, could you share the complete source code
(e.g. buffer allocation, kernel launch, …)?

Thanks.

What I posted is the complete source code. The application crashes as soon as the CUDA code is compiled into the executable; the kernel is never called. When I remove the kernel code, the application runs without crashing.
I did some more investigation and found out that this only happens when using the shared library of the CUDA runtime. With the static one, the crash does not occur. Is there a conflict between VisionWorks and the shared CUDA runtime library?
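
For reference, the only difference between the two variants is the link line (a sketch; paths are the default CUDA 8.0 locations):

# shared CUDA runtime -> crashes in vxDirective
g++ main.o kernel.o -o playground -L/usr/local/cuda/lib64 -lcudart -lvisionworks

# static CUDA runtime -> no crash
g++ main.o kernel.o -o playground -L/usr/local/cuda/lib64 -lcudart_static -lpthread -ldl -lrt -lvisionworks

When building directly with nvcc, the equivalent switch is --cudart shared vs. --cudart static (static is the default).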

I tried to continue with the static library of the CUDA runtime. Now the application does not crash, but I am still not able to run a kernel in a custom VisionWorks node. After launching the kernel, cudaGetLastError() returns an unknown error (30). The same code was running on a TX1 without problems. Could this be caused by mixing the static and shared CUDA runtime (I assume that VisionWorks links against the shared one)?
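
In case it matters, this is roughly how I check for errors around the launch (a minimal sketch of the checking pattern, not the actual node code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with the CUDA error string when a runtime call fails
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        const cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err));                    \
            std::abort();                                             \
        }                                                             \
    } while (0)

// usage after a kernel launch:
//   add<<<blocks, threads>>>(a, b, result, size);
//   CUDA_CHECK(cudaGetLastError());       // reports launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // reports execution errors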

If you have any suggestions for further debugging, I would appreciate them.

Hi,

Is your file a .cu file?
Please remember to put kernel code in a .cu file.

After slightly modifying your source, we can run the program successfully on a TX2.

#include <iostream>
#include <VX/vx.h>

__global__ void add(const float *a, const float *b, float *result, size_t size)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // guard against threads beyond the end of the arrays
    if (idx < size)
        result[idx] = a[idx] + b[idx];
}

int main(int argc, char **argv)
{
    vx_context context = vxCreateContext();
    if (vxDirective(reinterpret_cast<vx_reference>(context), VX_DIRECTIVE_ENABLE_PERFORMANCE) != VX_SUCCESS)
        std::cerr << "Could not enable performance measurement!" << std::endl;

    int N = 10;
    float *a;
    float *b;
    float *result;
    cudaMallocManaged(&a, N*sizeof(float));
    cudaMallocManaged(&b, N*sizeof(float));
    cudaMallocManaged(&result, N*sizeof(float));

    for(int i=0; i<N; i++){
        a[i] = i;
        b[i] = i;
    }

    cudaDeviceSynchronize();
    add<<<1,N>>>(a, b, result, N);
    cudaDeviceSynchronize();

    for(int i=0; i<N; i++) std::cout << i << ": " << result[i] << std::endl;

    return 0;
}
nvcc topic_1024913.cu -o test -lvisionworks && ./test

Thanks.

Yes, my kernel is in a .cu file. Your code compiles fine, but when I try to run it, it also gives me a segmentation fault. That’s strange. I also reinstalled VisionWorks and CUDA from JetPack, but that did not help either.

This is the output when running your example with cuda-memcheck:

========= CUDA-MEMCHECK
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMallocManaged. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:test [0x40258]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMallocManaged. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:test [0x40258]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMallocManaged. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:test [0x40258]
=========
========= Error: process didn't terminate successfully
=========        The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= Internal error (20)
========= No CUDA-MEMCHECK results found

OK, I removed the VisionWorks part from your example and it still crashes. So the issue seems to be related to CUDA and not to VisionWorks.
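
To narrow it down further, even a bare initialization check should show whether the runtime can create a device context at all. Something along these lines (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    std::printf("cudaGetDeviceCount: %s (count = %d)\n", cudaGetErrorString(err), count);

    // cudaFree(0) triggers lazy creation of the primary context
    err = cudaFree(0);
    std::printf("context creation:   %s\n", cudaGetErrorString(err));

    return 0;
}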

Hi,

Could you help us test the vectorAdd sample?

$ cd ~/NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd
$ make
$ ./vectorAdd

Thanks.

Hi,

The vectorAdd sample works. This is the output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Hi,

Just a short update:

I tried to reproduce the issue on a second TX2 mounted on the NVIDIA devkit. There, your sample works fine. The only visible difference between the two setups is the carrier board: the non-working TX2 is mounted on a carrier board from Leopard Imaging. We connected three cameras to that carrier, so we had to apply some small modifications to the kernel and the device tree. I also ran the tests with the adapted kernel/device tree on the NVIDIA devkit; CUDA works fine there, so I don’t think that the issue is caused by the hardware.

Is there anything else I can try to debug this issue? I could do a full reflash of the TX2, but I would like to avoid that and instead find the cause of the issue. I want to know what went wrong so that I can avoid it in the future.
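
One thing I can still compare between the two boards is whether the driver and runtime versions match. A small sketch for that:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // version supported by the installed driver stack
    cudaRuntimeGetVersion(&runtime);  // version of the CUDA runtime the binary uses
    std::printf("driver: %d, runtime: %d\n", driver, runtime);
    return 0;
}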

Hi,

We want to clarify first:

The result of the vectorAdd sample is from the original device, right?
So you can run our CUDA sample, but hit an error with the example in comment #5?

Thanks.

Exactly. vectorAdd works, the example in #5 does not work on the original device.

Hi,

Could you tell us which Leopard carrier board you use?
Is it a camera expansion board or an entire carrier board with a multi-camera design?

We use the LI-TX1-CB carrier board and the IMX185 camera modules. See here:
https://www.leopardimaging.com/LI-TX1-CB-IMX185M12-T.html

Hi,

Could you share the changes with us for debugging?
Thanks.

Please find the applied patches here: https://box.tu-chemnitz.de/index.php/s/iw7EOGOU9GvPP7L
They are based on tegra-l4t-r28.1.

Hi,

Could you replace the cudaMallocManaged API with cudaMalloc in comment #5 and recheck?
Thanks.

With cudaMalloc the application does not crash, but it still does not work (wrong results). cuda-memcheck outputs similar errors:

========= CUDA-MEMCHECK
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMalloc. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x40158]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMalloc. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x40158]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMalloc. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x40158]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMemcpy. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x490ec]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMemcpy. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x490ec]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x39e70]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x4cf68]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x39e70]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaMemcpy. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2d6984]
=========     Host Frame:./test [0x490ec]
=========
0: 6.37001e-39
1: 0
2: 0
3: 0
4: -1.80935e+31
5: 1.77965e-43
6: -2.96698e-12
7: 1.77965e-43
8: 6.37001e-39
9: 0
========= ERROR SUMMARY: 9 errors

This is the code I was running:

#include <iostream>
#include <cstdio>

__global__ void add(const float *a, const float *b, float *result, size_t size)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // guard against threads beyond the end of the arrays
    if (idx < size)
    {
        result[idx] = a[idx] + b[idx];

        printf("%f", result[idx]);
    }
}

int main(int argc, char **argv)
{
    const int N = 10;
    float *a;
    float *b;
    float *result;
    cudaMalloc(&a, N*sizeof(float));
    cudaMalloc(&b, N*sizeof(float));
    cudaMalloc(&result, N*sizeof(float));

    float aHost[N];
    float bHost[N];
    float resultHost[N];

    for(int i=0; i<N; i++){
        aHost[i] = i;
        bHost[i] = i;
    }
    
    cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, bHost, N*sizeof(float), cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();
    add<<<1,N>>>(a, b, result, N);
    cudaDeviceSynchronize();

    cudaMemcpy(resultHost, result, N*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i=0; i<N; i++) std::cout << i << ": " << resultHost[i] << std::endl;

    return 0;
}
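
As a next step I will wrap each call so that the first failing API call is reported explicitly, roughly like this (a sketch, applied to the code above):

cudaError_t err = cudaMalloc(&a, N * sizeof(float));
if (err != cudaSuccess)
    std::cerr << "cudaMalloc(a) failed: " << cudaGetErrorString(err) << std::endl;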

Hi,

What are your results?
On our side, the program outputs the expected answer:

nvidia@tegra-ubuntu:~$ ./test 
0.0000002.0000004.0000006.0000008.00000010.00000012.00000014.00000016.00000018.0000000: 0
1: 2
2: 4
3: 6
4: 8
5: 10
6: 12
7: 14
8: 16
9: 18

You can see it at the end of the cuda-memcheck log in #18. The output is:

0: 6.37001e-39
1: 0
2: 0
3: 0
4: -1.80935e+31
5: 1.77965e-43
6: -2.96698e-12
7: 1.77965e-43
8: 6.37001e-39
9: 0