Black frames with Nvidia GPU and OpenCL (Kernel output is zero)

Hi,
I’m having problems using the OpenCL implementation that is installed with CUDA on an Nvidia GPU. OpenCL is used by a custom application that performs signal processing operations. The application exits normally, but the OpenCL kernel output is always 0.

The application has been tested on an i.MX8-based board and runs as expected. It also runs as expected on the desktop when the Intel OpenCL runtime SDK is used. So I suspect the problem is related to the Nvidia implementation of OpenCL, or that some dependencies are not met.

The OS is Ubuntu 16.04 with the 4.15.0-55-generic kernel. The installed Nvidia platform reports its version as OpenCL 1.2 CUDA 10.1.152.

Any clue or idea will be helpful.

I followed these steps for the installation:

Download and installation:

  • Download: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=deblocal
  • sudo dpkg -i cuda-repo-ubuntu1604-10-1-local-10.1.168-418.67_1.0-1_amd64.deb
  • sudo apt-key add /var/cuda-repo-*/7fa2af80.pub
  • sudo apt-get update
  • sudo apt-get install cuda
Install some required dependencies:

  • sudo apt-get install ocl-icd-dev
  • sudo apt-get install ocl-icd-libopencl1
  • sudo apt-get install ocl-icd-opencl-dev
The result from the clinfo command is the following:

    Number of platforms                               2
      Platform Name                                   NVIDIA CUDA
      Platform Vendor                                 NVIDIA Corporation
      Platform Version                                OpenCL 1.2 CUDA 10.1.152
      Platform Profile                                FULL_PROFILE
      Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
      Platform Extensions function suffix             NV
    
      Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications
      Platform Vendor                                 Intel(R) Corporation
      Platform Version                                OpenCL 2.1 LINUX
      Platform Profile                                FULL_PROFILE
      Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint 
      Platform Host timer resolution                  1ns
      Platform Extensions function suffix             INTEL
    
      Platform Name                                   NVIDIA CUDA
    Number of devices                                 1
      Device Name                                     GeForce GTX 1050
      Device Vendor                                   NVIDIA Corporation
      Device Vendor ID                                0x10de
      Device Version                                  OpenCL 1.2 CUDA
      Driver Version                                  418.67
      Device OpenCL C Version                         OpenCL C 1.2 
      Device Type                                     GPU
      Device Profile                                  FULL_PROFILE
      Device Topology (NV)                            PCI-E, 01:00.0
      Max compute units                               5
      Max clock frequency                             1493MHz
      Compute Capability (NV)                         6.1
      Device Partition                                (core)
        Max number of sub-devices                     1
        Supported partition types                     None
      Max work item dimensions                        3
      Max work item sizes                             1024x1024x64
      Max work group size                             1024
      Preferred work group size multiple              32
      Warp size (NV)                                  32
      Preferred / native vector sizes                 
        char                                                 1 / 1       
        short                                                1 / 1       
        int                                                  1 / 1       
        long                                                 1 / 1       
        half                                                 0 / 0        (n/a)
        float                                                1 / 1       
        double                                               1 / 1        (cl_khr_fp64)
      Half-precision Floating-point support           (n/a)
      Single-precision Floating-point support         (core)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 Yes
        Round to infinity                             Yes
        IEEE754-2008 fused multiply-add               Yes
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  Yes
      Double-precision Floating-point support         (cl_khr_fp64)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 Yes
        Round to infinity                             Yes
        IEEE754-2008 fused multiply-add               Yes
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  No
      Address bits                                    64, Little-Endian
      Global memory size                              4238737408 (3.948GiB)
      Error Correction support                        No
      Max memory allocation                           1059684352 (1011MiB)
      Unified memory for Host and Device              No
      Integrated memory (NV)                          No
      Minimum alignment for any data type             128 bytes
      Alignment of base address                       4096 bits (512 bytes)
      Global Memory cache type                        Read/Write
      Global Memory cache size                        81920
      Global Memory cache line                        128 bytes
      Image support                                   Yes
        Max number of samplers per kernel             32
        Max size for 1D images from buffer            134217728 pixels
        Max 1D or 2D image array size                 2048 images
        Max 2D image size                             16384x32768 pixels
        Max 3D image size                             16384x16384x16384 pixels
        Max number of read image args                 256
        Max number of write image args                16
      Local memory type                               Local
      Local memory size                               49152 (48KiB)
      Registers per block (NV)                        65536
      Max constant buffer size                        65536 (64KiB)
      Max number of constant args                     9
      Max size of kernel argument                     4352 (4.25KiB)
      Queue properties                                
        Out-of-order execution                        Yes
        Profiling                                     Yes
      Prefer user sync for interop                    No
      Profiling timer resolution                      1000ns
      Execution capabilities                          
        Run OpenCL kernels                            Yes
        Run native kernels                            No
        Kernel execution timeout (NV)                 Yes
      Concurrent copy and kernel execution (NV)       Yes
        Number of async copy engines                  2
      printf() buffer size                            1048576 (1024KiB)
      Built-in kernels                                
      Device Available                                Yes
      Compiler Available                              Yes
      Linker Available                                Yes
      Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
    
      Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications
    Number of devices                                 1
      Device Name                                     Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
      Device Vendor                                   Intel(R) Corporation
      Device Vendor ID                                0x8086
      Device Version                                  OpenCL 2.1 (Build 0)
      Driver Version                                  18.1.0.0920
      Device OpenCL C Version                         OpenCL C 2.0 
      Device Type                                     CPU
      Device Profile                                  FULL_PROFILE
      Max compute units                               8
      Max clock frequency                             2800MHz
      Device Partition                                (core)
        Max number of sub-devices                     8
        Supported partition types                     by counts, equally, by names (Intel)
      Max work item dimensions                        3
      Max work item sizes                             8192x8192x8192
      Max work group size                             8192
      Preferred work group size multiple              128
      Max sub-groups per work group                   1
      Preferred / native vector sizes                 
        char                                                 1 / 32      
        short                                                1 / 16      
        int                                                  1 / 8       
        long                                                 1 / 4       
        half                                                 0 / 0        (n/a)
        float                                                1 / 8       
        double                                               1 / 4        (cl_khr_fp64)
      Half-precision Floating-point support           (n/a)
      Single-precision Floating-point support         (core)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 No
        Round to infinity                             No
        IEEE754-2008 fused multiply-add               No
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  No
      Double-precision Floating-point support         (cl_khr_fp64)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 Yes
        Round to infinity                             Yes
        IEEE754-2008 fused multiply-add               Yes
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  No
      Address bits                                    64, Little-Endian
      Global memory size                              12350988288 (11.5GiB)
      Error Correction support                        No
      Max memory allocation                           3087747072 (2.876GiB)
      Unified memory for Host and Device              Yes
      Shared Virtual Memory (SVM) capabilities        (core)
        Coarse-grained buffer sharing                 Yes
        Fine-grained buffer sharing                   Yes
        Fine-grained system sharing                   Yes
        Atomics                                       Yes
      Minimum alignment for any data type             128 bytes
      Alignment of base address                       1024 bits (128 bytes)
      Preferred alignment for atomics                 
        SVM                                           64 bytes
        Global                                        64 bytes
        Local                                         0 bytes
      Max size for global variable                    65536 (64KiB)
      Preferred total size of global vars             65536 (64KiB)
      Global Memory cache type                        Read/Write
      Global Memory cache size                        262144
      Global Memory cache line                        64 bytes
      Image support                                   Yes
        Max number of samplers per kernel             480
        Max size for 1D images from buffer            192984192 pixels
        Max 1D or 2D image array size                 2048 images
        Base address alignment for 2D image buffers   64 bytes
        Pitch alignment for 2D image buffers          64 bytes
        Max 2D image size                             16384x16384 pixels
        Max 3D image size                             2048x2048x2048 pixels
        Max number of read image args                 480
        Max number of write image args                480
        Max number of read/write image args           480
      Max number of pipe args                         16
      Max active pipe reservations                    32767
      Max pipe packet size                            1024
      Local memory type                               Global
      Local memory size                               32768 (32KiB)
      Max constant buffer size                        131072 (128KiB)
      Max number of constant args                     480
      Max size of kernel argument                     3840 (3.75KiB)
      Queue properties (on host)                      
        Out-of-order execution                        Yes
        Profiling                                     Yes
        Local thread execution (Intel)                Yes
      Queue properties (on device)                    
        Out-of-order execution                        Yes
        Profiling                                     Yes
        Preferred size                                4294967295 (4GiB)
        Max size                                      4294967295 (4GiB)
      Max queues on device                            4294967295
      Max events on device                            4294967295
      Prefer user sync for interop                    No
      Profiling timer resolution                      1ns
      Execution capabilities                          
        Run OpenCL kernels                            Yes
        Run native kernels                            Yes
        Sub-group independent forward progress        No
        IL version                                    SPIR-V_1.0
        SPIR versions                                 1.2
      printf() buffer size                            1048576 (1024KiB)
      Built-in kernels                                
      Device Available                                Yes
      Compiler Available                              Yes
      Linker Available                                Yes
      Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint 
    
    NULL platform behavior
      clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
      clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
      clCreateContext(NULL, ...) [default]            No platform
      clCreateContext(NULL, ...) [other]              Success [NV]
      clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
      clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
      clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
      clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
      clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
    	NOTE:	your OpenCL library only supports OpenCL 2.0,
    		but some installed platforms support OpenCL 2.1.
    		Programs using 2.1 features may crash
    		or behave unexepectedly
    

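    In case someone is stuck with the same or a similar problem: it is related to the way memory is handled by the GPU.

    You need to verify that the program is actually running on the GPU, and perform the read/write operations against GPU memory space. This seems consistent with the clinfo output above: the GTX 1050 reports "Unified memory for Host and Device: No", while the Intel CPU runtime reports "Yes", so code that expects the kernel's results to appear directly in host memory can work on the CPU and still read back zeros on the discrete GPU. See https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html and the flags CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR, and CL_MEM_COPY_HOST_PTR.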

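    // Use GPU memory if Nvidia GPU support is enabled
    if (nvidia_support == TRUE) {
        inputGPUMemory = cldev_inbuffer;

        // Copy the host input into the device buffer. The write is non-blocking
        // (CL_FALSE), so inputData must stay valid until the copy has completed.
        clEnqueueWriteBuffer(queue, inputGPUMemory, CL_FALSE, 0, size, inputData, 0, NULL, NULL);
    } else {
        // Your old stuff
    }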
    

    And after the kernel has processed the data:

    // Retrieve the output buffer if the GPU is being used. The read is blocking
    // (CL_TRUE), so outputData is ready to use once the call returns.
    if (nvidia_support == TRUE) {
        clEnqueueReadBuffer(queue, cldev_outbuffer, CL_TRUE, 0, size, outputData, 0, NULL, NULL);
    }