Concurrent Kernel executions & Data Transfers

Hi, I am trying to do some image corrections on the GPU for N images. There are 3 kernels that need to be invoked for each image; let's say I have 3 kernels (A, B, C). For a particular image, I first have to invoke kernel A, then B, then C. These kernel executions must be ordered, i.e. execution of kernel A should be completed before B starts, and similarly execution of B should be completed before C starts.
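
As background: kernel launches issued into the same CUDA stream execute in issue order, so the A-before-B-before-C requirement is satisfied automatically as long as all three kernels for a given image are launched into one stream. A minimal sketch of that guarantee (kernel names, bodies, and arguments are hypothetical stand-ins):

#include <cuda_runtime.h>

__global__ void A(float *img, int n) { /* hypothetical correction step 1 */ }
__global__ void B(float *img, int n) { /* hypothetical correction step 2 */ }
__global__ void C(float *img, int n) { /* hypothetical correction step 3 */ }

int main()
{
  const int n = 3 * 1912 * 1192;      // one RGB float image, sizes from the post
  float *d_img;
  cudaStream_t stream;
  cudaMalloc(&d_img, n * sizeof(float));
  cudaStreamCreate(&stream);

  // Launches in one stream are serialized in issue order:
  // A completes before B starts, and B completes before C starts.
  A<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
  B<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
  C<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaFree(d_img);
  return 0;
}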
I am completely new to CUDA programming, and I have started with the approach below.

Kernels - A,B,C
Images - IMG1, IMG2, IMG3, IMG4, IMG5
I am following the below sequence for the executions
(IMG1 HostToDevice → IMG1 A → IMG1 B → IMG1 C → IMG1 DeviceToHost → IMG2 HostToDevice → IMG2 A → IMG2 B → IMG2 C → IMG2 DeviceToHost, and so on…)

All the above operations are synchronous, and the whole pipeline takes a lot of time because of this. When I searched for concurrent kernel executions and transfers, I came to know about streams, and I have tried to use CUDA streams to get concurrent kernel executions and data transfers. But I didn't see any difference in the timings before and after using streams. Here is the code snippet I am using:

int main()
{
  std::vector<cv::Mat> images; // assume that it has 5 images
  int number_of_channels = 3;
  int image_width = 1912;
  int image_height = 1192;
  int original_img_size = sizeof(float) * number_of_channels * image_width * image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_processed_images;

  // Allocate memory in the CPU
  m_host_processed_images = (float **)malloc(sizeof(float *) * 5);
  
  // Allocate memory in the GPU
  cudaMalloc(&m_device_original_images, sizeof(float *) * 5);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * 5);

  for(int idx=0;idx<5;++idx)
  {
    m_host_processed_images[idx] = (float *)malloc(original_img_size);
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);
  }

    int nStreams = 5;    // preparation for streams
    cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
    for(int ii=0;ii<nStreams;ii++){
        cudaStreamCreate(&streams[ii]);
    }

    // copy image data from cpu to gpu asynchronously for all the images
    float ** temp_cpu_pointer_original_images = (float **)malloc(sizeof(float *));
    float ** temp_cpu_pointer_processed_images = (float **)malloc(sizeof(float *));

      //  Set up block and grid dimensions
      dim3 threads_per_block(16, 16);
      dim3 blocks_per_grid((int)ceil((image_width)/16) + 1,
                          (int)ceil((image_height)/16) + 1);

    for(int idx=0;idx<5; ++idx)
    {
      temp_cpu_pointer_original_images[idx] = (float *)malloc(sizeof(float));
      temp_cpu_pointer_processed_images[idx] = (float *)malloc(sizeof(float));

      cudaMemcpy(&temp_cpu_pointer_original_images[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
      cudaMemcpy(&temp_cpu_pointer_processed_images[idx], &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    }

    for(int idx=0; idx < 5; ++idx)
    {
      cudaMemcpyAsync(temp_cpu_pointer_original_images[idx], images[idx].data,
                      (sizeof(float) * image_width * image_height * number_of_channels),
                      cudaMemcpyHostToDevice, streams[idx]);
      // Invoke Kernels
      A<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
      B<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
      C<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
     
      cudaMemcpyAsync(m_host_processed_images[idx], temp_cpu_pointer_processed_images[idx],
                (sizeof(float) * image_width * image_height * number_of_channels),
                cudaMemcpyDeviceToHost, streams[idx]);
    }

    for(int idx=0;idx<5;++idx)
    {
      cudaStreamSynchronize(streams[idx]);
    }

    for(int idx=0;idx<5;++idx)
    {
       cv::Mat final_image(image_height, image_width, CV_8UC3,(void *) m_host_processed_images[idx]);
      images[idx] = final_image;
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }

}

With the above code snippet, I am not seeing any difference in the timings before and after using streams. Please let me know if I am doing anything wrong, or whether there is another way I can do these executions (something like a change in how I am storing or transferring the images) so that the overall time will be lower. Thanks in advance!

Hardware being used: GeForce 3060

Using “concurrent kernel execution” won't help with throughput if each kernel already saturates the GPU (uses up its compute resources). For reasonably sized images, this is likely to be the case.

Using streams does not automatically guarantee better performance than not using them.

To overlap data transfer with kernel execution (cudaMemcpyAsync), you generally must be transferring to or from pinned memory. I don't see any use of pinned memory in your posted code.
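
A minimal sketch of the pinned-memory variant (buffer names are hypothetical; the image dimensions are taken from your code):

#include <cuda_runtime.h>

int main()
{
  const size_t bytes = sizeof(float) * 3 * 1912 * 1192;
  float *h_buf, *d_buf;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMalloc(&d_buf, bytes);

  // Pageable host memory (plain malloc) would force cudaMemcpyAsync to
  // behave like a blocking copy; pinned (page-locked) memory allows the
  // copy to run truly asynchronously and overlap work in other streams.
  cudaMallocHost(&h_buf, bytes);

  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  cudaStreamDestroy(stream);
  return 0;
}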

You can get an orderly treatment of this topic by studying section 7 of this online CUDA training series. That section and its homework demonstrate an application that overlaps copy and compute. Speaking for myself, I'm unlikely to provide a detailed tutorial here on topics that are already covered there.

Furthermore, you can find numerous questions on various forums that cover copy/compute overlap; many of them point out the need for pinned memory, as sketched below.
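
Putting the two together, the usual copy/compute overlap pattern looks roughly like this (one stream per image; the process kernel and the sizes are hypothetical stand-ins):

#include <cuda_runtime.h>

__global__ void process(float *img, int n) { /* hypothetical kernel body */ }

int main()
{
  const int num_images = 5;
  const int n = 3 * 1912 * 1192;
  const size_t bytes = sizeof(float) * n;

  cudaStream_t streams[num_images];
  float *h_in[num_images], *h_out[num_images], *d_img[num_images];

  for (int i = 0; i < num_images; ++i) {
    cudaStreamCreate(&streams[i]);
    cudaMallocHost(&h_in[i], bytes);   // pinned host staging buffers
    cudaMallocHost(&h_out[i], bytes);
    cudaMalloc(&d_img[i], bytes);
  }

  // Each image's H2D copy -> kernel -> D2H copy chain is ordered within
  // its own stream; chains in different streams are free to overlap.
  for (int i = 0; i < num_images; ++i) {
    cudaMemcpyAsync(d_img[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
    process<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_img[i], n);
    cudaMemcpyAsync(h_out[i], d_img[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < num_images; ++i) {
    cudaFree(d_img[i]);
    cudaFreeHost(h_in[i]);
    cudaFreeHost(h_out[i]);
    cudaStreamDestroy(streams[i]);
  }
  return 0;
}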

To verify your own work in this area (concurrency), you should get familiar with a profiler such as Nsight Systems.
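
For example (assuming Nsight Systems is installed; ./my_app stands in for your binary), the timeline it records shows whether copies and kernels actually overlap:

$ nsys profile -o report ./my_app
$ nsys stats report.nsys-rep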

Thanks @Robert_Crovella for the help. I tried using pinned memory for the overlapped data transfers. The code below works in such a way that I allocate the memory on the GPU only once, and then in a loop I can do the transfers any number of times.
If I iterate once, the code works perfectly fine. But if I iterate more than once (2 or 3 times), my code crashes with the output below:
“corrupted size vs. prev_size” Aborted

Please help me understand what I am doing wrong.
Also, one thing I noticed: when I removed the memory allocation for the variable “m_device_processed_images” (I am not using this allocated memory for now, but I need it for other kernel executions), the code worked perfectly fine for any number of iterations.
Please find the code snippet below:

//
// Created by adminspin on 14/6/22.
//
#include <bits/stdc++.h>
#include <opencv2/opencv.hpp>
#include <cuda.h>
#include <chrono>
#include <thread>

inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", 
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}


int main(int argc, char** argv)
{
  
  std::vector<cv::Mat> stacks; // assume it has 5 images in it

  int m_stack_size = 5;
  int m_image_width = 1912;
  int m_image_height = 1192;
  int m_number_of_channels = 3;

  int original_img_size = sizeof(float) * m_number_of_channels * m_image_width * m_image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_images;

  cudaMalloc(&m_device_original_images, sizeof(float *) * m_stack_size);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * m_stack_size);

  checkCuda(cudaMallocHost(&m_host_images, sizeof(float *) * m_stack_size));

  for(int idx=0;idx<m_stack_size;++idx)
  {
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);

    float * g;
    cudaMallocHost(&g, original_img_size);
    m_host_images[idx] = g;
  }

  cudaStream_t *m_streams = (cudaStream_t *)malloc(m_stack_size * sizeof(cudaStream_t));
  
  for(int ii=0;ii<m_stack_size;ii++){
    cudaStreamCreate(&m_streams[ii]);
  }
  // For loop to run the code any number of times 
  for(int num_of_itr=0;num_of_itr<1;num_of_itr++) // If I make it run 2 or 3 times, the code crashes
  {
    float ** temp_device_original_pointer = (float **)malloc(sizeof(float *));

    for(int idx=0;idx<m_stack_size; idx++)
    {
      temp_device_original_pointer[idx] = (float *)malloc(sizeof(float));
      cudaMemcpy(&temp_device_original_pointer[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
    }

    for(int idx =0;idx<m_stack_size;idx++)
    {
      memcpy(m_host_images[idx], stacks[idx].data, (sizeof(float) * m_image_width * m_image_height * m_number_of_channels));
    }

    for(int idx=0; idx < m_stack_size; idx++)
    {
      cudaMemcpyAsync(temp_device_original_pointer[idx], m_host_images[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyHostToDevice, m_streams[idx]);

      cudaMemcpyAsync(m_host_images[idx], temp_device_original_pointer[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyDeviceToHost, m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cudaStreamSynchronize(m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cv::Mat final_image(m_image_height, m_image_width, CV_32FC3,(void *) m_host_images[idx]);
      final_image.convertTo(final_image, CV_8UC3);

      final_image.release();
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 0;
    }
  }

  for(int idx=0; idx < 5; ++idx)
  {
    float * a = (float *)malloc(sizeof(float));
    cudaMemcpy(&a, &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    float * b = (float *)malloc(sizeof(float));
    cudaMemcpy(&b, &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFreeHost(m_host_images[idx]);
  }

  return 0;

}

When I run the following adaptation of your code, which removes the OpenCV dependencies, I don't have any trouble with it:

$ cat t2206.cu
#include <cuda.h>
#include <chrono>
#include <thread>
#include <vector>
#include <iostream>

inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n",
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}


int main(int argc, char** argv)
{

//  std::vector<cv::Mat> stacks; // assume it has 5 images in it

  int m_stack_size = 5;
  int m_image_width = 1912;
  int m_image_height = 1192;
  int m_number_of_channels = 3;

  int original_img_size = sizeof(float) * m_number_of_channels * m_image_width * m_image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_images;

  cudaMalloc(&m_device_original_images, sizeof(float *) * m_stack_size);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * m_stack_size);

  checkCuda(cudaMallocHost(&m_host_images, sizeof(float *) * m_stack_size));

  for(int idx=0;idx<m_stack_size;++idx)
  {
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);

    float * g;
    cudaMallocHost(&g, original_img_size);
    m_host_images[idx] = g;
  }

  cudaStream_t *m_streams = (cudaStream_t *)malloc(m_stack_size * sizeof(cudaStream_t));

  for(int ii=0;ii<m_stack_size;ii++){
    cudaStreamCreate(&m_streams[ii]);
  }
  // For loop to run the code any number of times
  for(int num_of_itr=0;num_of_itr<2;num_of_itr++) // If I make it run for 2 or 3 times, the code is getting crashed
  {
    float ** temp_device_original_pointer = (float **)malloc(sizeof(float *));

    for(int idx=0;idx<m_stack_size; idx++)
    {
      temp_device_original_pointer[idx] = (float *)malloc(sizeof(float));
      cudaMemcpy(&temp_device_original_pointer[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
    }
#if 0
    for(int idx =0;idx<m_stack_size;idx++)
    {
      memcpy(m_host_images[idx], stacks[idx].data, (sizeof(float) * m_image_width * m_image_height * m_number_of_channels));
    }
#endif
    for(int idx=0; idx < m_stack_size; idx++)
    {
      cudaMemcpyAsync(temp_device_original_pointer[idx], m_host_images[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyHostToDevice, m_streams[idx]);

      cudaMemcpyAsync(m_host_images[idx], temp_device_original_pointer[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyDeviceToHost, m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cudaStreamSynchronize(m_streams[idx]);
    }
#if 0
    for(int idx=0;idx<m_stack_size;++idx)
    {
      cv::Mat final_image(m_image_height, m_image_width, CV_32FC3,(void *) m_host_images[idx]);
      final_image.convertTo(final_image, CV_8UC3);

      final_image.release();
    }
#endif
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 0;
    }
  }

  for(int idx=0; idx < 5; ++idx)
  {
    float * a = (float *)malloc(sizeof(float));
    cudaMemcpy(&a, &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    float * b = (float *)malloc(sizeof(float));
    cudaMemcpy(&b, &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFreeHost(m_host_images[idx]);
  }

  return 0;

}
$ nvcc -o t2206 t2206.cu
$ compute-sanitizer ./t2206
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

Therefore I suspect that the problem has to do with your OpenCV usage, and I won't be able to help with that. OpenCV is not an NVIDIA product or software library.

I also note that “corrupted size vs. prev_size” is a message that may be reported by glibc. I don't think it has anything to do with CUDA per se; rather, it indicates a problem in host code.
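
As one way to localize host-side heap corruption like this (assuming valgrind is available; expect a number of harmless warnings from the CUDA driver itself), an out-of-bounds write to a malloc'd block is typically reported as an invalid write at the offending line:

$ valgrind ./t2206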