Concurrent Kernel executions & Data Transfers

Hi, I am trying to do some image corrections on the GPU for N images. There are 3 kernels that need to be invoked for each image; let's say I have 3 kernels (A, B, C). For a particular image, I first have to invoke kernel A, then B, then C. These kernel executions must be ordered, i.e. execution of kernel A should be completed before B starts, and similarly execution of B should be completed before C starts.
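
As background: kernel launches issued into the same CUDA stream execute in issue order, so the A-before-B-before-C requirement is satisfied automatically as long as all three kernels for a given image are launched into one stream. A minimal sketch of that guarantee (kernel names, bodies, and arguments are hypothetical stand-ins):

#include <cuda_runtime.h>

__global__ void A(float *img, int n) { /* hypothetical correction step 1 */ }
__global__ void B(float *img, int n) { /* hypothetical correction step 2 */ }
__global__ void C(float *img, int n) { /* hypothetical correction step 3 */ }

int main()
{
  const int n = 3 * 1912 * 1192;      // one RGB float image, sizes from the post
  float *d_img;
  cudaStream_t stream;
  cudaMalloc(&d_img, n * sizeof(float));
  cudaStreamCreate(&stream);

  // Launches in one stream are serialized in issue order:
  // A completes before B starts, and B completes before C starts.
  A<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
  B<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
  C<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaFree(d_img);
  return 0;
}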
I am completely new to CUDA programming, and I have started with the approach below.

Kernels - A,B,C
Images - IMG1, IMG2, IMG3, IMG4, IMG5
I am following the below sequence for the executions
(IMG1 HostToDevice → IMG1 A → IMG1 B → IMG1 C → IMG1 DeviceToHost → IMG2 HostToDevice → IMG2 A → IMG2 B → IMG2 C → IMG2 DeviceToHost, and so on…)

All the above operations are synchronous, and the whole pipeline takes a lot of time because of this. When I searched for concurrent kernel executions and transfers, I came to know about streams, and I have tried to use CUDA streams to get concurrent kernel executions and data transfers. But I didn't see any difference in the timings before and after using streams. Here is the code snippet I am using:

int main()
{
  std::vector<cv::Mat> images; // assume that it has 5 images
  int number_of_channels = 3;
  int image_width = 1912;
  int image_height = 1192;
  int original_img_size = sizeof(float) * number_of_channels * image_width * image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_processed_images;

  // Allocate memory in the CPU
  m_host_processed_images = (float **)malloc(sizeof(float *) * 5);
  
  // Allocate memory in the GPU
  cudaMalloc(&m_device_original_images, sizeof(float *) * 5);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * 5);

  for(int idx=0;idx<5;++idx)
  {
    m_host_processed_images[idx] = (float *)malloc(original_img_size);
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);
  }

    int nStreams = 5;    // preparation for streams
    cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
    for(int ii=0;ii<nStreams;ii++){
        cudaStreamCreate(&streams[ii]);
    }

    // copy image data from cpu to gpu asynchronously for all the images
    float ** temp_cpu_pointer_original_images = (float **)malloc(sizeof(float *));
    float ** temp_cpu_pointer_processed_images = (float **)malloc(sizeof(float *));

      //  Set up block and grid dimensions
      dim3 threads_per_block(16, 16);
      dim3 blocks_per_grid((int)ceil((image_width)/16) + 1,
                          (int)ceil((image_height)/16) + 1);

    for(int idx=0;idx<5; ++idx)
    {
      temp_cpu_pointer_original_images[idx] = (float *)malloc(sizeof(float));
      temp_cpu_pointer_processed_images[idx] = (float *)malloc(sizeof(float));

      cudaMemcpy(&temp_cpu_pointer_original_images[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
      cudaMemcpy(&temp_cpu_pointer_processed_images[idx], &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    }

    for(int idx=0; idx < 5; ++idx)
    {
      cudaMemcpyAsync(temp_cpu_pointer_original_images[idx], images[idx].data,
                      (sizeof(float) * image_width * image_height * number_of_channels),
                      cudaMemcpyHostToDevice, streams[idx]);
      // Invoke Kernels
      A<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
      B<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
      C<<<blocks_per_grid, threads_per_block, 0, streams[idx]>>>(/* args elided */);
     
      cudaMemcpyAsync(m_host_processed_images[idx], temp_cpu_pointer_processed_images[idx],
                (sizeof(float) * image_width * image_height * number_of_channels),
                cudaMemcpyDeviceToHost, streams[idx]);
    }

    for(int idx=0;idx<5;++idx)
    {
      cudaStreamSynchronize(streams[idx]);
    }

    for(int idx=0;idx<5;++idx)
    {
       cv::Mat final_image(image_height, image_width, CV_8UC3,(void *) m_host_processed_images[idx]);
      images[idx] = final_image;
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }

}

With the above code snippet, I am not seeing any difference in the timings before and after using streams. Please let me know if I am doing anything wrong, or whether there is another way I can do these executions (something like a change in how I am storing or transferring the images) so that the overall time will be lower. Thanks in advance!

Hardware being used: GeForce 3060

Using “concurrent kernel execution” won't help with throughput if each kernel already saturates the GPU (uses up its compute resources). For reasonably sized images, this is likely to be the case.

Using streams does not automatically guarantee better performance than not using them.

To overlap data transfer with kernel execution (cudaMemcpyAsync), you generally must be transferring to or from pinned memory. I don't see any use of pinned memory in your posted code.
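
A minimal sketch of the pinned-memory variant (buffer names are hypothetical; the image dimensions are taken from your code):

#include <cuda_runtime.h>

int main()
{
  const size_t bytes = sizeof(float) * 3 * 1912 * 1192;
  float *h_buf, *d_buf;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMalloc(&d_buf, bytes);

  // Pageable host memory (plain malloc) would force cudaMemcpyAsync to
  // behave like a blocking copy; pinned (page-locked) memory allows the
  // copy to run truly asynchronously and overlap work in other streams.
  cudaMallocHost(&h_buf, bytes);

  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  cudaStreamDestroy(stream);
  return 0;
}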

You can get an orderly treatment of this topic by studying section 7 of this online CUDA training series. That section and its homework demonstrate an application that overlaps copy and compute. Speaking for myself, I'm unlikely to provide a detailed tutorial here on topics that are already covered there.

Furthermore, you can find numerous questions on various forums that cover copy/compute overlap; many of them point out the need for pinned memory, as sketched below.
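
Putting the two together, the usual copy/compute overlap pattern looks roughly like this (one stream per image; the process kernel and the sizes are hypothetical stand-ins):

#include <cuda_runtime.h>

__global__ void process(float *img, int n) { /* hypothetical kernel body */ }

int main()
{
  const int num_images = 5;
  const int n = 3 * 1912 * 1192;
  const size_t bytes = sizeof(float) * n;

  cudaStream_t streams[num_images];
  float *h_in[num_images], *h_out[num_images], *d_img[num_images];

  for (int i = 0; i < num_images; ++i) {
    cudaStreamCreate(&streams[i]);
    cudaMallocHost(&h_in[i], bytes);   // pinned host staging buffers
    cudaMallocHost(&h_out[i], bytes);
    cudaMalloc(&d_img[i], bytes);
  }

  // Each image's H2D copy -> kernel -> D2H copy chain is ordered within
  // its own stream; chains in different streams are free to overlap.
  for (int i = 0; i < num_images; ++i) {
    cudaMemcpyAsync(d_img[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
    process<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_img[i], n);
    cudaMemcpyAsync(h_out[i], d_img[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < num_images; ++i) {
    cudaFree(d_img[i]);
    cudaFreeHost(h_in[i]);
    cudaFreeHost(h_out[i]);
    cudaStreamDestroy(streams[i]);
  }
  return 0;
}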

To verify your own work in this area (concurrency), you should get familiar with a profiler such as Nsight Systems.
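
For example (assuming Nsight Systems is installed; ./my_app stands in for your binary), the timeline it records shows whether copies and kernels actually overlap:

$ nsys profile -o report ./my_app
$ nsys stats report.nsys-rep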

Thanks @Robert_Crovella for the help. I tried using pinned memory for the overlapped data transfers. The code below works in such a way that I allocate the memory on the GPU only once, and then in a loop I can do the transfers any number of times.
If I iterate once, the code works perfectly fine. But if I iterate more than once (2 or 3 times), my code crashes with the output below:
“corrupted size vs. prev_size” Aborted

Please help me understand what I am doing wrong.
Also, one thing I noticed: when I removed the memory allocation for the variable “m_device_processed_images” (I am not using this allocated memory for now, but I need it for other kernel executions), the code worked perfectly fine for any number of iterations.
Please find the code snippet below:

//
// Created by adminspin on 14/6/22.
//
#include <bits/stdc++.h>
#include <opencv2/opencv.hpp>
#include <cuda.h>
#include <chrono>
#include <thread>

inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", 
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}


int main(int argc, char** argv)
{
  
  std::vector<cv::Mat> stacks; // assume it has 5 images in it

  int m_stack_size = 5;
  int m_image_width = 1912;
  int m_image_height = 1192;
  int m_number_of_channels = 3;

  int original_img_size = sizeof(float) * m_number_of_channels * m_image_width * m_image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_images;

  cudaMalloc(&m_device_original_images, sizeof(float *) * m_stack_size);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * m_stack_size);

  checkCuda(cudaMallocHost(&m_host_images, sizeof(float *) * m_stack_size));

  for(int idx=0;idx<m_stack_size;++idx)
  {
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);

    float * g;
    cudaMallocHost(&g, original_img_size);
    m_host_images[idx] = g;
  }

  cudaStream_t *m_streams = (cudaStream_t *)malloc(m_stack_size * sizeof(cudaStream_t));
  
  for(int ii=0;ii<m_stack_size;ii++){
    cudaStreamCreate(&m_streams[ii]);
  }
  // For loop to run the code any number of times 
  for(int num_of_itr=0;num_of_itr<1;num_of_itr++) // If I make it run 2 or 3 times, the code crashes
  {
    float ** temp_device_original_pointer = (float **)malloc(sizeof(float *));

    for(int idx=0;idx<m_stack_size; idx++)
    {
      temp_device_original_pointer[idx] = (float *)malloc(sizeof(float));
      cudaMemcpy(&temp_device_original_pointer[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
    }

    for(int idx =0;idx<m_stack_size;idx++)
    {
      memcpy(m_host_images[idx], stacks[idx].data, (sizeof(float) * m_image_width * m_image_height * m_number_of_channels));
    }

    for(int idx=0; idx < m_stack_size; idx++)
    {
      cudaMemcpyAsync(temp_device_original_pointer[idx], m_host_images[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyHostToDevice, m_streams[idx]);

      cudaMemcpyAsync(m_host_images[idx], temp_device_original_pointer[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyDeviceToHost, m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cudaStreamSynchronize(m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cv::Mat final_image(m_image_height, m_image_width, CV_32FC3,(void *) m_host_images[idx]);
      final_image.convertTo(final_image, CV_8UC3);

      final_image.release();
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 0;
    }
  }

  for(int idx=0; idx < 5; ++idx)
  {
    float * a = (float *)malloc(sizeof(float));
    cudaMemcpy(&a, &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    float * b = (float *)malloc(sizeof(float));
    cudaMemcpy(&b, &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFreeHost(m_host_images[idx]);
  }

  return 0;

}

When I run the following adaptation of your code, which removes the OpenCV dependencies, I don't have any trouble with it:

$ cat t2206.cu
#include <cuda.h>
#include <chrono>
#include <thread>
#include <vector>
#include <iostream>

inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n",
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}


int main(int argc, char** argv)
{

//  std::vector<cv::Mat> stacks; // assume it has 5 images in it

  int m_stack_size = 5;
  int m_image_width = 1912;
  int m_image_height = 1192;
  int m_number_of_channels = 3;

  int original_img_size = sizeof(float) * m_number_of_channels * m_image_width * m_image_height;

  float ** m_device_original_images;
  float ** m_device_processed_images;
  float ** m_host_images;

  cudaMalloc(&m_device_original_images, sizeof(float *) * m_stack_size);
  cudaMalloc(&m_device_processed_images, sizeof(float *) * m_stack_size);

  checkCuda(cudaMallocHost(&m_host_images, sizeof(float *) * m_stack_size));

  for(int idx=0;idx<m_stack_size;++idx)
  {
    float * a;
    cudaMalloc(&a, original_img_size);
    cudaMemcpy(&m_device_original_images[idx], &a, sizeof(float *), cudaMemcpyHostToDevice);

    float * b;
    cudaMalloc(&b, original_img_size);
    cudaMemcpy(&m_device_processed_images[idx], &b, sizeof(float *), cudaMemcpyHostToDevice);

    float * g;
    cudaMallocHost(&g, original_img_size);
    m_host_images[idx] = g;
  }

  cudaStream_t *m_streams = (cudaStream_t *)malloc(m_stack_size * sizeof(cudaStream_t));

  for(int ii=0;ii<m_stack_size;ii++){
    cudaStreamCreate(&m_streams[ii]);
  }
  // For loop to run the code any number of times
  for(int num_of_itr=0;num_of_itr<2;num_of_itr++) // If I make it run for 2 or 3 times, the code is getting crashed
  {
    float ** temp_device_original_pointer = (float **)malloc(sizeof(float *));

    for(int idx=0;idx<m_stack_size; idx++)
    {
      temp_device_original_pointer[idx] = (float *)malloc(sizeof(float));
      cudaMemcpy(&temp_device_original_pointer[idx], &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);
    }
#if 0
    for(int idx =0;idx<m_stack_size;idx++)
    {
      memcpy(m_host_images[idx], stacks[idx].data, (sizeof(float) * m_image_width * m_image_height * m_number_of_channels));
    }
#endif
    for(int idx=0; idx < m_stack_size; idx++)
    {
      cudaMemcpyAsync(temp_device_original_pointer[idx], m_host_images[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyHostToDevice, m_streams[idx]);

      cudaMemcpyAsync(m_host_images[idx], temp_device_original_pointer[idx],
                      (sizeof(float) * m_image_width * m_image_height * m_number_of_channels),
                      cudaMemcpyDeviceToHost, m_streams[idx]);
    }

    for(int idx=0;idx<m_stack_size;++idx)
    {
      cudaStreamSynchronize(m_streams[idx]);
    }
#if 0
    for(int idx=0;idx<m_stack_size;++idx)
    {
      cv::Mat final_image(m_image_height, m_image_width, CV_32FC3,(void *) m_host_images[idx]);
      final_image.convertTo(final_image, CV_8UC3);

      final_image.release();
    }
#endif
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
        return 0;
    }
  }

  for(int idx=0; idx < 5; ++idx)
  {
    float * a = (float *)malloc(sizeof(float));
    cudaMemcpy(&a, &m_device_original_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    float * b = (float *)malloc(sizeof(float));
    cudaMemcpy(&b, &m_device_processed_images[idx], sizeof(float *), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFreeHost(m_host_images[idx]);
  }

  return 0;

}
$ nvcc -o t2206 t2206.cu
$ compute-sanitizer ./t2206
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

Therefore I suspect that the problem has to do with your OpenCV usage, and I won't be able to help with that. OpenCV is not an NVIDIA product or software library.

I also note that “corrupted size vs. prev_size” is a message that may be reported by glibc. I don't think it has anything to do with CUDA per se; rather, it indicates a problem in host code.
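
As one way to localize host-side heap corruption like this (assuming valgrind is available; expect a number of harmless warnings from the CUDA driver itself), an out-of-bounds write to a malloc'd block is typically reported as an invalid write at the offending line:

$ valgrind ./t2206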