No performance improvement using CUDA stream

Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.7.1.8928
other

Host Machine Version
native Ubuntu 18.04
other

Hi,

I am experimenting with concurrent kernel execution using CUDA streams on the DRIVE AGX Xavier platform.

However, while time-profiling the code below, I observed no performance improvement (no concurrency) when using CUDA streams.

With CUDA streams:

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int N = 512;
    const int num_streams = 2;
    cudaStream_t streams[num_streams];
    float *data[num_streams];

    // Device memory allocation
    for (int i = 0; i < num_streams; i++) {
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // Create streams and launch kernels
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        // launch one worker kernel per stream
        kernel<<<1, N, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    // Wait for all work on the device to finish
    cudaDeviceSynchronize();
    return 0;
}


Without CUDA streams:

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int N = 512;
    const int num_kernels = 2;
    float *data[num_kernels];

    // Device memory allocation
    for (int i = 0; i < num_kernels; i++) {
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // Launch kernels back to back on the default stream
    for (int i = 0; i < num_kernels; i++) {
        kernel<<<1, N>>>(data[i], N);
    }

    // Wait for all work on the device to finish
    cudaDeviceSynchronize();
    return 0;
}

I referred to the following link for experimentation:
https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Although the kernels have been distributed across separate streams, all the processing appears to be serialized. Could you let me know whether the above flow is correct w.r.t. the usage of CUDA streams? If not, could you point me to how to achieve parallelism using CUDA streams?

Dear @jyotsna.patil,
Could you check whether Programming Guide :: CUDA Toolkit Documentation helps.

Hi @SivaRamaKrishnaNV,

I checked the above-mentioned link for the CUDA Toolkit.
However, section 3.2.5.5 (Streams) mentions that even though we create a separate stream for each kernel and launch them in parallel, concurrent behavior is not guaranteed; it is undefined.

Do CUDA streams only help to overlap data transfers with kernel execution, and not the concurrent execution of kernels?

Hi @SivaRamaKrishnaNV, I also checked the device properties using the deviceQuery sample, and it shows that only one copy engine is available (refer to the snapshot below). Does this mean the device doesn't support overlapping H2D/D2H copies with kernel execution?

[Snapshot: deviceQuery output showing a single copy engine]

Dear @jyotsna.patil,
It means only one copy can be overlapped with kernel execution.
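As a rough illustration, here is a minimal sketch of that single overlap: a copy on one stream while a kernel runs on another. It reuses the kernel and N from the snippets above; the stream and buffer names are illustrative, and pinned host memory is needed for the copy to be truly asynchronous.

// Sketch: overlap one H2D copy with kernel execution (single copy engine).
cudaStream_t copyStream, execStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&execStream);

float *h_buf, *d_buf, *d_work;
cudaMallocHost(&h_buf, N * sizeof(float));   // pinned host memory
cudaMalloc(&d_buf, N * sizeof(float));
cudaMalloc(&d_work, N * sizeof(float));

// The copy engine services this transfer on copyStream...
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, copyStream);

// ...while the SMs execute this kernel on execStream.
kernel<<<1, N, 0, execStream>>>(d_work, N);

cudaDeviceSynchronize();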

Two kernels can execute in parallel only if there are enough resources available (e.g., free SMs on the GPU for executing the CUDA kernel blocks).

You may move the cudaStreamCreate() calls outside the for loop and remove the kernel<<<1, 1>>>(0, 0) launch from the code snippet, then re-run the experiment; see the sketch below.
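For reference, a minimal sketch of that restructuring, reusing the arrays from your snippet (stream creation hoisted out of the launch loop, dummy default-stream launch removed):

// Create all streams up front, outside the launch loop.
for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
}

// Launch one worker kernel per stream, with no default-stream launch in
// between (the legacy default stream synchronizes with all other streams).
for (int i = 0; i < num_streams; i++) {
    kernel<<<1, N, 0, streams[i]>>>(data[i], N);
}

cudaDeviceSynchronize();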

Also, please share the complete deviceQuery output.

Hi @SivaRamaKrishnaNV,

I experimented as suggested: moved the cudaStreamCreate() calls outside the for loop and removed kernel<<<1, 1>>>(0, 0) from the code snippet.

However, this didn't show any performance improvement.
deviceQuery_output.txt (2.3 KB)

Please find the attached deviceQuery output.

Dear @jyotsna.patil,
Could you increase the value of N? It is possible that the kernel launch overhead is greater than the kernel execution time. In such cases, the first kernel finishes before the second kernel launch is even issued. A sketch of a heavier workload follows.
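For example, a heavier per-stream workload could look like this (values illustrative; the grid-stride loop in the kernel above already handles N larger than the grid, and the streams/data arrays are the ones from the earlier snippet):

const int N = 1 << 20;        // ~1M elements per stream instead of 512
const int threads = 256;
const int blocks = 64;        // modest grid so two kernels can share the SMs

for (int i = 0; i < num_streams; i++) {
    cudaMalloc(&data[i], N * sizeof(float));
}

for (int i = 0; i < num_streams; i++) {
    kernel<<<blocks, threads, 0, streams[i]>>>(data[i], N);
}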

Hi @SivaRamaKrishnaNV,

I experimented with a higher value of N. However, the profiled times are similar in both scenarios. I'm using the C++ chrono library APIs to profile the application.

Could you suggest a better tool to profile/visualize CUDA stream behavior?

Dear @jyotsna.patil,
You can use NVIDIA Nsight Systems to visualize application behavior. Also, we recommend using CUDA events to measure timing in the CUDA sample.
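A minimal sketch of event-based timing around the launch loop (event names illustrative; reuses the streams/data/N from the snippets above, and printf needs <cstdio>):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);       // recorded on the default stream
for (int i = 0; i < num_streams; i++) {
    kernel<<<1, N, 0, streams[i]>>>(data[i], N);
}
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // block the host until stop has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Elapsed time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);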

Hi @SivaRamaKrishnaNV,

I profiled the application using CUDA events (create, record, synchronize) and there is not much difference in the profiled results.

Also, I was trying to use the NVIDIA Visual Profiler tool to get the timeline view of the CUDA application (on a host machine with compute capability 7.5), and I observed a version compatibility issue with the tool. I have attached a snapshot for reference. Could you let me know if any additional steps are required to configure the Visual Profiler, and whether the versions mentioned below are compatible?

[Snapshot: nvvp version compatibility error]

Referred link for the Visual Profiler tool:

  • Linux host machine
  • NVIDIA Visual Profiler version: 9.1
  • Compute capability: 7.5
  • nvprof --version: Release version 10.2.311 (21)

Dear @jyotsna.patil,
Could you attach your complete sample (.cu file) so we can reproduce the timing issue?

Dear @jyotsna.patil,
Could you provide an update?