No performance improvement using CUDA stream

Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.7.1.8928
other

Host Machine Version
native Ubuntu 18.04
other

Hi,

I am experimenting with concurrent kernel execution using CUDA streams on the DRIVE AGX Xavier platform.

However, while time-profiling the code below, I observed no performance improvement (no concurrency) when using CUDA streams.

With CUDA streams:

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int N = 512;
    const int num_streams = 2;
    cudaStream_t streams[num_streams];
    float *data[num_streams];

    // Device memory allocation
    for (int i = 0; i < num_streams; i++) {
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // Create streams and launch kernels
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        // launch one worker kernel per stream
        kernel<<<1, N, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    // Wait for all work on the device to finish
    cudaDeviceSynchronize();
    return 0;
}


Without CUDA streams:

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int N = 512;
    const int num_kernels = 2;
    float *data[num_kernels];

    // Device memory allocation
    for (int i = 0; i < num_kernels; i++) {
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // Launch kernels back to back on the default stream
    for (int i = 0; i < num_kernels; i++) {
        kernel<<<1, N>>>(data[i], N);
    }

    // Wait for all work on the device to finish
    cudaDeviceSynchronize();
    return 0;
}

I referred to the following link for experimentation:
https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Although the kernels have been distributed across separate streams, all the processing appears to be serialized. Could you let me know whether the above flow is correct w.r.t. the usage of CUDA streams? If not, could you point me to how to achieve parallelism using CUDA streams?

Dear @jyotsna.patil,
Could you check whether Programming Guide :: CUDA Toolkit Documentation helps.

Hi @SivaRamaKrishnaNV,

I checked the above-mentioned link for the CUDA Toolkit.
However, section 3.2.5.5 (Streams) mentions that even though we create a separate stream for each kernel and launch them in parallel, concurrent behavior is not guaranteed; it is undefined.

Do CUDA streams only help to overlap data transfers with kernel execution, and not the concurrent execution of kernels?

Hi @SivaRamaKrishnaNV, I also checked the device properties using the deviceQuery sample, and it shows that only one copy engine is available (refer to the snapshot below). Does this mean the device doesn't support overlapping H2D/D2H copies with kernel execution?

[Snapshot: deviceQuery output showing a single copy engine]

Dear @jyotsna.patil,
It means only one copy can be overlapped with kernel execution.
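As a rough illustration, here is a minimal sketch of that single overlap: a copy on one stream while a kernel runs on another. It reuses the kernel and N from the snippets above; the stream and buffer names are illustrative, and pinned host memory is needed for the copy to be truly asynchronous.

// Sketch: overlap one H2D copy with kernel execution (single copy engine).
cudaStream_t copyStream, execStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&execStream);

float *h_buf, *d_buf, *d_work;
cudaMallocHost(&h_buf, N * sizeof(float));   // pinned host memory
cudaMalloc(&d_buf, N * sizeof(float));
cudaMalloc(&d_work, N * sizeof(float));

// The copy engine services this transfer on copyStream...
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, copyStream);

// ...while the SMs execute this kernel on execStream.
kernel<<<1, N, 0, execStream>>>(d_work, N);

cudaDeviceSynchronize();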

Two kernels can execute in parallel only if there are enough resources available (e.g., free SMs on the GPU for executing the CUDA kernel blocks).

You may move the cudaStreamCreate() calls outside the for loop and remove the kernel<<<1, 1>>>(0, 0) launch from the code snippet, then re-run the experiment; see the sketch below.
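For reference, a minimal sketch of that restructuring, reusing the arrays from your snippet (stream creation hoisted out of the launch loop, dummy default-stream launch removed):

// Create all streams up front, outside the launch loop.
for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
}

// Launch one worker kernel per stream, with no default-stream launch in
// between (the legacy default stream synchronizes with all other streams).
for (int i = 0; i < num_streams; i++) {
    kernel<<<1, N, 0, streams[i]>>>(data[i], N);
}

cudaDeviceSynchronize();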

Also, please share the complete deviceQuery output.

Hi @SivaRamaKrishnaNV,

I experimented as suggested: moved the cudaStreamCreate() calls outside the for loop and removed kernel<<<1, 1>>>(0, 0) from the code snippet.

However, this didn't show any performance improvement.
deviceQuery_output.txt (2.3 KB)

Please find the attached deviceQuery output.

Dear @jyotsna.patil,
Could you increase the value of N? It is possible that the kernel launch overhead is greater than the kernel execution time. In such cases, the first kernel finishes before the second kernel launch is even issued. A sketch of a heavier workload follows.
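For example, a heavier per-stream workload could look like this (values illustrative; the grid-stride loop in the kernel above already handles N larger than the grid, and the streams/data arrays are the ones from the earlier snippet):

const int N = 1 << 20;        // ~1M elements per stream instead of 512
const int threads = 256;
const int blocks = 64;        // modest grid so two kernels can share the SMs

for (int i = 0; i < num_streams; i++) {
    cudaMalloc(&data[i], N * sizeof(float));
}

for (int i = 0; i < num_streams; i++) {
    kernel<<<blocks, threads, 0, streams[i]>>>(data[i], N);
}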

Hi @SivaRamaKrishnaNV,

I experimented with a higher value of N. However, the profiled times are similar in both scenarios. I'm using the C++ chrono library APIs to profile the application.

Could you suggest a better tool to profile/visualize CUDA stream behavior?

Dear @jyotsna.patil,
You can use NVIDIA Nsight Systems to visualize application behavior. Also, we recommend using CUDA events to measure timing in the CUDA sample.
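A minimal sketch of event-based timing around the launch loop (event names illustrative; reuses the streams/data/N from the snippets above, and printf needs <cstdio>):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);       // recorded on the default stream
for (int i = 0; i < num_streams; i++) {
    kernel<<<1, N, 0, streams[i]>>>(data[i], N);
}
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // block the host until stop has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Elapsed time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);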

Hi @SivaRamaKrishnaNV,

I profiled the application using CUDA events (create, record, synchronize) and there is not much difference in the profiled results.

Also, I was trying to use the NVIDIA Visual Profiler tool to get the timeline view of the CUDA application (on a host machine with compute capability 7.5), and I observed a version compatibility issue with the tool. I have attached a snapshot for reference. Could you let me know if any additional steps are required to configure the Visual Profiler, and whether the versions mentioned below are compatible?

[Snapshot: nvvp version compatibility error]

Referred link for the Visual Profiler tool:

  • Linux host machine
  • NVIDIA Visual Profiler version: 9.1
  • Compute capability: 7.5
  • nvprof --version: Release version 10.2.311 (21)

Dear @jyotsna.patil,
Could you attach your complete sample (.cu file) so we can reproduce the timing issue?

Dear @jyotsna.patil,
Could you provide an update?