Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other
Target Operating System
Linux
QNX
other
Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other
SDK Manager Version
1.7.1.8928
other
Host Machine Version
native Ubuntu 18.04
other
Hi,
I am experimenting with concurrent kernel execution using CUDA streams on the DRIVE AGX Xavier platform.
However, when time-profiling the code below, I observe no performance improvement (no concurrency) from using CUDA streams.
With CUDA streams:
__global__ void kernel(float *x, int n)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
x[i] = sqrt(pow(3.14159,i));
}
}
int main()
{
const int N = 512;
const int num_streams = 2;
cudaStream_t streams[num_streams];
float *data[num_streams];
// Device memory allocation
for (int i = 0; i < num_streams; i++) {
cudaMalloc(&data[i], N * sizeof(float));
}
// Launch kernels using CUDA streams
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
// launch one worker kernel per stream
kernel<<<1, N, 0, streams[i]>>>(data[i], N);
// launch a dummy kernel on the default stream
kernel<<<1, 1>>>(0, 0);
}
// Wait for all kernels to finish
cudaDeviceSynchronize();
return 0;
}
Without CUDA streams:
__global__ void kernel(float *x, int n)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
x[i] = sqrt(pow(3.14159,i));
}
}
int main()
{
const int N = 512;
const int num_kernels = 2;
float *data[num_kernels];
// Device memory allocation
for (int i = 0; i < num_kernels; i++) {
cudaMalloc(&data[i], N * sizeof(float));
}
// Launch kernels sequentially on the default stream
for (int i = 0; i < num_kernels; i++) {
// launch one worker kernel per stream
kernel<<<1, N>>>(data[i], N);
}
// Wait for all kernels to finish
cudaDeviceSynchronize();
return 0;
}
I referred to the following post for this experiment:
https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
Although the kernels are launched on separate streams, all the processing appears to be serialized. Could you let me know whether the above flow is correct with respect to CUDA stream usage? If not, could you point me to how to achieve concurrency using CUDA streams?
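For reference, my understanding from the blog post is that kernels issued to the legacy default stream synchronize with work in all other streams, so interleaving a default-stream launch between the per-stream launches would force serialization. A minimal stream-only sketch of what I expect to overlap (no default-stream kernel in between; `N` enlarged so each kernel runs long enough for the profiler to show overlap) would be:

```cuda
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        x[i] = sqrt(pow(3.14159, i));
}

int main()
{
    const int N = 1 << 20;   // large enough that each kernel is visible in the timeline
    const int num_streams = 2;
    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // One worker kernel per stream, with no default-stream launches in
    // between, so the kernels are free to run concurrently.
    for (int i = 0; i < num_streams; i++)
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

    cudaDeviceSynchronize();

    for (int i = 0; i < num_streams; i++) {
        cudaFree(data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Is this the intended pattern? (The blog also mentions compiling with `nvcc --default-stream per-thread` as an alternative way to avoid the legacy default stream's implicit synchronization.)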