Nvidia Visual Profiler Not accurate in timing

manuel.lopez · July 29, 2022, 1:51pm

Hello everyone, I'm new to CUDA and I'm seeing a problem with the execution times displayed by the NVIDIA Visual Profiler.

My timing results suggest that the kernels are running in parallel, even though the Visual Profiler does not show it.

The code is the following. I am developing on Linux (Ubuntu 20.04).

#include <cstdio>
#include <cstdlib>
#include "helper_cuda.h"

// Each launch uses a single block; idx selects the slice of data it processes.
__global__ void do_work(double *data, int N, int idx) {
    int i = blockIdx.x * blockDim.x + blockDim.x * idx + threadIdx.x;
    if (i < N) {
        for (int j = 0; j < 20000; j++) {
            data[i] = cos(data[i]);
            data[i] = sqrt(fabs(data[i]));
        }
    }
}

int main()
{
    const int nblocks = 30;
    const int blocksize = 1024;
    double *data;
    checkCudaErrors(cudaMalloc((void **)&data, nblocks * blocksize * sizeof(double)));

    float time;
    cudaEvent_t start, stop;
    checkCudaErrors(cudaEventCreate(&start));
    checkCudaErrors(cudaEventCreate(&stop));

    // Part 1: all kernels are launched into the default stream.
    checkCudaErrors(cudaEventRecord(start, 0));
    dim3 dimBlock(blocksize, 1, 1);
    dim3 dimGrid(1, 1, 1);
    for (int i = 0; i < nblocks; i++)
        do_work<<<dimGrid, dimBlock>>>(data, nblocks * blocksize, i);
    checkCudaErrors(cudaEventRecord(stop, 0));
    checkCudaErrors(cudaEventSynchronize(stop));
    checkCudaErrors(cudaEventElapsedTime(&time, start, stop));
    printf("Serialised time:  %g ms\n", time);

    // Part 2: one stream per kernel launch.
    cudaStream_t streams[nblocks];
    for (int i = 0; i < nblocks; i++)
        checkCudaErrors(cudaStreamCreate(&streams[i]));

    checkCudaErrors(cudaEventRecord(start, 0));
    checkCudaErrors(cudaEventSynchronize(start));
    for (int i = 0; i < nblocks; i++)
        do_work<<<dimGrid, dimBlock, 0, streams[i]>>>(data, nblocks * blocksize, i);
    checkCudaErrors(cudaEventRecord(stop, 0));
    checkCudaErrors(cudaEventSynchronize(stop));
    checkCudaErrors(cudaEventElapsedTime(&time, start, stop));
    printf("Multi-stream parallel time:  %g ms\n", time);

    for (int i = 0; i < nblocks; i++)
        checkCudaErrors(cudaStreamDestroy(streams[i]));

    checkCudaErrors(cudaFree(data));
    return EXIT_SUCCESS;
}
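
As a side check on the indexing, here is a minimal host-only sketch (plain C++, which also compiles under nvcc) that models the index computed in do_work, assuming the expression is blockIdx.x * blockDim.x + blockDim.x * idx + threadIdx.x with a one-block grid. The helper names global_index and coverage are made up for illustration and are not part of the posted program. It suggests that each of the 30 launches works on its own 1024-element slice, so there is no data dependency between launches that would force them to run one after another.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the index computed in do_work.
// With a 1-block grid, block_idx is always 0, so i = block_dim * idx + thread_idx.
int global_index(int block_idx, int block_dim, int idx, int thread_idx) {
    return block_idx * block_dim + block_dim * idx + thread_idx;
}

// Counts how many times each element would be touched across all launches
// (one launch per value of idx, one block of `blocksize` threads per launch).
std::vector<int> coverage(int nblocks, int blocksize) {
    assert(nblocks > 0 && blocksize > 0);
    const int n = nblocks * blocksize;
    std::vector<int> hits(n, 0);
    for (int idx = 0; idx < nblocks; ++idx) {
        for (int t = 0; t < blocksize; ++t) {
            int i = global_index(0, blocksize, idx, t);
            if (i < n) {          // same bounds check as the kernel
                ++hits[i];
            }
        }
    }
    return hits;
}
```

Calling coverage(30, 1024) should report exactly one hit for every one of the 30720 elements, i.e. the slices are disjoint and together cover the whole array.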

The output follows:

(base) manuel@manuel-DT:~/eclipse-workspace/kernel_ocerlap$ ./kernel_overlap
Serialised time: 3856.09 ms
Multi-stream parallel time: 259.779 ms

As can be seen, in the first part 30 kernels are launched in the same stream, and the measured serialized time is 3856 ms.
However, when 30 streams are created and one kernel is launched in each, the total "parallel" time is 259.78 ms.

Since the multi-stream run is much shorter than the serialized run, at least some of the kernels must be executing simultaneously.
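
To put numbers on that reasoning, here is a small host-only sketch (plain C++, also valid under nvcc) that works through the arithmetic with the two times printed above. The function names are made up for illustration, and the calculation assumes all 30 kernels take roughly the same time when run back to back.

```cpp
#include <cassert>

// Average duration of one kernel when `launches` kernels run back to back.
double per_kernel_ms(double serial_total_ms, int launches) {
    assert(launches > 0);
    return serial_total_ms / launches;
}

// How many times faster the multi-stream run finished than the serialized run.
double speedup(double serial_total_ms, double parallel_total_ms) {
    assert(parallel_total_ms > 0.0);
    return serial_total_ms / parallel_total_ms;
}

// Multi-stream wall time expressed in units of one serialized kernel duration.
double parallel_in_kernel_units(double serial_total_ms,
                                double parallel_total_ms,
                                int launches) {
    return parallel_total_ms / per_kernel_ms(serial_total_ms, launches);
}
```

With the measured values, per_kernel_ms(3856.09, 30) is about 128.5 ms, speedup(3856.09, 259.779) is about 14.8, and parallel_in_kernel_units(3856.09, 259.779, 30) is about 2.0. In other words, all 30 launches finished in roughly the time of two serialized kernels, which would not be possible without overlap.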

But nvprof shows that the kernels, even though they are launched on different streams, are not running in parallel.

The command used for the analysis:
nvprof --analysis-metrics -o nbody-analysis4.nvprof ./kernel_overlap

Can anyone please give me an indication of how to make nvprof show the real execution times?
