Visual Profiler displays erroneous output with multiple GPUs Profiler problem on multi-gpu scaling b

mig99d · May 9, 2012, 3:15pm

Hi all,

I’m currently doing some research on automatic parallelization using CUDA, and I’ve been trying to implement some ideas manually into a test program to later integrate into a compiler. I’ve recently hit some kind of performance bottleneck using multiple GPUs that I unfortunately am having problem investigate due to some problem in the Visual Profiler tools.

First, I’ll describe my test program. It is a simple stencil computation, I have a matrix A which gets updated through several iterations. On each iteration, each value in the middle of matrix A is set to the average of its direct neighbors, as in
A’[x,y] = 0.25*(A[x+1,y] + A[x-1,y] + A[x,y+1] + A[x,y-1])
and the edges are just copied.

The matrix computation is divided “horizontally” equally among some number of gpus (i have access to a machine with 8 GTX 580 3GB GPUs). Memory for the full matrix is allocated on each device, and I’m reusing memory across iterations to avoid cudaMalloc/free overheads.

Each iteration is computed using 5 kernels, one to do the average in the middle, and the 4 others to copy the edges (top, left, right, bottom). To compute the inside of the matrix, each GPU needs therefore to copy one row of data from each of its neighbors, that is one copy on the first and last GPU, and two copies for any others. To avoid having the GPUs wait on each other for the missing data, I’m placing the copy of the missing rows and computation of corresponding values in the same stream, different from the rest. So, in order to compute the inside of the matrix there are 2 or 3 streams, 1 computing iteration space for which the data was available (most of it), and 1 or 2 others that copy a missing row and compute the corresponding values. I’m using cudamemcpyPeerAsync() for the copying. I also use cudaDeviceSynchronize() after each iteration.

With this program, I’m measuring execution times for the same size and number of iterations using a varying number of GPUs. I see significantly less than linear improvements by increasing the number of GPUs. Using 5 GPUs or more, the speedup is negligible. I’d like to figure out what the bottleneck is. I’m currently using a matrix of 500M single precision floating point units and 10000 iterations.

Sadly, Visual Profiler does not seem to be working with multiple GPUs. The timeline itself extends to 1000000s(!) even though the program runs in less than 3 minutes. I only see one massive cudamemcpyPeerAsync() block per device on the timeline, no matter how much I zoom in. Anywhere I click on it I see on the properties pane that its duration is all the way to the end of the massive block, but the start is very close to where I clicked. It’s as though the end of each cudamemcpyPeerAsync() instruction is not being recorded, and they’re all overlapping. The per-Thread API calls are mostly missing, and the kernel launches for each devices are very far apart in time from those of the other devices, even though they should be together since I’m synchronizing on each iteration. I’m attaching a screenshot to illustrate this. The low compute/memcpy overlap and low memcpy throughput are expected, but you can see that even there the total time is negative(!).
Interestingly, using the profiler with a single GPU shows expected output.

I’m using the cuda toolkit version 4.1 on a linux x86_64 host. The host is on a shared cluster for which I don’t have admin access, so upgrading to version 4.2 might be an issue. I would appreciate any feedback in either getting the profiler to work or insights into why my application is not scaling well.

Best regards,
Miguel

Topic		Replies	Views
Nvidia Visual Profiler Not accurate in timing Visual Profiler and nvprof cuda	0	773	July 29, 2022
Multi GPU results in latencies in Linux CUDA Programming and Performance	4	1895	April 25, 2012
"Unified Memory Profiling is not supported ..." warning 3348 Visual Profiler and nvprof	15	5744	September 20, 2018
CUDA Visual Profiler Warning (dropped rows) CUDA Programming and Performance	10	21615	October 7, 2011
Strange behavior with overlap of transfer and compute CUDA Programming and Performance	4	3937	October 19, 2011
Gap between some thread calls CUDA Programming and Performance	6	1267	October 30, 2014
Optimisation using Visual profiler Some guess I would like to discuss with you CUDA Programming and Performance	5	1616	April 10, 2012
Overlapping kernel execution and data transfer CUDA Programming and Performance	9	3390	May 10, 2017
How to explain the performance difference? CUDA Programming and Performance	7	3506	March 26, 2008
Compute Visual Profiler # runs Compute Visual Profiler on CUDA CUDA Programming and Performance	2	1890	March 7, 2011

Visual Profiler displays erroneous output with multiple GPUs Profiler problem on multi-gpu scaling b

Related topics