I’m currently doing some research on automatic parallelization using CUDA, and I’ve been trying to implement some ideas manually into a test program to later integrate into a compiler. I’ve recently hit some kind of performance bottleneck using multiple GPUs that I unfortunately am having problem investigate due to some problem in the Visual Profiler tools.
First, I’ll describe my test program. It is a simple stencil computation, I have a matrix A which gets updated through several iterations. On each iteration, each value in the middle of matrix A is set to the average of its direct neighbors, as in
A’[x,y] = 0.25*(A[x+1,y] + A[x-1,y] + A[x,y+1] + A[x,y-1])
and the edges are just copied.
The matrix computation is divided “horizontally” equally among some number of gpus (i have access to a machine with 8 GTX 580 3GB GPUs). Memory for the full matrix is allocated on each device, and I’m reusing memory across iterations to avoid cudaMalloc/free overheads.
Each iteration is computed using 5 kernels, one to do the average in the middle, and the 4 others to copy the edges (top, left, right, bottom). To compute the inside of the matrix, each GPU needs therefore to copy one row of data from each of its neighbors, that is one copy on the first and last GPU, and two copies for any others. To avoid having the GPUs wait on each other for the missing data, I’m placing the copy of the missing rows and computation of corresponding values in the same stream, different from the rest. So, in order to compute the inside of the matrix there are 2 or 3 streams, 1 computing iteration space for which the data was available (most of it), and 1 or 2 others that copy a missing row and compute the corresponding values. I’m using cudamemcpyPeerAsync() for the copying. I also use cudaDeviceSynchronize() after each iteration.
With this program, I’m measuring execution times for the same size and number of iterations using a varying number of GPUs. I see significantly less than linear improvements by increasing the number of GPUs. Using 5 GPUs or more, the speedup is negligible. I’d like to figure out what the bottleneck is. I’m currently using a matrix of 500M single precision floating point units and 10000 iterations.
Sadly, Visual Profiler does not seem to be working with multiple GPUs. The timeline itself extends to 1000000s(!) even though the program runs in less than 3 minutes. I only see one massive cudamemcpyPeerAsync() block per device on the timeline, no matter how much I zoom in. Anywhere I click on it I see on the properties pane that its duration is all the way to the end of the massive block, but the start is very close to where I clicked. It’s as though the end of each cudamemcpyPeerAsync() instruction is not being recorded, and they’re all overlapping. The per-Thread API calls are mostly missing, and the kernel launches for each devices are very far apart in time from those of the other devices, even though they should be together since I’m synchronizing on each iteration. I’m attaching a screenshot to illustrate this. The low compute/memcpy overlap and low memcpy throughput are expected, but you can see that even there the total time is negative(!).
Interestingly, using the profiler with a single GPU shows expected output.
I’m using the cuda toolkit version 4.1 on a linux x86_64 host. The host is on a shared cluster for which I don’t have admin access, so upgrading to version 4.2 might be an issue. I would appreciate any feedback in either getting the profiler to work or insights into why my application is not scaling well.