Help with boosting performance on dual GPU

I have access to a dual-GPU GeForce GTX Titan Z and am attempting to adapt an existing single-GPU code to run on both GPUs for a significant performance boost. I have the code working on both GPUs and producing results consistent with the single-GPU version, but I am not seeing any speedup. Since I am new to multi-GPU programming, I was hoping to get some advice.

Unfortunately, I cannot simply split the problem into two independent parts, run one on each GPU, and collate the results at the end.

To give an overview: each iteration, the code runs two algorithms over an array of data. The first algorithm computes a number for each element, and the second updates weights for the next iteration. For the first algorithm, the array can be divided into two completely independent halves. In the second (weight-update) algorithm, however, each element can potentially depend on any other element in the array, so the halves are not independent.

For my dual-GPU implementation, I use a single CPU thread that iterates over the devices, with pinned host memory for the data and asynchronous copies. The main data array is split between the two devices.

for each iteration {

    for (device = 0; device < 2; device++)
        call kernel_1<<< >>>(d_a[device])

    for (device = 0; device < 2; device++)
        async copy my half of the data from device to host

    synchronize both devices

    for (device = 0; device < 2; device++)
        async copy the other half of the data from host back to device

    for (device = 0; device < 2; device++)
        call kernel_2<<< >>>(d_a[device])

} // end for each iteration
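In concrete CUDA terms, the loop above might look like the following sketch. All names (kernel_1, kernel_2, d_a, h_a, halfN) and the launch configuration are placeholders, not my actual code; it assumes each device holds a full-size buffer, with its own half at offset dev * halfN.

```cuda
#include <cuda_runtime.h>

__global__ void kernel_1(float *half)  { /* independent per-element work */ }
__global__ void kernel_2(float *whole) { /* weight update, reads full array */ }

void run_iteration(float *d_a[2], float *h_a, size_t halfN,
                   cudaStream_t stream[2])
{
    const size_t halfBytes = halfN * sizeof(float);
    const int blocks = 256, threads = 256;  // placeholder launch config

    // 1. kernel_1: the halves are independent, one per device.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        kernel_1<<<blocks, threads, 0, stream[dev]>>>(d_a[dev] + dev * halfN);
    }

    // 2. Async copy each device's own half into pinned host memory.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(h_a + dev * halfN, d_a[dev] + dev * halfN, halfBytes,
                        cudaMemcpyDeviceToHost, stream[dev]);
    }

    // 3. Both halves must reach the host before the cross-copies start.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    // 4. Give each device the half produced by the other device.
    for (int dev = 0; dev < 2; dev++) {
        const int other = 1 - dev;
        cudaSetDevice(dev);
        cudaMemcpyAsync(d_a[dev] + other * halfN, h_a + other * halfN,
                        halfBytes, cudaMemcpyHostToDevice, stream[dev]);
    }

    // 5. kernel_2 sees the full, refreshed array on each device.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        kernel_2<<<blocks, threads, 0, stream[dev]>>>(d_a[dev]);
    }
}
```

Note that with a single host thread, the cudaSetDevice calls are cheap, but the synchronize in step 3 serializes the iteration around the host round trip, which matches the transfer cost I am seeing.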

I will attempt to speed this up using peer-to-peer copies between the devices, but this currently isn't working (the NVIDIA P2P sample test fails), and I have filed a bug report with NVIDIA.

But does anyone know of a better way to approach this type of problem? Could I use zero-copy (mapped) memory for the host data array, or would that not be a good fit here?
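For reference, here is roughly what I mean by zero-copy: a hedged sketch (sizes and names are made up) that maps pinned host memory into both devices' address spaces so kernels read it directly over PCIe, with no explicit copies. My understanding is that every access then pays PCIe latency, so this can lose to staged copies when data is reused heavily.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t N = 1 << 20;           // placeholder array size
    float *h_a = nullptr;
    float *d_ptr[2] = {nullptr, nullptr};

    // Mapping must be enabled on each device before its context is active.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaSetDeviceFlags(cudaDeviceMapHost);
    }

    // Pinned, mapped, portable allocation visible to both devices.
    cudaHostAlloc(&h_a, N * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocPortable);

    // Obtain a device-side pointer to the same memory for each GPU.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaHostGetDevicePointer(&d_ptr[dev], h_a, 0);
    }

    // d_ptr[0] / d_ptr[1] can now be passed straight to kernels on each
    // device; writes by one GPU become visible without cudaMemcpy.
    printf("mapped pointers: %p %p\n", (void *)d_ptr[0], (void *)d_ptr[1]);

    cudaFreeHost(h_a);
    return 0;
}
```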


A sense of where the time is being spent will help guide your efforts. Fundamentally you are trying to optimize your code (using a particular method: multi-GPU) but you may be flying blind.

I would start either with the single GPU case or your current dual-GPU implementation, and run the visual profiler to get a sense of where the time is being spent. Are the kernels taking a long time to execute, or are they very short? Is most of the time being spent on kernel execution? Or is most of the time being spent on data copy/data movement?

Then, in the dual-GPU case, if your first kernel (which can be easily separated between 2 GPUs) is using significant computation time, then use the visual profiler to confirm that your attempt to get the 2 GPUs to run concurrently is actually working.

I did some profiling of the dual GPU case and found most of the time is being spent on kernel execution for each iteration. Also, I did confirm that the kernels do run concurrently. It looks like the memory transfer is hurting performance the most.

Try using peer-to-peer transfers instead of going through host memory.