Help with boosting performance on dual GPU

I have access to a dual-GPU GeForce GTX Titan Z and am attempting to adapt an existing single-GPU code to run on both GPUs for a significant performance boost. I have the code working on both GPUs and producing results consistent with the single-GPU version, but I am not seeing any speedup. Since I am new to multi-GPU programming, I was hoping to get some advice.

Unfortunately, I cannot simply split the problem into two independent parts, run one on each GPU, and collate the results at the end.

To give an overview: each iteration, the code runs two algorithms over an array of data. The first algorithm computes a number for each element, and the second updates weights for the next iteration. For the first algorithm, the array can be divided into two completely independent halves. In the second (weight-update) algorithm, however, each element can potentially depend on any other element in the array, so the halves are not independent.

For my dual-GPU implementation, I use a single CPU thread that iterates over the devices, with pinned host memory for the data and asynchronous copies. The main data array is split between the two devices.

for each iteration {

    for (device = 0; device < 2; device++)
        call kernel_1<<< >>>(d_a[device])

    for (device = 0; device < 2; device++)
        async copy my half of the data from device to host

    synchronize both devices

    for (device = 0; device < 2; device++)
        async copy the other half of the data from host back to device

    for (device = 0; device < 2; device++)
        call kernel_2<<< >>>(d_a[device])

} // end for each iteration
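In concrete CUDA terms, the loop above might look like the following sketch. All names (kernel_1, kernel_2, d_a, h_a, halfN) and the launch configuration are placeholders, not my actual code; it assumes each device holds a full-size buffer, with its own half at offset dev * halfN.

```cuda
#include <cuda_runtime.h>

__global__ void kernel_1(float *half)  { /* independent per-element work */ }
__global__ void kernel_2(float *whole) { /* weight update, reads full array */ }

void run_iteration(float *d_a[2], float *h_a, size_t halfN,
                   cudaStream_t stream[2])
{
    const size_t halfBytes = halfN * sizeof(float);
    const int blocks = 256, threads = 256;  // placeholder launch config

    // 1. kernel_1: the halves are independent, one per device.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        kernel_1<<<blocks, threads, 0, stream[dev]>>>(d_a[dev] + dev * halfN);
    }

    // 2. Async copy each device's own half into pinned host memory.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(h_a + dev * halfN, d_a[dev] + dev * halfN, halfBytes,
                        cudaMemcpyDeviceToHost, stream[dev]);
    }

    // 3. Both halves must reach the host before the cross-copies start.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    // 4. Give each device the half produced by the other device.
    for (int dev = 0; dev < 2; dev++) {
        const int other = 1 - dev;
        cudaSetDevice(dev);
        cudaMemcpyAsync(d_a[dev] + other * halfN, h_a + other * halfN,
                        halfBytes, cudaMemcpyHostToDevice, stream[dev]);
    }

    // 5. kernel_2 sees the full, refreshed array on each device.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        kernel_2<<<blocks, threads, 0, stream[dev]>>>(d_a[dev]);
    }
}
```

Note that with a single host thread, the cudaSetDevice calls are cheap, but the synchronize in step 3 serializes the iteration around the host round trip, which matches the transfer cost I am seeing.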

I will attempt to speed this up using peer-to-peer copies between the devices, but this currently isn't working (the NVIDIA P2P sample test fails), and I have filed a bug report with NVIDIA.

But does anyone know of a better way to approach this type of problem? Could I use zero-copy (mapped) memory for the host data array, or would that not be a good fit here?
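For reference, here is roughly what I mean by zero-copy: a hedged sketch (sizes and names are made up) that maps pinned host memory into both devices' address spaces so kernels read it directly over PCIe, with no explicit copies. My understanding is that every access then pays PCIe latency, so this can lose to staged copies when data is reused heavily.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t N = 1 << 20;           // placeholder array size
    float *h_a = nullptr;
    float *d_ptr[2] = {nullptr, nullptr};

    // Mapping must be enabled on each device before its context is active.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaSetDeviceFlags(cudaDeviceMapHost);
    }

    // Pinned, mapped, portable allocation visible to both devices.
    cudaHostAlloc(&h_a, N * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocPortable);

    // Obtain a device-side pointer to the same memory for each GPU.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaHostGetDevicePointer(&d_ptr[dev], h_a, 0);
    }

    // d_ptr[0] / d_ptr[1] can now be passed straight to kernels on each
    // device; writes by one GPU become visible without cudaMemcpy.
    printf("mapped pointers: %p %p\n", (void *)d_ptr[0], (void *)d_ptr[1]);

    cudaFreeHost(h_a);
    return 0;
}
```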


A sense of where the time is being spent will help guide your efforts. Fundamentally you are trying to optimize your code (using a particular method: multi-GPU) but you may be flying blind.

I would start either with the single GPU case or your current dual-GPU implementation, and run the visual profiler to get a sense of where the time is being spent. Are the kernels taking a long time to execute, or are they very short? Is most of the time being spent on kernel execution? Or is most of the time being spent on data copy/data movement?

Then, in the dual-GPU case, if your first kernel (which can be easily separated between 2 GPUs) is using significant computation time, then use the visual profiler to confirm that your attempt to get the 2 GPUs to run concurrently is actually working.

I did some profiling of the dual GPU case and found most of the time is being spent on kernel execution for each iteration. Also, I did confirm that the kernels do run concurrently. It looks like the memory transfer is hurting performance the most.

Try using peer-to-peer transfers instead of going through host memory.