I’ve been running some tests distributing our algorithm across two (or more) GPUs, using 2 Tesla C1060 boards and the OpenCL 1.1 beta. I first distributed the job across 2 devices sharing a single context, with the kernels launched from two different host threads. My performance looks like this (CentOS 5.4, driver 258.19):
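For reference, the shared-context setup is structured roughly like the sketch below: one cl_context holding both devices, one command queue per device, and each queue driven from its own pthread. This is only a minimal illustration, not our actual code; the "scale" kernel, the buffer size N, and the worker/worker_arg names are placeholders.

/* Sketch: one context spanning both GPUs, one queue per device,
 * kernels launched from two host threads (pthreads). */
#include <CL/cl.h>
#include <pthread.h>

#define N (1 << 20)  /* placeholder problem size */

static const char *src =
    "__kernel void scale(__global float *d) {"
    "  size_t i = get_global_id(0); d[i] *= 2.0f; }";

static cl_context ctx;
static cl_program prog;

typedef struct { cl_device_id dev; } worker_arg;

static void *worker(void *p)
{
    cl_device_id dev = ((worker_arg *)p)->dev;
    cl_int err;

    /* Each thread gets its own queue but shares the context and program. */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_kernel k = clCreateKernel(prog, "scale", &err);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float),
                                NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseCommandQueue(q);
    return NULL;
}

int main(void)
{
    cl_platform_id plat;
    cl_device_id devs[2];
    cl_uint ndev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 2, devs, &ndev);
    if (ndev > 2) ndev = 2;

    /* One context containing both devices; one program built for both. */
    ctx  = clCreateContext(NULL, ndev, devs, NULL, NULL, &err);
    prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, ndev, devs, NULL, NULL, NULL);

    pthread_t t[2];
    worker_arg a[2] = { { devs[0] }, { devs[1] } };
    for (cl_uint i = 0; i < ndev; ++i)
        pthread_create(&t[i], NULL, worker, &a[i]);
    for (cl_uint i = 0; i < ndev; ++i)
        pthread_join(t[i], NULL);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}

(Built with something like gcc -std=c99 multi.c -lOpenCL -lpthread; error checking omitted for brevity.)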
However, when I instead created 2 contexts, one per thread, and ran the same code, my performance looked like this:
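The two-context version differs only in the worker: each thread builds its own context, program, and queue around a single device, so nothing is shared between the GPUs at the OpenCL level. A minimal sketch, reusing src, N, and worker_arg from the sketch above:

/* Sketch: per-thread context; each GPU gets its own context/program/queue. */
static void *worker_own_ctx(void *p)
{
    cl_device_id dev = ((worker_arg *)p)->dev;
    cl_int err;

    /* Everything is created per-thread; no cross-device sharing. */
    cl_context c = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_program pr = clCreateProgramWithSource(c, 1, &src, NULL, &err);
    clBuildProgram(pr, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(c, dev, 0, &err);
    cl_kernel k = clCreateKernel(pr, "scale", &err);
    cl_mem buf = clCreateBuffer(c, CL_MEM_READ_WRITE, N * sizeof(float),
                                NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseCommandQueue(q);
    clReleaseProgram(pr);
    clReleaseContext(c);
    return NULL;
}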
I looked through the OpenCL 1.1 spec, but couldn’t find any guidance on how this kind of multi-device code should be written. Does anyone have experiences they would like to share? From these results, the multiple-context route seems to be the better approach.