I’m running code that executes cuSPARSE gtsv across 8 GPUs. The problem is that the calls do not seem to run asynchronously: the total execution time is 8 times that of a single execution. Can someone suggest the most efficient way to parallelise the code across all 8 GPUs? I’ve tried multithreading, but it seems a bit fragile with a large number of cuSPARSE handles.
Cheers