For my application, I call cusparseScsrmm 10 times. csrValA, csrRowPtrA, and csrColIndA vary between calls, while B stays the same (see http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csrmm for reference).
When I run all 10 cusparseScsrmm calls on the first GPU of my K80, each call takes approximately 5 seconds. If I split the 10 calls across the two GPUs in the K80 and call cusparseScsrmm in parallel, each call takes 10 seconds instead. Ideally, each call would still take 5 seconds, so the 10 calls would finish in 25 seconds.
Any suggestions on what I could try so that each call to cusparseScsrmm takes only 5 seconds in the dual-GPU configuration?
Thank you for the help.