csrgemm csrgemm2

Hi there,
I’m using cuSPARSE csrgemm2 to multiply two CSR matrices. I ran a very simple test with a 4×3 matrix A, computing A * Aᵀ (A times its transpose). There are two things I wanted to confirm.

  1. How can I check that csrgemm2 is using more than one thread?

  2. cudaFree seems to take 750 milliseconds, which doesn’t seem right. I understand that cudaFree may use internal locks, etc., but 750 ms is still quite large. I suspect I’m doing something wrong.

Any ideas, guidance, or RTFM pointers to the relevant sections would be great!

Thanks
H

Use a profiler. For example,

nvprof --print-gpu-trace ./my_app

will show you any and all kernels launched, including the number of blocks and threads associated with each launch.
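
(If your GPU or CUDA version is too new for nvprof, Nsight Systems is its replacement and gives an equivalent kernel trace. The invocation below is the usual one, but check nsys --help for your version:)

nsys profile --stats=true ./my_app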

Worrying about something like this seems misplaced to me, or suggests that you may want to rethink your app design.

If you’re doing cudaFree at the end of your app, who cares? If a 750-millisecond difference in run time matters to your overall performance scenario, your problem is too small to be interesting on GPUs.

If you’re doing cudaFree in a perf-sensitive loop (where this kind of thing could add up), then that is as good a reason as any I can think of not to do it. If you’re doing cudaFree in a loop, you are almost certainly doing some kind of allocation (e.g. cudaMalloc) in that loop as well, and that will cost you too. So don’t allocate/free/allocate/free/allocate/free.

Allocate once, at the beginning of your app, then reuse your allocations.
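
In code, the difference is something like this (a trivial sketch; do_step, bytes, and iters are made-up placeholders, not anything from your app):

// Anti-pattern: allocation churn inside the hot loop
for (int i = 0; i < iters; ++i) {
    void *d_work = NULL;
    cudaMalloc(&d_work, bytes);   // allocation cost paid every iteration
    do_step(d_work);
    cudaFree(d_work);             // free/synchronization cost paid every iteration
}

// Better: allocate once up front, reuse, free once at shutdown
void *d_work = NULL;
cudaMalloc(&d_work, bytes);
for (int i = 0; i < iters; ++i) {
    do_step(d_work);              // same buffer, reused every iteration
}
cudaFree(d_work);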

Thanks Robert!

W.r.t. the app design: the real matrices we need to multiply are of order (20M, 30M) and (30M, 10K). The (4,3) matrix was just a “hello world”, since we are CUDA newbies trying to learn.

Also, I’m not calling cudaFree; csrgemm2 is calling it. It would be mighty nice to call cudaFree ourselves because, like you suggested, we would do it outside the perf loop.
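
For reference, here is roughly the call sequence we are following. It is a condensed sketch of the C = A*B recipe from the cuSPARSE docs, written from memory (error checking, descriptor setup, and the exact NULL handling for the unused D matrix are abbreviated, so treat it as a sketch rather than verbatim code). Note that the pBuffer workspace is allocated, and could be freed, by us, which is why the library-internal cudaFree calls surprised us:

// Assumes cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST)
csrgemm2Info_t info;
cusparseCreateCsrgemm2Info(&info);

// Workspace: sized by the library, but allocated (and freed) by the caller
size_t bufferSize;
cusparseDcsrgemm2_bufferSizeExt(handle, m, n, k, &alpha,
    descrA, nnzA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrRowPtrB, csrColIndB,
    NULL, NULL, 0, NULL, NULL,              // beta == NULL: C = alpha*A*B, D unused
    info, &bufferSize);
void *buffer;
cudaMalloc(&buffer, bufferSize);

// Pass 1: row pointers and nnz of C
int nnzC;
int *csrRowPtrC;
cudaMalloc((void **)&csrRowPtrC, sizeof(int) * (m + 1));
cusparseXcsrgemm2Nnz(handle, m, n, k,
    descrA, nnzA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrRowPtrB, csrColIndB,
    NULL, 0, NULL, NULL,                    // D unused
    info, &nnzC, csrRowPtrC, buffer);

// Pass 2: column indices and values of C
int *csrColIndC;
double *csrValC;
cudaMalloc((void **)&csrColIndC, sizeof(int) * nnzC);
cudaMalloc((void **)&csrValC, sizeof(double) * nnzC);
cusparseDcsrgemm2(handle, m, n, k, &alpha,
    descrA, nnzA, csrValA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrValB, csrRowPtrB, csrColIndB,
    NULL, NULL, 0, NULL, NULL, NULL,        // beta == NULL: D unused
    descrC, csrValC, csrRowPtrC, csrColIndC,
    info, buffer);

cusparseDestroyCsrgemm2Info(info);
cudaFree(buffer);                           // the only cudaFree we call explicitly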

In our situation, the bigger matrix is static - so we are indeed planning to keep it on the device throughout the app runtime.

I’d appreciate your thoughts!
H

Then there is not much you can do.

You can file a bug at developer.nvidia.com

The instructions are linked from a sticky post at the top of the CUDA programming sub-forum.