Just a quick question about the output nvprof gives in summary mode, e.g.:

==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
  0.02%  248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]

The ‘Time’ column corresponds to gputime rather than cputime, correct? The profiler user’s guide doesn’t say.

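Correct. It corresponds to “gputime”, i.e. the execution time measured on the GPU. In summary mode it is the total over all calls, so for the kernel row, 301 calls at an average of 3.7051 ms give the 1.11524 s shown.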
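
As a rough sanity check of how the columns relate, here is a small Python sketch. The numbers are copied by hand from the summary above, and the helper names `relative_gap` and `shares` are made up for illustration, not part of nvprof. It assumes Time should be close to Calls × Avg, and that Time(%) is each row's share of the summed Time:

```python
# Rows copied by hand from the nvprof summary above:
# (name, total Time in seconds, Calls, Avg in seconds)
rows = [
    ("matrixMulCUDA", 1.11524, 301, 3.7051e-3),
    ("[CUDA memcpy HtoD]", 406.30e-6, 2, 203.15e-6),
    ("[CUDA memcpy DtoH]", 248.29e-6, 1, 248.29e-6),
]


def relative_gap(total, calls, avg):
    """Relative difference between the Time column and Calls * Avg."""
    return abs(total - calls * avg) / total


def shares(table):
    """Each row's percentage of the summed Time column."""
    grand_total = sum(total for _, total, _, _ in table)
    return [100.0 * total / grand_total for _, total, _, _ in table]


for (name, total, calls, avg), pct in zip(rows, shares(rows)):
    gap = relative_gap(total, calls, avg)
    print(f"{name}: Time={total:.6g}s  Calls*Avg={calls * avg:.6g}s  "
          f"gap={gap:.1e}  share={pct:.2f}%")
```

Any small gap between Time and Calls × Avg comes from the rounding in the printed values.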