addendum 2:
I placed a copy of libcupti.so.10.2 in the working directory. I still got the “not found” message, but nvprof ran. I have some questions about the output.
[Note: I worked hard with C++ classes to force the heavy array work to stay on the GPU without copying back and forth (the final results depend only on a sampling of the arrays). I ONLY got it to work under tesla:managed; without the managed keyword, the arrays all ended up zero.]
First, I ran the time command on the executable and got this:
35.025u 2.510s 0:37.89 99.0% 0+0k 0+16io 21385pf+0w
At the end of nvprof, I got this:
==11536== Unified Memory profiling result:
Device “GeForce GT 710 (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
366 5.0596KB 4.0000KB 12.000KB 1.808594MB 1.121408ms Host To Device
43353 35.177KB 4.0000KB 0.9961MB 1.454407GB 462.0970ms Device To Host
Total CPU Page faults: 21378
This at least suggests that I have minimized the GPU-CPU copying to less than one second out of 37.
However, in the profiling table, I found this:
==11536== Profiling application: SiteRepair --print-gpu-trace
==11536== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 49.97% 17.2702s 5001 3.4533ms 3.4384ms 3.5945ms EulerIntegrationMethod::Increment_636_gpu(void)
27.07% 9.35680s 5001 1.8710ms 1.8447ms 2.0914ms EulerIntegrationMethod::GetLaplacians_585_gpu(void)
11.65% 4.02709s 5001 805.26us 801.83us 810.82us EulerIntegrationMethod::GetLaplacians_571_gpu(void)
10.81% 3.73508s 5001 746.87us 442.64us 749.48us IntegrationMethod::ManageSites_482_gpu(double)
0.48% 166.67ms 5001 33.327us 31.390us 35.902us EulerIntegrationMethod::Increment_681_gpu(void)
0.02% 8.1630ms 1 8.1630ms 8.1630ms 8.1630ms IntegrationMethod::Initialize_422_gpu(void)
0.00% 6.2080us 1 6.2080us 6.2080us 6.2080us IntegrationMethod::Initialize_460_gpu(void)
…followed by this:
API calls: 93.39% 34.7281s 25007 1.3887ms 3.5960us 8.1614ms cuEventSynchronize
5.65% 2.10217s 25007 84.063us 18.405us 1.6821ms cuLaunchKernel
0.33% 122.13ms 50014 2.4410us 1.3520us 243.89us cuEventRecord
0.24% 90.436ms 1 90.436ms 90.436ms 90.436ms cuMemAllocManaged
Since the entire run took about 37 seconds, the GPU activities (~34.6 s) plus the API calls (~37.0 s) add up to almost double that. Is there overcounting here?
Bottom line: if I acquired a graphics card with 6x the CUDA cores of my current one, might I expect roughly a 6x overall speedup on a code like this? I also don't know whether something is missing from the profiler output because of the missing-library message.