Columns are execution time, in milliseconds, 200 kernel runs on a 512MB buffer of floats.
Cuda memcheck (10.0.130) doesn’t report any errors.
I’ve experimented a bit more and managed to get the profilers to work with the 432.00 driver. I don’t know how; maybe something was wrong with my setup and one of the experiments caused it to “fix itself”.
Anyway, the Visual Profiler 10.1 runs the test app just fine, but it reports the warning
“==21264== Warning: CDP tracing and profiling are not supported on devices with compute capability 7.0 and later.” and no kernels are visible on the GPU timeline, only memory copies. The CPU timeline shows calls like cudaDeviceSynchronize, cudaMemcpy and cudaMalloc, but no kernel launches.
Nsight Compute 2019.5.0 shows some data. I’ve glanced over the available metrics and I’m attaching a few. Should I upload the whole CSV somewhere?
A Grid Size
B dram__bytes_read.sum.pct_of_peak_sustained_elapsed [%]
C dram__bytes_write.sum [byte]
D gpu__time_duration.sum [nsecond]
E inst_executed [inst]
F smsp__pcsamp_warps_issue_stalled_barrier [warp]
G launch__grid_size
A | B | C | D | E | F | G
7680, 1, 1 | 19,53 (-46,68%) | 537 105 472 (-6,46%) | 8 617 984 (+94,02%) | 58 722 656 (-0,27%) | 0 (-100,00%) | 30
122880, 1, 1 | 29,19 (-20,28%) | 545 572 256 (-4,98%) | 5 704 160 (+28,42%) | 58 758 656 (-0,21%) | 5 (-66,67%) | 480
261120, 1, 1 | 34,91 (-4,67%) | 555 609 248 (-3,24%) | 4 692 608 (+5,64%) | 58 801 856 (-0,14%) | 8 (-46,67%) | 1 020
522240, 1, 1 | 36,62 (+0,00%) | 574 186 656 (+0,00%) | 4 441 888 (+0,00%) | 58 883 456 (+0,00%) | 15 (+0,00%) | 2 040
1044480, 1, 1 | 37,63 (+2,76%) | 611 965 376 (+6,58%) | 4 452 000 (+0,23%) | 59 046 656 (+0,28%) | 37 (+146,67%) | 4 080
2088960, 1, 1 | 35,64 (-2,70%) | 687 132 832 (+19,67%) | 4 667 456 (+5,08%) | 59 373 056 (+0,83%) | 76 (+406,67%) | 8 160
4177920, 1, 1 | 30,55 (-16,57%) | 837 530 880 (+45,86%) | 5 576 544 (+25,54%) | 60 025 856 (+1,94%) | 126 (+740,00%) | 16 320
8355840, 1, 1 | 23,14 (-36,82%) | 1 137 820 064 (+98,16%) | 7 111 232 (+60,09%) | 61 331 456 (+4,16%) | 435 (+2 800,00%) | 32 640
16711680, 1, 1 | 16,04 (-56,20%) | 1 713 416 160 (+198,41%) | 10 596 192 (+138,55%) | 63 942 656 (+8,59%) | 2 062 (+13 646,67%) | 65 280
134217728, 1, 1 | 3,72 (-89,84%) | 8 083 191 040 (+1 307,76%) | 49 686 496 (+1 018,59%) | 100 663 296 (+70,95%) | 34 662 (+230 980,00%) | 524 288
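The percentages in parentheses look like deltas against a baseline row, which seems to be the 522240 one since it shows +0,00% everywhere. That reading is my assumption. A minimal Python sketch to check it against two of the values above (the helper name `pct_delta` is made up for illustration):

```python
def pct_delta(value, baseline):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# Baseline values taken from the 522240 row (assumed to be the baseline).
BASELINE_DURATION_NS = 4441888      # gpu__time_duration.sum
BASELINE_BYTES_WRITTEN = 574186656  # dram__bytes_write.sum

# Values from the 7680 row.
print(f"duration delta: {pct_delta(8617984, BASELINE_DURATION_NS):+.2f}%")
print(f"bytes written delta: {pct_delta(537105472, BASELINE_BYTES_WRITTEN):+.2f}%")
```

Both results round to the figures listed for the 7680 row, so the baseline interpretation seems to hold for these columns.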
Here are the metrics for the 436.02 driver, which gives proper performance:
A | B | C | D | E | F | G
7680, 1, 1 | 19,82 | 536860736 | 8473696 | 50334048 | 0 | 30
122880, 1, 1 | 29,64 | 536894848 | 5619744 | 50370048 | 0 | 480
261120, 1, 1 | 36,95 | 536939136 | 4521600 | 50413248 | 0 | 1020
522240, 1, 1 | 39,45 | 536861184 | 4159424 | 50494848 | 0 | 2040
1044480, 1, 1 | 42,2 | 536818752 | 3891456 | 50658048 | 0 | 4080
2088960, 1, 1 | 43,26 | 536889728 | 3837888 | 50984448 | 0 | 8160
4177920, 1, 1 | 44,47 | 536977152 | 3712416 | 51637248 | 0 | 16320
8355840, 1, 1 | 45,44 | 536839552 | 3661920 | 52942848 | 0 | 32640
16711680, 1, 1 | 45,99 | 536782656 | 3633760 | 55554048 | 0 | 65280
134217728, 1, 1 | 45,88 | 536688512 | 3631104 | 92274688 | 0 | 524288
The metrics dram__bytes_write.sum and smsp__pcsamp_warps_issue_stalled_barrier are interesting…
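To put the dram__bytes_write.sum numbers in perspective, here is a rough Python sketch that compares the measured DRAM writes with the buffer size. It assumes the kernel writes each float of the 512 MB buffer exactly once (my assumption, not something the profiler reports), and the helper names are made up for illustration. Only three rows per table are copied in.

```python
# Assumption: the kernel writes every byte of the 512 MB buffer exactly once,
# so the ideal DRAM write traffic equals the buffer size.
BUFFER_BYTES = 512 * 1024 * 1024

# (grid size, dram__bytes_write.sum [byte], gpu__time_duration.sum [ns]),
# three rows copied from each table in this post.
FIRST_TABLE = [
    (7680, 537105472, 8617984),
    (522240, 574186656, 4441888),
    (134217728, 8083191040, 49686496),
]
DRIVER_436_02 = [
    (7680, 536860736, 8473696),
    (522240, 536861184, 4159424),
    (134217728, 536688512, 3631104),
]

def write_amplification(bytes_written, buffer_bytes=BUFFER_BYTES):
    """Ratio of measured DRAM writes to the ideal one-write-per-byte traffic."""
    return bytes_written / buffer_bytes

def duration_ms(duration_ns):
    """Convert a gpu__time_duration.sum value from nanoseconds to milliseconds."""
    return duration_ns / 1e6

def summarize(label, rows):
    for grid, written, ns in rows:
        print(f"{label:>12}  grid={grid:>10}  "
              f"writes={write_amplification(written):6.2f}x buffer  "
              f"time={duration_ms(ns):7.2f} ms")

summarize("first table", FIRST_TABLE)
summarize("436.02", DRIVER_436_02)
```

For the largest grid, the first table shows roughly 15 times the buffer size written to DRAM, while 436.02 stays at about 1 times for every grid size.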