Thanks for that. However, the FP64 usage itself seems to be normal; please see the following results. I wonder if there is a problem/bug in the way that IPC is reported.
$ nv-nsight-cu-cli -k kernel --metrics smsp__inst_executed.avg.per_cycle_active,smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active,smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_peak_sustained_elapsed ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30357 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 3 passes
Time spent in different stages of GPU_CUDA KERNEL:
0.295962005854 s, 42.142059326172 % : GPU: SET DEVICE / DRIVER INIT
0.000716999988 s, 0.102093696594 % : GPU MEM: ALO
0.000874000019 s, 0.124448947608 % : GPU MEM: COPY IN
0.403941005468 s, 57.517200469971 % : GPU: KERNEL
0.000426000013 s, 0.060658186674 % : GPU MEM: COPY OUT
0.000376000011 s, 0.053538680077 % : GPU MEM: FRE
Total time:
0.702296018600 s
==PROF== Disconnected from process 30357
[30357] lavaMD@127.0.0.1
_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:46:51, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
smsp__inst_executed.avg.per_cycle_active inst/cycle 0,02
smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active % 21,62
smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_ % 15,14
peak_sustained_elapsed
---------------------------------------------------------------------- --------------- ------------------------------
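As a rough sanity check (not an authoritative derivation), the two FP64 percentages above seem to line up once normalization is accounted for: the fp64 pipe metric counts warp-level instructions per *active* cycle, while the dadd/dmul/dfma metric counts thread-level instructions per *elapsed* cycle. Using the active/elapsed cycle counts from the Speed Of Light run below and the threads-per-instruction ratio from the last run, and assuming the FP64 instructions match the kernel-wide predication average:

# Hedged sanity check relating the two FP64 percentages.
# Assumes the kernel-wide threads-per-warp-instruction ratio (25.04, from the
# last run below) also holds for the FP64 instructions, and that essentially
# all fp64-pipe instructions are DADD/DMUL/DFMA.
pipe_pct_active     = 21.62                    # warp-level, per *active* cycle
active_over_elapsed = 78_852_296 / 89_401_499  # SM active / elapsed cycles (SOL run below)
avg_thread_fraction = 25.04 / 32               # avg active threads per warp instruction

predicted = pipe_pct_active * active_over_elapsed * avg_thread_fraction
print(f"predicted dadd/dmul/dfma % of peak (elapsed): {predicted:.1f}")
# prints ~14.9, close to the reported 15.14, so the two numbers look consistent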
$ nv-nsight-cu-cli -k kernel ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30379 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 8 passes
Time spent in different stages of GPU_CUDA KERNEL:
0.298682987690 s, 20.712144851685 % : GPU: SET DEVICE / DRIVER INIT
0.000508000026 s, 0.035227213055 % : GPU MEM: ALO
0.000913000025 s, 0.063311897218 % : GPU MEM: COPY IN
1.141149044037 s, 79.132865905762 % : GPU: KERNEL
0.000433999987 s, 0.030095688999 % : GPU MEM: COPY OUT
0.000380000012 s, 0.026351064444 % : GPU MEM: FRE
Total time:
1.442067027092 s
==PROF== Disconnected from process 30379
[30379] lavaMD@127.0.0.1
_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:47:45, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 9,24
SM Frequency cycle/nsecond 1,44
Elapsed Cycles cycle 89.401.499
Memory [%] % 0,93
SOL DRAM % 0,03
Duration msecond 62,08
SOL L1/TEX Cache % 1,05
SOL L2 Cache % 0,20
SM Active Cycles cycle 78.852.295,93
SM [%] % 76,27
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 128
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1.000
Registers Per Thread register/thread 46
Shared Memory Configuration Size Kbyte 102,40
Driver Shared Memory Per Block Kbyte/block 1,02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 7,20
Threads thread 128.000
Waves Per SM 1,47
---------------------------------------------------------------------- --------------- ------------------------------
WRN A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 319 thread blocks.
Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for
up to 50.0% of the total kernel runtime with a lower occupancy of 20.6%. Try launching a grid with no
partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for
a grid.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 16
Block Limit Registers block 10
Block Limit Shared Mem block 12
Block Limit Warps block 12
Theoretical Active Warps per SM warp 40
Theoretical Occupancy % 83,33
Achieved Occupancy % 66,18
Achieved Active Warps Per SM warp 31,77
---------------------------------------------------------------------- --------------- ------------------------------
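Incidentally, the launch/occupancy numbers above can be reproduced by hand. A minimal sketch, assuming a GA10x-class GPU with 68 SMs, 64K registers per SM, a 48-warp-per-SM limit, and 256-register warp allocation granularity (all inferred from the report, not confirmed):

# Hedged back-of-the-envelope for the Launch Statistics / Occupancy tables.
# The GPU parameters below are assumptions inferred from the report, not given.
sms              = 68      # inferred from Waves Per SM: 1000 / (10 * 68) ~ 1.47
regs_per_sm      = 65536
max_warps_per_sm = 48
warp_alloc_gran  = 256     # registers allocated per warp in chunks of 256

block_size      = 128
warps_per_block = block_size // 32                            # 4
regs_per_warp   = 46 * 32                                     # 1472
regs_per_warp_alloc = -(-regs_per_warp // warp_alloc_gran) * warp_alloc_gran  # 1536

limit_regs  = regs_per_sm // (regs_per_warp_alloc * warps_per_block)  # 10
limit_smem  = int(102.40 // (7.20 + 1.02))                            # 12
limit_warps = max_warps_per_sm // warps_per_block                     # 12

blocks_per_sm  = min(16, limit_regs, limit_smem, limit_warps)         # 10
theo_occupancy = 100 * blocks_per_sm * warps_per_block / max_warps_per_sm  # 83.33
waves          = 1000 / (blocks_per_sm * sms)                         # ~1.47
print(limit_regs, limit_smem, limit_warps, theo_occupancy, round(waves, 2))

That gives block limits of 10/12/12, 83.33% theoretical occupancy, and ~1.47 waves per SM, all matching the tables; the partial wave is then roughly 1000 - 680 = 320 blocks (the report says 319, presumably a rounding difference).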
$ nv-nsight-cu-cli -k kernel --metrics smsp__sass_thread_inst_executed_op_conversion_pred_on.sum,smsp__sass_thread_inst_executed_op_memory_pred_on.sum,smsp__sass_thread_inst_executed_op_control_pred_on.sum,smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_fp64_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum,smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.sum,smsp__sass_thread_inst_executed_op_misc_pred_on.sum,smsp__issue_active.avg.pct_of_peak_sustained_active,smsp__inst_issued.avg.per_cycle_active,smsp__thread_inst_executed_per_inst_executed.ratio ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30407 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 3 passes
Time spent in different stages of GPU_CUDA KERNEL:
0.291828989983 s, 41.035057067871 % : GPU: SET DEVICE / DRIVER INIT
0.000501999981 s, 0.070587903261 % : GPU MEM: ALO
0.000921999977 s, 0.129645511508 % : GPU MEM: COPY IN
0.417104989290 s, 58.650535583496 % : GPU: KERNEL
0.000429000007 s, 0.060323130339 % : GPU MEM: COPY OUT
0.000383000006 s, 0.053854916245 % : GPU MEM: FRE
Total time:
0.711170017719 s
==PROF== Disconnected from process 30407
[30407] lavaMD@127.0.0.1
_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:53:43, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
smsp__inst_issued.avg.per_cycle_active inst/cycle 0,02
smsp__issue_active.avg.pct_of_peak_sustained_active % 2,02
smsp__sass_thread_inst_executed_op_control_pred_on.sum inst 790.896.192
smsp__sass_thread_inst_executed_op_conversion_pred_on.sum inst 0
smsp__sass_thread_inst_executed_op_fp32_pred_on.sum inst 219.520.000
smsp__sass_thread_inst_executed_op_fp64_pred_on.sum inst 7.244.416.000
smsp__sass_thread_inst_executed_op_integer_pred_on.sum inst 1.842.891.872
smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on. inst 0
sum
smsp__sass_thread_inst_executed_op_memory_pred_on.sum inst 706.616.512
smsp__sass_thread_inst_executed_op_misc_pred_on.sum inst 6.131.712
smsp__thread_inst_executed_per_inst_executed.ratio 25,04
---------------------------------------------------------------------- --------------- ------------------------------
The number of FP64 instructions is considerable. However, the FP64 utilization is not 100%.
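Regarding the IPC question: as a cross-check, the raw counters above do reproduce the reported 0.02 inst/cycle, so if something is off it is not a simple accounting bug. A minimal sketch, again assuming 68 SMs x 4 SMSPs (an inference, not confirmed) and taking the SM active-cycle count from the Speed Of Light run:

# Hedged cross-check of smsp__inst_executed.avg.per_cycle_active (reported 0.02).
# The SMSP count (68 SMs * 4 partitions) is an assumption inferred above.
thread_insts = (790_896_192 + 219_520_000 + 7_244_416_000
                + 1_842_891_872 + 706_616_512 + 6_131_712)  # sum of the op counters
threads_per_inst = 25.04          # smsp__thread_inst_executed_per_inst_executed
warp_insts = thread_insts / threads_per_inst                # ~431.7M warp instructions

smsps         = 68 * 4
active_cycles = 78_852_296        # SM Active Cycles from the SOL section

ipc = warp_insts / (smsps * active_cycles)
print(f"IPC per SMSP: {ipc:.3f}")  # ~0.020, matching the report

So the ~2% issue utilization appears to be real: with roughly two thirds of all thread instructions going through the narrow FP64 pipe, warps presumably spend most cycles stalled behind it, which would explain a high SM [%] together with a very low IPC.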