I use the PyTorch ImageNet example, add import torch.cuda.profiler as profiler, and break out of the training loop after the first iteration:
profiler.start()

# compute output
output = model(images)
loss = criterion(output, target)

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
profiler.stop()
optimizer.step()

# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()

if i % args.print_freq == 0:
    progress.display(i + 1)

break
To get the floating-point instruction counts and DRAM bytes for every kernel, I run:

ncu --profile-from-start off --metrics gpu__time_duration.sum,dram__bytes_read.sum,dram__bytes_write.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum --csv --page raw --log-file resnet2-l4.csv --target-processes all --clock-control none python main.py --dummy
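The log file then has to be read back into a pandas dataframe before it can be post-processed. Below is a minimal sketch of that step (load_ncu_csv is only an illustrative helper, and it assumes the log may contain ==PROF== status lines ahead of the CSV header, which are filtered out first):

import pandas as pd
from io import StringIO

def load_ncu_csv(path):
    # Keep only the CSV portion of the log; ncu status lines start with "==".
    with open(path) as f:
        csv_text = "".join(line for line in f if not line.startswith("=="))
    return pd.read_csv(StringIO(csv_text))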
I use the following code to calculate the total FLOPs, DRAM bytes, operational intensity (OI), and achieved FLOP rate:
def print_flops_for_dataframe(columns):
    metric_cols = ["dram__bytes_read.sum", "dram__bytes_write.sum",
                   "gpu__time_duration.sum",
                   "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum",
                   "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum",
                   "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"]
    subsetted_dataframe = columns[metric_cols].copy()
    # Drop the first row under the header: it holds the metric units, not values.
    subsetted_dataframe = subsetted_dataframe.iloc[1:, :]
    # The values are strings with thousands separators; convert them to floats.
    for col in subsetted_dataframe.columns:
        subsetted_dataframe[col] = subsetted_dataframe[col].str.strip().str.replace(',', '').astype(float)
    # Each FFMA counts as 2 FLOPs, each FMUL and FADD as 1.
    flops = (2 * subsetted_dataframe['smsp__sass_thread_inst_executed_op_ffma_pred_on.sum']
             + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fmul_pred_on.sum']
             + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fadd_pred_on.sum'])
    flops = flops.sum()
    print("flops.sum:", f"{flops / 1e9:.2f} GFLOP")
    # gpu__time_duration.sum is reported in nanoseconds; convert to seconds.
    time = subsetted_dataframe['gpu__time_duration.sum'].sum() / 1e9
    print("time in secs:", time)
    total_bytes = (subsetted_dataframe['dram__bytes_read.sum'] + subsetted_dataframe['dram__bytes_write.sum']).sum()
    print("total_bytes:", total_bytes)
    # Operational intensity: FLOPs per DRAM byte.
    OI = flops / total_bytes
    print("OI:", OI)
    Flop_sec = flops / time / 1e9
    print("Flops/sec:", f"{Flop_sec:.2f} GFLOPS/sec")
The results are really strange. I use the same model and the same batch size, but I get different FLOP counts and byte counts on an NVIDIA T4 and an L4 (densenet.csv and resnet.csv are from the T4). The timings, on the other hand, make sense: the L4 is much faster.