I use the PyTorch ImageNet example, add import torch.cuda.profiler as profiler, and break out of the training loop after the first iteration:
profiler.start()

# compute output
output = model(images)
loss = criterion(output, target)

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
profiler.stop()
optimizer.step()

# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()

if i % args.print_freq == 0:
    progress.display(i + 1)

break
To get the floating-point instruction counts and DRAM bytes for every kernel, I run:

ncu --profile-from-start off --metrics gpu__time_duration.sum,dram__bytes_read.sum,dram__bytes_write.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum --csv --page raw --log-file resnet2-l4.csv --target-processes all --clock-control none python main.py --dummy
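The log file then has to be read back into a pandas dataframe before it can be post-processed. Below is a minimal sketch of that step (load_ncu_csv is only an illustrative helper, and it assumes the log may contain ==PROF== status lines ahead of the CSV header, which are filtered out first):

import pandas as pd
from io import StringIO

def load_ncu_csv(path):
    # Keep only the CSV portion of the log; ncu status lines start with "==".
    with open(path) as f:
        csv_text = "".join(line for line in f if not line.startswith("=="))
    return pd.read_csv(StringIO(csv_text))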
I use the following code to calculate the total FLOPs, DRAM bytes, operational intensity (OI), and achieved FLOP rate:
def print_flops_for_dataframe(columns):
    metric_cols = ["dram__bytes_read.sum", "dram__bytes_write.sum",
                   "gpu__time_duration.sum",
                   "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum",
                   "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum",
                   "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"]
    subsetted_dataframe = columns[metric_cols].copy()
    # Drop the first row under the header: it holds the metric units, not values.
    subsetted_dataframe = subsetted_dataframe.iloc[1:, :]
    # The values are strings with thousands separators; convert them to floats.
    for col in subsetted_dataframe.columns:
        subsetted_dataframe[col] = subsetted_dataframe[col].str.strip().str.replace(',', '').astype(float)
    # Each FFMA counts as 2 FLOPs, each FMUL and FADD as 1.
    flops = (2 * subsetted_dataframe['smsp__sass_thread_inst_executed_op_ffma_pred_on.sum']
             + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fmul_pred_on.sum']
             + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fadd_pred_on.sum'])
    flops = flops.sum()
    print("flops.sum:", f"{flops / 1e9:.2f} GFLOP")
    # gpu__time_duration.sum is reported in nanoseconds; convert to seconds.
    time = subsetted_dataframe['gpu__time_duration.sum'].sum() / 1e9
    print("time in secs:", time)
    total_bytes = (subsetted_dataframe['dram__bytes_read.sum'] + subsetted_dataframe['dram__bytes_write.sum']).sum()
    print("total_bytes:", total_bytes)
    # Operational intensity: FLOPs per DRAM byte.
    OI = flops / total_bytes
    print("OI:", OI)
    Flop_sec = flops / time / 1e9
    print("Flops/sec:", f"{Flop_sec:.2f} GFLOPS/sec")
The results are really strange. I use the same model and the same batch size, but I get different FLOP counts and byte counts on an NVIDIA T4 and an L4 (densenet.csv and resnet.csv are from the T4). The timings, on the other hand, make sense: the L4 is much faster.