Different FLOPs on Different GPUs

I use the PyTorch ImageNet example, add import torch.cuda.profiler as profiler, and break out of the training loop after the first iteration.

            profiler.start()  # begin the ncu capture region (pairs with --profile-from-start off)
            # compute output
            output = model(images)
            loss = criterion(output, target)

            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))

            # compute gradient and do SGD step
            optimizer.zero_grad()
            loss.backward()
            profiler.stop()  # end capture: forward pass, loss, and backward pass are profiled
            
            optimizer.step()

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

            if i % args.print_freq == 0:
                progress.display(i + 1)
            break  # profile only the first iteration

To get the number of floating-point operations and DRAM bytes, I run:

ncu --profile-from-start off --metrics gpu__time_duration.sum,dram__bytes_read.sum,dram__bytes_write.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum --csv --page raw --log-file resnet2-l4.csv --target-processes all --clock-control none python main.py --dummy

I use the following code to calculate FLOPs, time, total bytes, operational intensity, and FLOP/s from the resulting CSV:

def print_flops_for_dataframe(df):
  # Metric columns collected by the ncu command above.
  metric_cols = ["dram__bytes_read.sum", "dram__bytes_write.sum", "gpu__time_duration.sum",
                 "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum",
                 "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum",
                 "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"]
  # Drop the first row (the units row in ncu's CSV output) and work on a copy.
  subsetted_dataframe = df[metric_cols].iloc[1:, :].copy()
  # Strip whitespace and thousands separators, then convert the values to float.
  for col in subsetted_dataframe.columns:
      subsetted_dataframe[col] = subsetted_dataframe[col].str.strip().str.replace(',', '').astype(float)
  # Count each FFMA as two floating-point operations, FMUL and FADD as one each.
  flops = (2 * subsetted_dataframe['smsp__sass_thread_inst_executed_op_ffma_pred_on.sum']
           + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fmul_pred_on.sum']
           + subsetted_dataframe['smsp__sass_thread_inst_executed_op_fadd_pred_on.sum'])
  flops = flops.sum()
  print("flops.sum:", f"{flops / 1e9:.2f} GFLOP")
  # gpu__time_duration.sum is reported in nanoseconds; convert to seconds.
  time = subsetted_dataframe['gpu__time_duration.sum'].sum() / 1e9
  print("time in secs:", time)
  total_bytes = (subsetted_dataframe['dram__bytes_read.sum'] + subsetted_dataframe['dram__bytes_write.sum']).sum()
  print("total_bytes:", total_bytes)
  # Operational intensity: FLOPs per byte of DRAM traffic.
  OI = flops / total_bytes
  print("OI:", OI)
  Flop_sec = flops / time / 1e9
  print("Flops/sec:", f"{Flop_sec:.2f} GFLOP/s")

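For reference, here is a minimal sketch of how I load the CSV and feed it to this function. The banner-line filtering and dtype=str are assumptions about how ncu lays out the log file, and the file name matches the --log-file argument above.

import io
import pandas as pd

# Read the ncu log file, skipping any "==PROF==" banner lines that may precede
# the CSV header, and keep every column as a string so the comma-stripping in
# print_flops_for_dataframe still applies.
with open("resnet2-l4.csv") as f:
    csv_text = "".join(line for line in f if not line.startswith("=="))

print_flops_for_dataframe(pd.read_csv(io.StringIO(csv_text), dtype=str))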

The result is really strange. I use the same model and the same batch size, but I get different FLOP and byte counts on an NVIDIA T4 and an L4 (densenet.csv and resnet.csv are for the T4). The time, on the other hand, makes sense: the L4 is much faster.

The best method would be to collect a full report and compare the generated SASS. The code generation for the two GPUs may differ quite a bit, since the L4 (Ada) has 2x the FP32 throughput of the T4 (Turing) and the compiler may take advantage of that. I would not expect two different GPU architectures to produce the same FLOP count.
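Short of diffing SASS, one quick way to narrow this down is to compare per-kernel FLOP counts between the two runs and see which kernels account for the gap. This is only a sketch: the "Kernel Name" column and the file names (resnet.csv for the T4 run, resnet2-l4.csv for the L4 run) are assumptions based on the CSVs mentioned in this thread.

import io
import pandas as pd

FLOP_COLS = [
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum",
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum",
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum",
]

def per_kernel_flops(path):
    # Skip any "==PROF==" banner lines, drop the units row, and clean the numbers.
    with open(path) as f:
        text = "".join(line for line in f if not line.startswith("=="))
    df = pd.read_csv(io.StringIO(text), dtype=str).iloc[1:, :].copy()
    for col in FLOP_COLS:
        df[col] = df[col].str.strip().str.replace(",", "").astype(float)
    # FLOPs per kernel launch: 2 per FFMA, 1 per FMUL/FADD.
    flops = 2 * df[FLOP_COLS[0]] + df[FLOP_COLS[1]] + df[FLOP_COLS[2]]
    # Aggregate by kernel name ("Kernel Name" is assumed to be the column ncu writes).
    return flops.groupby(df["Kernel Name"]).sum()

t4 = per_kernel_flops("resnet.csv")      # T4 run
l4 = per_kernel_flops("resnet2-l4.csv")  # L4 run
diff = l4.subtract(t4, fill_value=0).sort_values(ascending=False)
print(diff.head(20))  # kernels where the L4 run executed the most extra FLOPs

Kernels that appear in only one of the two lists, or that show large differences, are the ones worth inspecting at the SASS level.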

