Performance of 6000 Ada vs. H100 for multi-modal object detection training

Hello,

We are currently evaluating the performance differences between the RTX 6000 Ada and the H100 on some real-world tasks. For this, we focus on training multi-modal object detection models, specifically Sparse4D v3.

I ran our benchmark on both GPUs, but the machines had slightly different specifications:
Setting 1:

  • 1x 6000 Ada
  • 16 cores of AMD EPYC 9354
  • 125 GB RAM
  • Unknown local SSD (the mini dataset easily fits into the RAM cache, so it should not matter)

Setting 2:

  • 1x H100 SXM
  • 16 cores of Intel Xeon Platinum 8462Y+
  • 250 GB RAM
  • Unknown local SSD (the mini dataset easily fits into the RAM cache, so it should not matter)

With this setup, I measured a time of 1.64 s per training step on the RTX 6000 Ada (including a data time of 0.07 s). On the H100, one step took 1.02 s (data time: 0.05 s).
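The step and data times above are the per-iteration values reported by the training framework's logger. Conceptually they correspond to something like the following minimal sketch (a hypothetical loop, not the actual Sparse4D v3 training code; note the synchronization before stopping the clock so that asynchronous GPU work is included):

```python
import time
import torch

def time_steps(model, optimizer, data_loader, n_steps=50, warmup=10):
    """Rough per-step timing, split into data loading and total step time."""
    step_times, data_times = [], []
    it = iter(data_loader)
    for step in range(n_steps):
        t0 = time.perf_counter()
        batch = next(it)                       # data loading / host-side preprocessing
        t1 = time.perf_counter()

        optimizer.zero_grad(set_to_none=True)
        loss = model(batch)                    # forward pass (placeholder signature)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()               # wait for the GPU before stopping the clock
        t2 = time.perf_counter()

        if step >= warmup:                     # discard warm-up iterations
            data_times.append(t1 - t0)
            step_times.append(t2 - t0)

    print(f"time: {sum(step_times) / len(step_times):.2f}s, "
          f"data_time: {sum(data_times) / len(data_times):.2f}s")
```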

Based on these step times, the H100 delivered 162% of the performance of the RTX 6000 Ada.
On another in-house benchmark with similar characteristics, the H100 achieved only 135% of the performance of the RTX 6000 Ada.
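For reference, here is the arithmetic behind the 162% figure (computed on the compute portion of the step, i.e. step time minus data time; the ratio of the total step times comes out almost identical at ~161%):

```python
# Measured per-step times in seconds
ada_step, ada_data = 1.64, 0.07
h100_step, h100_data = 1.02, 0.05

# Compare the compute portion of the step (step time minus data-loading time)
ada_compute = ada_step - ada_data        # 1.57 s
h100_compute = h100_step - h100_data     # 0.97 s

print(f"{ada_compute / h100_compute:.0%}")  # -> 162%
```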

I would like to understand what speedup I should generally expect on similar ML tasks. Comparing, for example, the theoretical FP16 tensor core throughput, the H100 should have 273% of the performance of the RTX 6000 Ada.
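The 273% figure follows from the approximate dense FP16 tensor core throughput listed in the datasheets (rounded values, no sparsity):

```python
# Approximate dense FP16 tensor core throughput (TFLOPS, without sparsity)
h100_sxm_fp16 = 989.4       # H100 SXM5 datasheet value
rtx6000_ada_fp16 = 362.6    # RTX 6000 Ada datasheet value

print(f"{h100_sxm_fp16 / rtx6000_ada_fp16:.0%}")  # -> 273%
```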

You can find the necessary steps to reproduce my benchmark results here:

Best regards,

Ole

Verify the temperature and utilization of the H100 during the benchmark, and if possible use Nsight Compute to check the kernel occupancy during the benchmark run.
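One way to log temperature, utilization, and power while the benchmark runs is the NVML Python bindings (a sketch; it assumes the `nvidia-ml-py`/`pynvml` package is installed and is equivalent to polling `nvidia-smi`):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU

# Poll once per second while the training benchmark runs in another process
for _ in range(60):
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in milliwatts
    print(f"temp={temp}C gpu_util={util.gpu}% mem_util={util.memory}% power={power_w:.0f}W")
    time.sleep(1)

pynvml.nvmlShutdown()
```

For occupancy, Nsight Compute can be attached to the training process, e.g. `ncu --target-processes all -o report <training command>` (adjust the command to your setup); sustained low GPU utilization or throttling would explain part of the gap to the theoretical 273%.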