Same resnext101 model size for dense and sparse

I am following the instructions in the NVIDIA Technical Blog post "Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT",

but after downloading the dense and sparse models with:
ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1
ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1

Surprisingly, the two downloaded ONNX models have exactly the same file size:
354782502 Dec 20 17:57 resnext101_32x8d_pyt_torchvision_sparse.onnx
354782502 Dec 20 18:00 resnext101_32x8d_pyt_torchvision_dense.onnx

I expected the sparse model to be smaller, but both files are exactly the same size. Could this be the reason some people have reported no performance difference?

Hi,

Could you share the checksums of both models?

$ md5sum resnext101_32x8d_pyt_torchvision_sparse.onnx
$ md5sum resnext101_32x8d_pyt_torchvision_dense.onnx

Thanks.

$ md5sum resnext101_32x8d_pyt_torchvision_dense.onnx
49beb2920f6f6e42eb20b874a30eab98

$ md5sum resnext101_32x8d_pyt_torchvision_sparse.onnx
c962aeafd8a7000f3c72bbfcd2165572

Hi,

Have you tried running inference with TensorRT (for example, trtexec)?
The sparse and dense models are likely saved with the same storage layout, so the file sizes are identical.
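
In ONNX, every weight tensor is stored densely, so a 2:4-pruned model keeps its pruned weights as explicit zeros and the file size does not change. You can confirm this with a quick script along these lines (a sketch using the onnx and numpy Python packages; the file names are taken from your post):

# Sketch: compare the fraction of zero-valued weights in the two models.
# Assumes the `onnx` package is installed and both files are in the
# current directory.
import numpy as np
import onnx
from onnx import numpy_helper

for path in ("resnext101_32x8d_pyt_torchvision_dense.onnx",
             "resnext101_32x8d_pyt_torchvision_sparse.onnx"):
    model = onnx.load(path)
    total = zeros = 0
    for init in model.graph.initializer:
        w = numpy_helper.to_array(init)
        total += w.size
        zeros += int(np.count_nonzero(w == 0))
    print(f"{path}: {zeros / total:.1%} zero-valued weights")

# Expectation: the sparse model shows roughly 50% zeros in the pruned
# layers (the 2:4 pattern), yet both files are the same size because
# every element is stored, zero or not.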

Thanks.

Yes, here are the results:

  1. with resnext101_32x8d_pyt_torchvision_sparse.onnx
    Throughput: 146.487 qps
    Total Host Walltime: 3.01733 s
    Total GPU Compute Time: 3.00983 s

  2. with resnext101_32x8d_pyt_torchvision_dense.onnx
    Throughput: 116.562 qps
    Total Host Walltime: 3.02844 s
    Total GPU Compute Time: 3.01938 s

It seems there is about a 25% throughput improvement with sparsity enabled (146.487 qps / 116.562 qps ≈ 1.26).
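
For anyone reproducing this: structured sparsity must be requested when the TensorRT engine is built. With trtexec that is the --sparsity=enable flag; through the TensorRT Python API the equivalent is roughly the sketch below (an illustration assuming TensorRT 8.x, not the exact commands used for the numbers above):

# Sketch: build an engine with structured-sparsity kernels enabled.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("resnext101_32x8d_pyt_torchvision_sparse.onnx"):
    raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # allow 2:4 sparse tactics
config.set_flag(trt.BuilderFlag.FP16)            # sparsity is typically paired with FP16/INT8

engine_bytes = builder.build_serialized_network(network, config)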

A follow-up question: for the ResNeXt101 model at least, about a 25% performance improvement is observed on Jetson AGX Orin when the model runs on the GPU. What about DLA? Is there any performance improvement with sparsity enabled when the model runs on DLA?

Hi,

It depends on the model.

Although DLA can increase overall inference throughput, it supports only a limited set of layer types.
If a model has to fall back to the GPU frequently, the data-transfer overhead can slow down performance.
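
If you want to measure it yourself, trtexec can target DLA with the --useDLACore=0 and --allowGPUFallback options. Through the Python API, the equivalent configuration is roughly this sketch (an illustration assuming TensorRT 8.x on Jetson; DLA requires FP16 or INT8 precision):

# Sketch: build an engine that runs on DLA and falls back to the GPU
# for unsupported layers.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("resnext101_32x8d_pyt_torchvision_sparse.onnx"):
    raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # DLA needs FP16 or INT8
config.default_device_type = trt.DeviceType.DLA  # prefer DLA for all layers
config.DLA_core = 0                              # AGX Orin has two DLA cores: 0 and 1
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers go to the GPU

engine_bytes = builder.build_serialized_network(network, config)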

Thanks.
