I am trying to run the FP8 E4M3 wgrad (gradient with respect to the weights) operation for 2D convolution. While I do get some speedup for dgrad (gradient with respect to the input), wgrad is extremely slow compared to both FP32 and FP16, often 50x to 100x slower.
The script I used to invoke the kernels can be accessed here. I used cudnn-frontend with the Graph API, which seems to be the preferred way to invoke this functionality.
I have attached the profiling results in this Google spreadsheet. For each input size I have measured fp16 wgrad and fp8 wgrad with a number of different variants (wrt the IO/intermediate/compute data types).
The experiments were run in the following environment:
GPU: NVIDIA H100 80GB HBM3
Distribution: Ubuntu 22.04.4 LTS
CUDA SDK: 12.5.0
cuDNN version: v90300
I am happy to provide more information wrt my environment and/or the experiments performed.
Thanks for the suggestion, although I could not observe speedups after setting the flag.
When I profile this with Nsight Compute, it warns about a "small grid", which leads me to suspect an internal bug in cuDNN's algorithms.
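To put a rough number on the "small grid" warning, here is a minimal sketch of the usual wave arithmetic: how many thread blocks a launch has compared with how many the GPU can keep resident at once. The `GridFill` struct and `grid_fill` helper are hypothetical names made up for illustration. The SM count (132 for an H100 SXM part) and the blocks-per-SM occupancy are assumptions that should be read from `cudaDeviceProp::multiProcessorCount` and from the launch statistics Nsight Compute reports for the actual kernel.

```cpp
#include <cassert>

// Hypothetical helper, not part of cuDNN or Nsight Compute.
struct GridFill {
    int slots;       // blocks the device can hold at once (SMs * blocks per SM)
    double waves;    // grid_blocks / slots; well below 1.0 means an underfilled GPU
    double sm_busy;  // upper bound on the fraction of SMs that receive any block
};

GridFill grid_fill(int grid_blocks, int num_sms, int blocks_per_sm) {
    assert(grid_blocks >= 0 && num_sms > 0 && blocks_per_sm > 0);
    GridFill g;
    g.slots = num_sms * blocks_per_sm;
    g.waves = static_cast<double>(grid_blocks) / g.slots;
    // At most one SM per block can be busy, and never more than all SMs.
    int busy = grid_blocks < num_sms ? grid_blocks : num_sms;
    g.sm_busy = static_cast<double>(busy) / num_sms;
    return g;
}
```

For example, `grid_fill(8, 132, 1)` describes a hypothetical 8-block launch on a 132-SM device: it fills only a small fraction of one wave, so most SMs sit idle no matter how fast each block runs. If the slow FP8 wgrad kernel shows a grid of that order while the FP16 kernel launches hundreds of blocks, that alone could account for a large part of the gap.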
Have you checked the instructions being used by the generated kernels to see if fp8 tensors are actually being used?
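One way to do that check is to dump the kernel's SASS (for example from the Source page in Nsight Compute, or with `cuobjdump -sass` on a binary that contains the kernel) and count which matrix-multiply instruction families appear. The sketch below only scans text. The function name `count_mma_families` is hypothetical, and the mnemonic list is an assumption to verify against your own dump: on Hopper, FP8 tensor core math is generally reported as `QGMMA`, FP16/BF16 as `HGMMA`, and integer as `IGMMA`, while a kernel dominated by `FFMA` is doing plain FP32 math without tensor cores.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts SASS lines per instruction family.
// The mnemonics are assumptions about Hopper SASS; adjust them to your dump.
std::map<std::string, int> count_mma_families(const std::string& sass) {
    static const char* kFamilies[] = {"QGMMA", "HGMMA", "IGMMA", "HMMA", "FFMA"};
    std::map<std::string, int> counts;
    std::istringstream in(sass);
    std::string line;
    while (std::getline(in, line)) {
        for (const char* family : kFamilies) {
            // Count each line once, under the first family it mentions.
            if (line.find(family) != std::string::npos) {
                ++counts[family];
                break;
            }
        }
    }
    return counts;
}
```

If the FP8 wgrad kernel's dump contains no FP8 tensor core mnemonic at all, the selected engine is probably converting to another type or falling back to a non-tensor-core path, which would be consistent with the slowdown.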
Try these changes:
// When creating execution plans:
if (!graph->create_execution_plans({fe::HeurMode_t::A, fe::HeurMode_t::FALLBACK}).is_good())
    throw std::runtime_error("Failed to create execution plans");
// When building plans. BuildPlanPolicy_t is a scoped enum whose values are
// HEURISTICS_CHOICE and ALL, so a single policy is passed rather than an OR-ed mask:
if (!graph->build_plans(handle, fe::BuildPlanPolicy_t::HEURISTICS_CHOICE).is_good())
    throw std::runtime_error("Failed to build plans");