Hi everyone. I have two beginner-level questions about how to use the cuDNN graph API:
1. How big should individual graphs be?
The graph API is organized around building a graph of operations, finding an execution engine that can handle them, and running that. How big should a graph be? I gather it's not supposed to be the entire neural network (e.g. not something ResNet-50 sized), but it also seems like it should be more than a single conv+activation. How do I decide how many operations to include in one graph? For example, how would I go about answering questions like these (a sketch of the granularity I have in mind follows the list):
- Should a ResNet block (e.g. norm, conv, activation, conv, dropout, skip connection) be all one cuDNN graph?
- Should a transformer encoder block (e.g. norm, SDPA, skip connection, norm, MLP, skip connection) be all one cuDNN graph?
- Or, continuing with the transformer encoder thought, should the SDPA be its own graph, the MLP its own graph, and so on?
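For concreteness, here's the smallest unit I've been picturing: a single fused conv + bias + ReLU as one graph, sketched with the cudnn-frontend v1 C++ API. All shapes, strides, and tensor names below are placeholders I made up, and I may well be holding the API wrong:

```cpp
#include <cudnn_frontend.h>
namespace fe = cudnn_frontend;

// Sketch of one candidate "graph-sized" unit: conv -> bias add -> ReLU.
// Shapes/strides are arbitrary placeholders (NHWC-packed, fp16 I/O).
fe::graph::Graph make_conv_bias_relu_graph() {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    auto X = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("input")
                              .set_dim({8, 64, 56, 56})
                              .set_stride({56 * 56 * 64, 1, 56 * 64, 64}));
    auto W = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("filter")
                              .set_dim({64, 64, 3, 3})
                              .set_stride({3 * 3 * 64, 1, 3 * 64, 64}));
    auto B = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("bias")
                              .set_dim({1, 64, 1, 1})
                              .set_stride({64, 1, 64, 64}));

    // Convolution, then two pointwise ops that I hope get fused with it.
    auto conv_out = graph.conv_fprop(
        X, W,
        fe::graph::Conv_fprop_attributes()
            .set_padding({1, 1}).set_stride({1, 1}).set_dilation({1, 1}));
    auto bias_out = graph.pointwise(
        conv_out, B,
        fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::ADD));
    auto Y = graph.pointwise(
        bias_out,
        fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::RELU_FWD));
    Y->set_output(true);

    return graph;
}
```

My question is whether the right unit is something like this, the whole residual/encoder block, or somewhere in between, and how I'd figure that out for myself.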
2. I don’t understand engine heuristic modes A and B; can someone explain?
I wasn’t able to follow the documentation here: cudnn_graph Library — NVIDIA cuDNN v9.4.0 documentation. Could someone please explain what each mode does? Specifically, I’m not clear on what “inference time on the CPU” means. Is a neural network involved in selecting the execution plan, and does that neural network run on the CPU? If so, the time cost of this inference could be amortized over many subsequent plan executions, making mode B generally better whenever you’re going to run the graph more than once… right? Or have I completely misinterpreted this?
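For reference, here is where I believe the choice between the modes shows up in the cudnn-frontend v1 API (reusing the hypothetical `make_conv_bias_relu_graph()` helper from my first question; this is just my reading of the build sequence, so corrections welcome):

```cpp
// Sketch: as far as I can tell, the heuristic mode is picked at the point
// where execution plans are created. Here I ask for mode B's ranking
// first, then mode A's, which I *think* acts as a fallback list.
void build_with_heuristics(fe::graph::Graph& graph, cudnnHandle_t handle) {
    if (graph.validate().is_bad()) return;
    if (graph.build_operation_graph(handle).is_bad()) return;
    if (graph.create_execution_plans({fe::HeurMode_t::B,
                                      fe::HeurMode_t::A}).is_bad()) return;
    if (graph.check_support(handle).is_bad()) return;
    if (graph.build_plans(handle).is_bad()) return;
    // ...then graph.execute(handle, variant_pack, workspace) many times.
}
```

If mode B’s extra cost is a one-time CPU inference at plan-creation time in `create_execution_plans`, then it seems like it would pay for itself whenever `execute` is called repeatedly, which is the intuition I’m trying to confirm.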
Thanks!