cuStreamBeginCapture overhead vs. specialized nodes?

Before I start writing a bunch of test code, can anyone tell me offhand whether this is “clearly a bad idea” or “unclear enough that it must be measured, not predicted”?

It is possible to hand-roll equivalents of the CUDA graph Host/Kernel/Memcpy/Memset nodes using captured streams. But would a large graph composed of many child graphs built from captured streams have significant overhead compared to an equivalent graph composed of the specialized nodes?

Bump