Where exactly is the bottleneck? cudaErrorLaunchOutOfResources from cudaGraphInstantiate

Hello, I am learning CUDA and trying to build neural networks with computation graphs.
My GPU is an NVIDIA GT 710 with 2 GB of memory, compute capability sm_35.
I made a simple net with layer sizes 11-6-6 (+1 bias neuron in each layer). It works well.
Now I want to do MNIST digit recognition with a 784-800-10 net (+1 bias neuron per layer), but I get this error:

code=701 (cudaErrorLaunchOutOfResources) "cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0)"

With 784-256-10 the graph instantiates fine, even if I add many layers, e.g. 784-256-256-256-256-256-10.
But if I try to create 784-257-10, I get the error.
How can I find out exactly what the bottleneck is and where?
Ubuntu 20.04 LTS, CUDA 11, Qt Creator.

The problem was that I launched too many sum reductions (cooperative_groups) in parallel.

Before I figured that out, I split each kernel node's launch parameters (threads and thread blocks) down to at most 256 threads per kernel call and 15 thread blocks; with these parameters my graph could sustain the maximum parallel load. I then call such a node N times in parallel, passing the thread-block offset as a parameter to the kernel function.
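The chunking scheme described above can be sketched on the host side like this. This is a hypothetical illustration, not the original code: `fakeKernel` and `launchChunked` are made-up names, and the loops stand in for what the real graph does with parallel kernel nodes, each receiving its block offset as an argument.

```cpp
#include <algorithm>
#include <vector>

constexpr int MAX_THREADS = 256; // max threads per kernel call
constexpr int MAX_BLOCKS  = 15;  // max thread blocks per chunk

// Stand-in for the device kernel body: marks every element that one
// chunk of (blocksInChunk x MAX_THREADS) would process, given the
// chunk's block offset.
void fakeKernel(std::vector<int>& touched, int blockOffset,
                int blocksInChunk, int n) {
    for (int b = 0; b < blocksInChunk; ++b)
        for (int t = 0; t < MAX_THREADS; ++t) {
            int i = (blockOffset + b) * MAX_THREADS + t; // global index
            if (i < n) touched[i] += 1;
        }
}

// Host-side driver: the big launch over `totalBlocks` blocks is split
// into chunks of at most MAX_BLOCKS blocks. In the real graph, each
// iteration would be a separate kernel node running in parallel.
std::vector<int> launchChunked(int n) {
    std::vector<int> touched(n, 0);
    int totalBlocks = (n + MAX_THREADS - 1) / MAX_THREADS;
    for (int off = 0; off < totalBlocks; off += MAX_BLOCKS) {
        int blocksInChunk = std::min(MAX_BLOCKS, totalBlocks - off);
        fakeKernel(touched, off, blocksInChunk, n);
    }
    return touched;
}
```

Because every chunk stays within the per-launch limits, no single node exhausts the launch resources, and the chunks can still be scheduled concurrently.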
After parallelizing the kernel calls that way, I added an empty node to minimize dependencies, and it worked.
Then I replaced the 800 parallel launches of cooperative_groups sum reductions with my own sum reduction, which always splits the vector into 9 parts in the first step and then sums the 9 partial results in the second step.
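A minimal host-side sketch of that two-step reduction (assuming a fixed 9-way split, as described; `twoStepSum` is a made-up name, and in the real kernel the 9 partial sums would be computed by parallel blocks rather than a serial loop):

```cpp
#include <algorithm>
#include <vector>

// Step 1: split the vector into 9 roughly equal parts and sum each
// part independently. Step 2: sum the 9 partial results. This avoids
// the grid-wide cooperative_groups sync of a single-pass reduction.
double twoStepSum(const std::vector<double>& v) {
    const int PARTS = 9;
    double partial[PARTS] = {0.0};
    size_t n = v.size();
    size_t chunk = (n + PARTS - 1) / PARTS;   // ceil(n / 9)
    for (int p = 0; p < PARTS; ++p) {         // step 1: 9 partial sums
        size_t begin = p * chunk;
        size_t end = std::min(n, begin + chunk);
        for (size_t i = begin; i < end; ++i) partial[p] += v[i];
    }
    double total = 0.0;                       // step 2: sum the 9 results
    for (int p = 0; p < PARTS; ++p) total += partial[p];
    return total;
}
```

The point of the fixed 9-way split is that each step needs only a small, bounded number of blocks, so many such reductions can run in parallel without hitting per-launch resource limits.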