Out of memory when allocating local memory

Yes, there is another limit on local memory (and, relatedly, on the stack, since the stack lives in each thread's logical local space). njuffa has described it here.

I think if you run through that math for your V100 GPU, you will find the problem. The reservation is sized as if every thread that can be resident on the device at once needs its own allocation. For a V100 that is 80 SMs × 2048 threads per SM, so your 210816 byte per thread request requires 210816 × 80 × 2048 = 34540093440 bytes when considered device-wide, and that exceeds the 32GB available on your GPU. (Anticipating: No, the launch configuration <<<1,1>>> is not considered in this analysis.)
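If it helps, here is the arithmetic as a small host-side sketch. The function and constant names are made up for illustration. The V100 figures (80 SMs, 2048 resident threads per SM) and treating "32GB" as exactly 32 GiB are assumptions, and the usable memory on a real device will be somewhat lower. This only mirrors the calculation above, not whatever the driver does internally.

```cpp
#include <cstdint>

// Device-wide local memory needed if every thread that can be resident
// on the GPU at once gets `bytes_per_thread` of local memory.
constexpr std::uint64_t device_wide_local_bytes(std::uint64_t bytes_per_thread,
                                                std::uint64_t sm_count,
                                                std::uint64_t max_threads_per_sm) {
    return bytes_per_thread * sm_count * max_threads_per_sm;
}

// Largest per-thread request that still fits in `device_bytes`
// under the same model (integer division rounds down).
constexpr std::uint64_t max_local_bytes_per_thread(std::uint64_t device_bytes,
                                                   std::uint64_t sm_count,
                                                   std::uint64_t max_threads_per_sm) {
    return device_bytes / (sm_count * max_threads_per_sm);
}

// Assumed V100 figures (hypothetical constant names).
constexpr std::uint64_t kSmCount         = 80;
constexpr std::uint64_t kMaxThreadsPerSm = 2048;
constexpr std::uint64_t kDeviceBytes     = 32ULL << 30;  // 32 GiB, idealized

// The request from this thread: 210816 bytes per thread.
static_assert(device_wide_local_bytes(210816, kSmCount, kMaxThreadsPerSm)
                  == 34540093440ULL,
              "device-wide requirement for the 210816-byte request");
static_assert(device_wide_local_bytes(210816, kSmCount, kMaxThreadsPerSm)
                  > kDeviceBytes,
              "the request does not fit in 32 GiB");
```

On a real system you would take the three inputs from cudaGetDeviceProperties (the multiProcessorCount, maxThreadsPerMultiProcessor, and totalGlobalMem fields of cudaDeviceProp) rather than hard-coding them. Under the idealized numbers above, max_local_bytes_per_thread comes out a bit below your 210816 byte request, which is consistent with the failure you are seeing.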