Thrust dispatch_scan.cuh, 372: out of memory with CUDA 11.8

I am using Windows 10, CUDA 11.8, and Visual Studio 2019 (C++).
I have a small command-line app that uses CUDA kernels and thrust::inclusive_scan. Everything works fine in debug mode.

I copied and pasted the thrust::inclusive_scan call into my 300,000-line C++ program.
In debug mode, I get this at runtime:

CUDA error 2 [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\cub\device\dispatch/dispatch_scan.cuh, 372]: out of memory

I am performing the scan on a 200 KB buffer.

Here is the funky bit.
When I compile the code in my standalone program, it compiles fine.
When I compile the code in my 300,000-line C++ program, I get these warning messages:

1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\thrust/detail/alignment.h(139): warning C4324: 'thrust::detail::aligned_type<2>::type': structure was padded due to alignment specifier
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\thrust/detail/alignment.h(140): warning C4324: 'thrust::detail::aligned_type<4>::type': structure was padded due to alignment specifier
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\thrust/detail/alignment.h(141): warning C4324: 'thrust::detail::aligned_type<8>::type': structure was padded due to alignment specifier

I verified that the source files in both VS solutions are compiled the same way.
I am running on an A100 40 GB card.

Any ideas why I am crashing at runtime?
Any ideas why the compiler warnings are present?
–Bob

My first guess would be that you are out of GPU memory. Depending on exactly what you are doing and what thrust is doing, this doesn’t necessarily mean you have used up all 40 GB of memory. Device-side allocations are carved out of a much smaller heap, 8 MB by default, so if your GPU usage differs at all between the two cases, you might be hitting a limit like that.
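
One way to test that guess is to print the memory figures right before the scan in both programs and compare them. The following is only a rough sketch, not a known fix: the names GpuMemoryReport, query_gpu_memory, print_gpu_memory and scan_with_diagnostics are hypothetical helpers made up for illustration, and the int element type is an assumption about your buffer. The CUDA runtime calls themselves (cudaMemGetInfo, cudaDeviceGetLimit with cudaLimitMallocHeapSize, cudaGetLastError) are standard. The cudaGetLastError check is there on the assumption that an earlier call somewhere in the large program may have left an error pending, which could then surface inside the scan.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/system_error.h>
#include <cstddef>
#include <cstdio>
#include <new>

// Hypothetical helper type: the memory figures relevant to an "out of memory" report.
struct GpuMemoryReport {
    std::size_t free_bytes = 0;         // device memory currently free
    std::size_t total_bytes = 0;        // total device memory
    std::size_t malloc_heap_bytes = 0;  // heap used by in-kernel malloc/new
    cudaError_t status = cudaSuccess;   // result of the queries themselves
};

// Query free/total device memory and the device-side malloc heap size.
GpuMemoryReport query_gpu_memory() {
    GpuMemoryReport r;
    r.status = cudaMemGetInfo(&r.free_bytes, &r.total_bytes);
    if (r.status == cudaSuccess) {
        r.status = cudaDeviceGetLimit(&r.malloc_heap_bytes, cudaLimitMallocHeapSize);
    }
    return r;
}

// Print the report with a label so the two programs can be compared.
void print_gpu_memory(const char* label) {
    const GpuMemoryReport r = query_gpu_memory();
    if (r.status != cudaSuccess) {
        std::printf("[%s] memory query failed: %s\n", label, cudaGetErrorString(r.status));
        return;
    }
    const double mib = 1024.0 * 1024.0;
    std::printf("[%s] free %.1f MiB of %.1f MiB, device malloc heap %.1f MiB\n",
                label, r.free_bytes / mib, r.total_bytes / mib, r.malloc_heap_bytes / mib);
}

// Hypothetical wrapper around the scan; returns true if the scan completed.
// The int element type is an assumption, adjust it to match your buffer.
bool scan_with_diagnostics(thrust::device_vector<int>& data) {
    // cudaGetLastError returns the pending error (if any) and resets it.
    const cudaError_t pending = cudaGetLastError();
    if (pending != cudaSuccess) {
        std::printf("error pending before scan: %s\n", cudaGetErrorString(pending));
    }
    print_gpu_memory("before scan");
    try {
        // In-place inclusive scan over the whole buffer.
        thrust::inclusive_scan(data.begin(), data.end(), data.begin());
    } catch (const thrust::system_error& e) {
        std::printf("thrust error: %s\n", e.what());
        print_gpu_memory("after failure");
        return false;
    } catch (const std::bad_alloc& e) {
        std::printf("allocation failed: %s\n", e.what());
        print_gpu_memory("after failure");
        return false;
    }
    return true;
}
```

If the large program reports far less free memory than the standalone app, or reports a pending error before the scan even starts, that narrows down where to look. If the figures are identical in both, the cause is probably elsewhere.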