Code does not run with larger file

I am new to CUDA programming and I am working on a project with a very tight deadline. I have parallelized the code (basically made it run on CUDA), but I realize it is not even close to optimal in the following respects:

  1. I have a large data structure (a C struct with three two-dimensional arrays and three one-dimensional arrays) which makes a round trip to the GPU and back inside a loop that runs at least 1,000 times. (I free the device pointers every time the data is copied back into the host structure.)
  2. I have a large 2D data array (wrapped in a C struct) containing 1,000,000x100 floating-point values, which is copied to the GPU once at the beginning of execution and freed at the end.
  3. Each kernel launch is very heavy. I launch about 50 threads, and each thread does a lot of computation. There is scope for dynamic parallelism within each of the 50, but initially the naive way to parallelize was to launch these 50 threads.
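The transfer pattern in #1 looks roughly like this (identifiers simplified; `d_buf`, `h_buf`, `nbytes`, and the launch configuration are placeholders, not my exact code):

```cuda
// Rough shape of the loop in #1 (identifiers simplified):
for (int it = 0; it < 1000; ++it) {
    float *d_buf;
    cudaMalloc((void **)&d_buf, nbytes);                        // allocate on device
    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);   // host -> device
    kernel<<<grid, block>>>(d_buf);                             // heavy computation
    cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_buf);                                            // freed every iteration
}
```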

My problem:
My code works fine for a 2D data array of dimensions 600x10, but when I run the same code on data of dimensions 1,000,000x100 (the input file is about 600MB) on a Quadro P5000 with 16 GB of memory, the program crashes at the first kernel launch:

========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaLaunch.

All preceding CUDA API calls (cudaMalloc, cudaMemcpy) return cudaSuccess.
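I check every runtime call with the usual macro pattern (simplified from my code; `kernel`, `grid`, and `block` are placeholders):

```cuda
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Kernel launches return no status, so they are checked separately:
kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());        // launch-time errors (invalid argument, etc.)
CUDA_CHECK(cudaDeviceSynchronize());   // errors that occur during execution
```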

Here are the sizes of the data structures printed from the code:
Data structure mentioned in #1: 244400 bytes
Data structure mentioned in #2: 418578340 bytes (about 400 MB, consistent with 1,000,000x100 floats, so it should fit comfortably in 16 GB of device memory)

When I reduce the data array to 100,000x10, the kernel launches but fails after a couple of iterations with a memory error.

Could someone please shed some light on how I should go about fixing this?

Sorry for the verbose post.

Just a wild guess: Don’t exceed maximum block and grid limits during kernel launch.

Print your blockDim and gridDim variables (all components) and compare them with the maximum dimensions that the CUDA deviceQuery sample reports for your device. If you are using dynamic parallelism, also make sure the subsequent kernel launches don't exceed these limits.
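Something along these lines prints the relevant limits next to your launch configuration (device 0 assumed):

```cuda
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                    // device 0 assumed
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max block dims: %d x %d x %d\n",
       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
printf("max grid dims:  %d x %d x %d\n",
       prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
```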

Also make sure you are not building for the Compute 2.0 architecture by accident. That would limit gridDim.x to 65535. (The Quadro P5000 is Compute 6.1.)

It might also be that you are exceeding the register or shared memory limits with these larger kernel launches.
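You can query a kernel's per-thread register, local memory, and static shared memory usage at runtime (`myKernel` is a placeholder for your kernel name):

```cuda
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, (const void *)myKernel);   // myKernel: placeholder name
printf("registers per thread:    %d\n", attr.numRegs);
printf("static shared memory:    %zu bytes\n", attr.sharedSizeBytes);
printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
```

Very large per-thread local arrays in particular can make a launch fail even with only a handful of threads.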

Thank you for the insights. I hit the error even when I launch a single thread (not that I am overloading all the computation onto a single thread; rather, instead of painting 100 houses with 100 threads, I am painting only one house).

Also, I have not used local or shared memory yet (I want to optimize once this works), so we can rule out the possibility of those memories overflowing.

Update: I modified the code to use unified memory to eliminate possible human errors in the cudaMalloc/cudaMemcpy calls, but the result is exactly the same.
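For reference, the managed version replaces the explicit copies with something like this (simplified; `kernel`, `grid`, `block`, and `nbytes` are placeholders):

```cuda
float *data = NULL;
cudaMallocManaged(&data, nbytes);     // single allocation visible to host and device

/* ... fill data on the host ... */

kernel<<<grid, block>>>(data);
cudaDeviceSynchronize();              // required before reading results on the host

cudaFree(data);
```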

Any further suggestions will be really helpful.