I am new to CUDA programming and working on a project with a very close deadline. I have parallelized my code (basically made it run on CUDA), but I realize it is far from optimal in the following ways:
- I have a large data structure (a C struct with three two-dimensional arrays and three one-dimensional arrays) that makes a round trip to the GPU and back on every iteration of a loop that runs at least 1,000 times. (I free the device pointers each time the data is copied back into the host structure.)
- I have a large 2D data array (wrapped in a C struct) containing 1,000,000x100 floating-point values, which is copied to the GPU once at the beginning of execution and freed at the end.
- Each kernel launch is very heavy: I launch only about 50 threads, and each thread does a lot of computation. There is scope for dynamic parallelism within each of the 50 threads, but the naive first approach was simply to launch these 50 threads.
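To make the above concrete, here is roughly the shape of the code. All struct, variable, and kernel names are placeholders, not my real identifiers, and the error checking is elided here:

```cuda
// Hypothetical sketch of the loop described above: the small struct's
// arrays are allocated, copied in, processed by ~50 heavy threads,
// copied back, and freed on every iteration.
for (int iter = 0; iter < 1000; ++iter) {
    float *d_mat, *d_vec;
    cudaMalloc(&d_mat, ROWS * COLS * sizeof(float));
    cudaMalloc(&d_vec, LEN * sizeof(float));
    cudaMemcpy(d_mat, h_mat, ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec, h_vec, LEN * sizeof(float), cudaMemcpyHostToDevice);

    // One block of ~50 threads; d_big stands for the 1,000,000x100
    // array that was copied to the device once, before this loop.
    heavyKernel<<<1, 50>>>(d_big, d_mat, d_vec);
    cudaDeviceSynchronize();

    cudaMemcpy(h_mat, d_mat, ROWS * COLS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_vec, d_vec, LEN * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_mat);   // device pointers freed each iteration, as described
    cudaFree(d_vec);
}
```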
My problem:
My code works fine when the 2D data array has dimensions 600x10, but when I run the same code on the 1,000,000x100 data (the file is about 600 MB) on a Quadro P5000 with 16 GB of memory, the program crashes at the first kernel launch:
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaLaunch.
All the preceding CUDA API calls (cudaMalloc, cudaMemcpy) return cudaSuccess.
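For reference, I check the return codes with a macro along these lines (a standard CUDA error-checking pattern, not my exact code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: CUDA error: %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(err));                              \
            exit(EXIT_FAILURE);                                           \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   heavyKernel<<<1, 50>>>(...);
//   CUDA_CHECK(cudaGetLastError());   // surfaces launch errors like the one above
```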
Here are the sizes of the data structures printed from the code:
Data structure from the first bullet: 244,400 bytes
Data structure from the second bullet: 418,578,340 bytes
When I reduce the data array to 100,000x10, the kernel launches but fails after a couple of iterations with a memory error.
Could someone please shed some light on how I should go about fixing this?
Sorry for the verbose post.