I am new to CUDA programming and working on a project with a very close deadline. I have parallelized my code (basically made it run on CUDA), but I realize it is far from optimal in the following ways:
- I have a large data structure (a C struct with three two-dimensional arrays and three one-dimensional arrays) that makes a round trip to the GPU and back on every iteration of a loop that runs at least 1,000 times. (I free the device pointers each time the data is copied back into the host structure.)
- I have a large 2D data array (wrapped in a C struct) containing 1,000,000x100 floating-point values, which is copied to the GPU once at the beginning of execution and freed at the end.
- Each kernel launch is very heavy: I launch only about 50 threads, and each thread does a lot of computation. There is scope for dynamic parallelism within each of the 50 threads, but the naive first approach was simply to launch these 50 threads.
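To make the above concrete, here is roughly the shape of the code. All struct, variable, and kernel names are placeholders, not my real identifiers, and the error checking is elided here:

```cuda
// Hypothetical sketch of the loop described above: the small struct's
// arrays are allocated, copied in, processed by ~50 heavy threads,
// copied back, and freed on every iteration.
for (int iter = 0; iter < 1000; ++iter) {
    float *d_mat, *d_vec;
    cudaMalloc(&d_mat, ROWS * COLS * sizeof(float));
    cudaMalloc(&d_vec, LEN * sizeof(float));
    cudaMemcpy(d_mat, h_mat, ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec, h_vec, LEN * sizeof(float), cudaMemcpyHostToDevice);

    // One block of ~50 threads; d_big stands for the 1,000,000x100
    // array that was copied to the device once, before this loop.
    heavyKernel<<<1, 50>>>(d_big, d_mat, d_vec);
    cudaDeviceSynchronize();

    cudaMemcpy(h_mat, d_mat, ROWS * COLS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_vec, d_vec, LEN * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_mat);   // device pointers freed each iteration, as described
    cudaFree(d_vec);
}
```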
My problem:
My code works fine when the 2D data array has dimensions 600x10, but when I run the same code on the 1,000,000x100 data (the file is about 600 MB) on a Quadro P5000 with 16 GB of memory, the program crashes at the first kernel launch:
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaLaunch.
All the preceding CUDA API calls (cudaMalloc, cudaMemcpy) return cudaSuccess.
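For reference, I check the return codes with a macro along these lines (a standard CUDA error-checking pattern, not my exact code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: CUDA error: %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(err));                              \
            exit(EXIT_FAILURE);                                           \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   heavyKernel<<<1, 50>>>(...);
//   CUDA_CHECK(cudaGetLastError());   // surfaces launch errors like the one above
```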
Here are the sizes of the data structures printed from the code:
Data structure from the first bullet: 244,400 bytes
Data structure from the second bullet: 418,578,340 bytes
When I reduce the data array to 100,000x10, the kernel launches but fails after a couple of iterations with a memory error.
Could someone please shed some light on how I should go about fixing this?
Sorry for the verbose post.