I am dealing with some problems with CUDA 2.0 on a Quadro FX 1700 (512 MB).
I am converting a number of algorithms from CPU to CUDA. Everything seems fine, but I am getting a number of errors that I have not been able to fix (I went through the docs, but I am not experienced with CUDA yet).
The application uses a series of kernels (so far only two) to convert a large volume to an image. This means the data passed to CUDA can easily exceed 100 MB and go up to 300 MB.
Problem 1, Kernel 1: This kernel receives the loaded volume as input, uses a block of 16x16 threads and a grid of imagedims/block_size, and loops over each slice of the volume performing some calculations. During execution, though, the application crashes with an out-of-memory error and a device restart (on Vista). However, if I change the loop so that it goes from slice 0 to numSlices * 0.5 and add a second kernel that does the same but loops from numSlices * 0.5 to numSlices, then I get no errors. Why is that, and is there any way to avoid it?
Problem 2, Kernel 2: If I manage to get to the second kernel, which uses the same block and grid, the launch fails with a "too many resources" error.
Running nvcc with --ptxas-options=-v, I get the following output for each kernel:
ptxas info : Compiling entry function '__globfunc__Z18KERNEL2PsS_4int3S0_s6float3Pf'
ptxas info : Used 40 registers, 68+64 bytes smem, 52 bytes cmem[1]
ptxas info : Compiling entry function '__globfunc__Z18KERNEL1PsiiiiPdS0_S0_S0_S0_S0_Pis'
ptxas info : Used 23 registers, 9600+0 bytes lmem, 66+64 bytes smem, 8 bytes cmem[1]
ptxas info : Compiling entry function '__globfunc__Z17KERNEL1PART2PsiiiiPdS0_S0_S0_S0_S0_Pis'
ptxas info : Used 23 registers, 9600+0 bytes lmem, 66+64 bytes smem, 8 bytes cmem[1]
Kernel 1 doesn't seem heavy enough to cause such an exception.
Kernel 2, on the other hand, is heavy only because it needs 40 registers, but I have successfully run projects from the NVIDIA samples that use much heavier kernels and a larger number of threads.
I call cudaMalloc properly for all the data I pass, and I think my code is heading in the right direction, but I am still a bit inexperienced with CUDA, so I guess I am missing something here. Any help would be really appreciated.
As for the first: you probably hit the time limit on kernel execution (the watchdog).
For the second: check that number_of_threads_per_block * your_registers_usage < 8192,
that shared memory usage is below 16 KB,
and, of course, that your block and grid dimensions are within the limits.
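To make the register check concrete with the numbers from the ptxas output: a 16x16 block is 256 threads, so at 40 registers per thread KERNEL2 needs 256 * 40 = 10240 registers, which is over 8192. At 23 registers, KERNEL1 needs 256 * 23 = 5888 and fits. A minimal host-side sketch of that arithmetic follows; the helper name blockFitsInRegisters is made up for illustration, and the check ignores the hardware's register allocation granularity, so treat it as a first approximation only.

```cuda
#include <cstdio>

// Hypothetical helper (not part of the CUDA API): rough check of whether one
// block fits in the register budget. Real hardware rounds allocations up, so
// a result close to the limit may still fail to launch.
bool blockFitsInRegisters(int threadsPerBlock, int regsPerThread, int regLimit) {
    return threadsPerBlock * regsPerThread <= regLimit;
}

int main() {
    const int regLimit = 8192;      // limit quoted in the reply above
    const int threads  = 16 * 16;   // block size from the original post

    // Register counts taken from the ptxas output in the original post.
    printf("KERNEL1, 23 regs, %d threads: %s\n", threads,
           blockFitsInRegisters(threads, 23, regLimit) ? "fits" : "too many registers");
    printf("KERNEL2, 40 regs, %d threads: %s\n", threads,
           blockFitsInRegisters(threads, 40, regLimit) ? "fits" : "too many registers");

    // A smaller block for the 40-register kernel: 16x12 = 192 threads.
    printf("KERNEL2, 40 regs, %d threads: %s\n", 16 * 12,
           blockFitsInRegisters(16 * 12, 40, regLimit) ? "fits" : "too many registers");
    return 0;
}
```

If this is the cause, two options are a smaller block for the second kernel (16x12 is 192 threads, or 7680 registers) or capping register use with nvcc's --maxrregcount option, which may push values into slower local memory.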
OK, I played with it a bit more and tried reducing the block size. It no longer crashes right away, but it does after two or three frames. It seems a bit unstable, since CUDA returns an unknown error. Is there any way to trace that?
Using emulation mode, I get no errors.
How can I find out the time limit on kernel execution?
And how can I find out the limits on block and grid dimensions?
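The block, grid, register and shared memory limits can be read at run time with cudaGetDeviceProperties. Here is a minimal sketch, assuming device 0 is the card in question. The watchdog query at the end uses cudaDeviceGetAttribute, which comes from toolkits newer than CUDA 2.0, and it only reports whether a run time limit is active. As far as I know the runtime does not report the length of the limit, which is set by the OS and display driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                  %s\n", prop.name);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:    %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:     %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Total global memory:     %zu bytes\n", prop.totalGlobalMem);

    // Non-zero means kernels on this device are subject to a run time limit.
    int timeoutEnabled = 0;
    err = cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, device);
    if (err == cudaSuccess) {
        printf("Kernel run time limit:   %s\n", timeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}
```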
Well, it might be that your kernel is accessing memory addresses that it shouldn’t, that is usually a guaranteed crash. Without seeing the code it’s hard to make suggestions what could be causing the crashes.
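One common source of out-of-range accesses with a 16x16 block is an image size that is not a multiple of 16: the grid either misses the edge pixels (if imagedims/block_size rounds down) or, if rounded up, runs threads past the image. The sketch below shows a bounds guard plus error checks after the launch, which is also a way to trace where an "unknown error" first appears. The kernel projectVolume, the helper ceilDiv, the CUDA_CHECK macro and the sizes are all made up for illustration and are not the original code. cudaDeviceSynchronize is the name in current toolkits; CUDA 2.0 used cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: abort with file and line when a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: integer division rounded up.
inline int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical kernel: one thread per output pixel, summing over all slices.
__global__ void projectVolume(const short* volume, float* image,
                              int width, int height, int numSlices) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard threads outside the image

    float sum = 0.0f;
    for (int z = 0; z < numSlices; ++z) {
        sum += volume[(size_t)z * width * height + (size_t)y * width + x];
    }
    image[(size_t)y * width + x] = sum;
}

int main() {
    const int width = 100, height = 60, numSlices = 8;  // not multiples of 16
    const size_t volumeBytes = (size_t)width * height * numSlices * sizeof(short);
    const size_t imageBytes  = (size_t)width * height * sizeof(float);

    short* dVolume = NULL;
    float* dImage  = NULL;
    CUDA_CHECK(cudaMalloc(&dVolume, volumeBytes));
    CUDA_CHECK(cudaMalloc(&dImage, imageBytes));
    CUDA_CHECK(cudaMemset(dVolume, 0, volumeBytes));

    dim3 block(16, 16);
    dim3 grid(ceilDiv(width, block.x), ceilDiv(height, block.y));
    projectVolume<<<grid, block>>>(dVolume, dImage, width, height, numSlices);

    CUDA_CHECK(cudaGetLastError());       // reports launch failures (bad config, too many resources)
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures that occur while the kernel runs

    CUDA_CHECK(cudaFree(dVolume));
    CUDA_CHECK(cudaFree(dImage));
    printf("kernel completed without errors\n");
    return 0;
}
```

Checking after every launch matters because kernel launches are asynchronous: without it, an error from one kernel tends to surface at some later, unrelated call.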
Is there any way to exceed that limit? In kernel 1 I am performing some mathematical calculations that take some time to finish. I am sure it doesn't take more than 5 seconds so far, but it could easily exceed that. Can I alter the kernel execution time limit?
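One approach that does not depend on changing the limit is to generalize the two-kernel split from the first post: launch the same kernel several times, each time over a smaller range of slices, so that no single launch runs long enough to trip the watchdog. A minimal sketch follows. The kernel accumulateSlices, the helper chunkEnd and the chunk size of 32 slices are assumptions for illustration, and the right chunk size would have to be found by timing the real kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: end of the slice range that starts at `begin`.
inline int chunkEnd(int begin, int chunkSize, int total) {
    return (begin + chunkSize < total) ? begin + chunkSize : total;
}

// Hypothetical kernel: adds slices [sliceBegin, sliceEnd) into the image, so
// repeated launches over consecutive ranges build up the full result.
__global__ void accumulateSlices(const short* volume, float* image,
                                 int width, int height,
                                 int sliceBegin, int sliceEnd) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    size_t pixel = (size_t)y * width + x;
    float sum = image[pixel];  // partial result from earlier launches
    for (int z = sliceBegin; z < sliceEnd; ++z) {
        sum += volume[(size_t)z * width * height + pixel];
    }
    image[pixel] = sum;
}

int main() {
    const int width = 64, height = 64, numSlices = 100;
    const int slicesPerLaunch = 32;  // assumption: tune so each launch stays short
    const size_t volumeBytes = (size_t)width * height * numSlices * sizeof(short);
    const size_t imageBytes  = (size_t)width * height * sizeof(float);

    short* dVolume = NULL;
    float* dImage  = NULL;
    if (cudaMalloc(&dVolume, volumeBytes) != cudaSuccess ||
        cudaMalloc(&dImage, imageBytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(dVolume, 0, volumeBytes);
    cudaMemset(dImage, 0, imageBytes);  // accumulation starts from zero

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    for (int begin = 0; begin < numSlices; begin += slicesPerLaunch) {
        int end = chunkEnd(begin, slicesPerLaunch, numSlices);
        accumulateSlices<<<grid, block>>>(dVolume, dImage, width, height, begin, end);

        // Wait for this chunk so any failure is attributed to the right range.
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "slices [%d, %d) failed: %s\n", begin, end, cudaGetErrorString(err));
            return 1;
        }
    }

    cudaFree(dVolume);
    cudaFree(dImage);
    printf("all slice ranges processed\n");
    return 0;
}
```

The data stays on the device between launches, so the extra cost is mostly the launch overhead per chunk.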
EDIT: I get a crash that seems to happen at random times in kernel 1, with an out-of-memory error. The data is the same each time, but roughly one run in three Vista restarts the device. I have checked and lowered the block count, and register and shared memory usage are the same as posted above. As far as I can tell I am not exceeding any limit that could cause this. In my opinion, CUDA seems to be a bit unstable. Otherwise, why does it crash at random times?
P.S. Could it be because of my OS? I am running Vista 64-bit, but have installed the 32-bit SDK as the 64-bit one was causing problems.
I wish I could post the code, but that is impossible in its current state, since it's quite big and the main project is in Java.