CUDA Kernel Crash

Hello,

We are facing a critical issue with our CUDA kernel.

It seems that when the GPU is given work beyond a certain number of iterations [threads], CUDA freezes the screen and the program crashes within the next few seconds. The CUDA program works fine below that limit.

Logically, the cause could be one of the following, and it is very difficult to find the root cause of this issue.

  • GPU Hardware limitation
  • System configuration
  • Deadlock in the code that appears only beyond some limit. As far as this reason is concerned, we have not been able to find the root cause, since the behavior is OK within the limit.

As a workaround, we decided to make multiple calls to the CUDA kernel so that the iteration count stays within the limit. However, this degrades performance drastically, which is not acceptable, and it is not a foolproof solution because it may still fail in a few cases.

We are using an NVIDIA Quadro K3100M GPU on a Windows system with 30 GB of RAM and an Intel Xeon CPU.

I have tried changing the tdrDelay Windows registry value, but that change did not fix the problem.

Please let us know if more information is needed.
Any help would be greatly appreciated.

Thanks
Jaydeep Patel

Sounds like you are running into the ~2 sec limit of the operating system’s GUI watchdog timer. The GPU can service either the GUI (or other graphics) or a compute kernel. All operating systems supported by CUDA implement a GUI watchdog timer that ensures the GUI does not become unresponsive (“screen freezes”); a GPU reset is initiated when the specified time limit is exceeded, causing the CUDA context to be destroyed.
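Whether the watchdog applies to a particular GPU can be checked from code: the CUDA runtime exposes a device attribute, cudaDevAttrKernelExecTimeout, that is nonzero when kernels on that device are subject to a run time limit. A minimal sketch follows; the helper name watchdogEnabled is made up for illustration and is not part of the CUDA API.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API).
// Returns 1 if kernels on `device` are subject to a run time limit,
// 0 if they are not, and -1 if the query itself failed.
int watchdogEnabled(int device)
{
    assert(device >= 0);
    int enabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(&enabled,
                                             cudaDevAttrKernelExecTimeout,
                                             device);
    return (err == cudaSuccess) ? enabled : -1;
}
```

If this returns 0 for the device that runs the kernel, the timeout is unlikely to be the cause. If it returns 1, long-running kernels on that device can be reset by the watchdog.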

This is a well documented issue.

Possible solutions:

(1) Use a second GPU to service the GUI so first GPU can be configured for compute-only
(2) Disable the GUI watchdog timer (methods are OS dependent, Google is your friend)
(3) Shorten kernel run times by changes to your code
(4) Use a faster GPU to reduce kernel run time

Thanks for your time and attention on this thread.

No, I have changed the tdrDelay Windows registry value from 2 to 60 and I still face the same issue.

Of the possible solutions, can you please help us with solutions 2 and 3?

Solution #2: This seems to be related to system configuration. How can I validate this?

Solution #3: Do you have any ideas for reducing kernel run time using CUDA features other than shared memory? If required, I can share more details of my code so that you can help me better.

Looking forward to your reply as soon as possible.

Thanks

Best Regards,
Jaydeep Patel

I know that one can change or disable the Windows GUI timeout, but I have never done it. You may have changed a registry entry to an incorrect or insufficient value, or you may have to change more than one registry entry.

My preferred solution is (1): use a second GPU for the GUI. Any cheap low-end GPU should be sufficient for driving the desktop. I have also used solution (4): use a faster GPU for CUDA processing.

You can shorten kernel runtime by optimizing the kernel itself with the help of the CUDA profiler, by giving less work to each kernel invocation, or both. Say you are operating on large images: instead of processing a full image per kernel call, operate on a quarter image per kernel call.

If you look at distributed processing projects like Folding @ Home, they are able to break complex and lengthy computations down into parts that are manageable by GPUs ranging from low-end to high-end, spanning multiple GPU generations. So partitioning the work to fit into the watchdog timer limit is generally possible.
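The idea of giving less work to each kernel invocation can be sketched roughly as follows. This is only an illustration under assumptions: processChunk and processInChunks are hypothetical names, the doubling of each element stands in for the real per-item computation, and a suitable chunk size has to be found by measurement on the target GPU.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical per-item work: stands in for the real computation.
__global__ void processChunk(float *data, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) {
        data[offset + i] = 2.0f * data[offset + i];
    }
}

// Launches one kernel per chunk of at most `chunk` items instead of
// one kernel over all `total` items.
cudaError_t processInChunks(float *d_data, size_t total, size_t chunk)
{
    assert(chunk > 0);
    const unsigned threads = 256;
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t remaining = total - offset;
        size_t count = (remaining < chunk) ? remaining : chunk;
        unsigned blocks = (unsigned)((count + threads - 1) / threads);

        processChunk<<<blocks, threads>>>(d_data, offset, count);
        cudaError_t err = cudaGetLastError();   // launch configuration errors
        if (err != cudaSuccess) return err;

        // Wait for this chunk to finish before launching the next one,
        // so each launch is a separate, short piece of GPU work.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

A call such as processInChunks(d_data, n, n / 4) would correspond to the "quarter image per kernel call" example. The synchronization after every chunk costs some throughput, so larger chunks are preferable as long as each launch stays well under the time limit.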

At present I don’t do consulting beyond dispensing (limited) free advice on the internet.

The described situation is strongly consistent with the WDDM TDR timeout. My best guess is that you have modified the timeout incorrectly. A fairly reliable method is to do it from within Nsight VSE. Please read the documentation or google for it if you wish: it is documented.

Alternatively, try running your code on a platform that does not have the timeout, such as a Linux platform with no X configured display.

I got the idea of checking tdrDelay from Google in the first place.

Since we don’t have Nsight, I had to change tdrDelay in the Windows registry and then restart the system; however, that change had no impact on the issue we are facing.

Please just confirm whether the way we changed tdrDelay is the correct way to do it.

You increased the timeout value from 2 to 60 seconds. Consider the possibility that your kernel may run longer than 60 seconds. Maybe there is a bug in the code that causes the kernel to run much longer than you expect, or possibly never terminate at all.
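To find out how long a kernel really runs, and whether it ends with an error, it can be timed with CUDA events and the status checked afterwards. The sketch below is an assumption-laden illustration: workKernel is a hypothetical stand-in for the real kernel and timeWorkKernel is a made-up helper name. When the watchdog resets the GPU, the failure is reported by the synchronizing call; cudaErrorLaunchTimeout is one error code that can show up in that case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void workKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf((float)i);
}

// Runs workKernel once and returns its run time in milliseconds,
// or -1.0f if any CUDA call failed (the reason is printed to stderr).
float timeWorkKernel(float *d_out, int n)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);
    workKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaEventRecord(stop);

    // Errors raised while the kernel runs are reported here.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    float ms = -1.0f;
    if (err == cudaSuccess) {
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    } else {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Measuring the run time for a few input sizes below the crash threshold gives a rough idea of how the time grows, and therefore whether the failing size would plausibly exceed the configured timeout.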

I would like to share more details on this crash issue.

We have a device function, shown below.

__device__ bool CheckIntersectionWithTriangle2(const Triangle *T, const Ray *R)
{
	float dst;
	vect3 I;
	vect3 v;
	vect3 n;
	float cond1;
	float cond2;
	float alpha;
	float beta;
	float gema;
	
	v = VectNorm(R->D);
	n = VectCross(VectSub(T->T2,T->T1),VectSub(T->T3,T->T1));
	
	cond1 = VectDot(n,v);
	
	if(cond1>=-EPSILON && cond1<=EPSILON)
	{
		dst = FLOAT_INF;
		return false;
	}
	
	float t = (n.x*T->T1.x-n.x*R->O.x+n.y*T->T1.y-n.y*R->O.y+n.z*T->T1.z-n.z*R->O.z)/
		(n.x*R->D.x+n.y*R->D.y+n.z*R->D.z);

	if(t<0)
	{
		dst = FLOAT_INF;
		return false;
	}
	
	I.x = R->O.x+t*R->D.x;
	I.y = R->O.y+t*R->D.y;
	I.z = R->O.z+t*R->D.z;
		
	cond2 = VectDot(n,VectCross(VectSub(T->T2,T->T1),VectSub(T->T3,T->T1)));
	alpha = VectDot(n,VectCross(VectSub(T->T2,I),VectSub(T->T3,I)))/cond2;
	if(fabs(alpha)<=EPSILON){alpha = 0.0;}
		
	beta = VectDot(n,VectCross(VectSub(T->T3,I),VectSub(T->T1,I)))/cond2;
	if(fabs(beta)<=EPSILON){beta = 0.0;}
	gema = 1.0f-alpha-beta;
	if(fabs(gema)<=EPSILON){gema = 0.0;}

	__syncthreads();
	if((alpha>=0.0f && alpha<=1.0f) && (beta>=0.0f && beta<=1.0f) &&(gema>=0.0f && gema<=1.0f))
	{
		dst = t;
		return true;
	}
	else
	{
		dst = FLOAT_INF;
		return false;
	}	
}

When I comment out the following piece of code from the above device function, CUDA doesn’t crash.

if((alpha>=0.0f && alpha<=1.0f) && (beta>=0.0f && beta<=1.0f) &&(gema>=0.0f && gema<=1.0f))
	{
		dst = t;
		return true;
	}
	else
	{
		dst = FLOAT_INF;
		return false;
	}

I suspect that some threads are not reaching this conditional statement [a thread race], which causes the issue. As a fix, I inserted a __syncthreads() statement just above the conditional statement, but the issue is still not resolved.

I hope I have conveyed the details of this issue.

Please let me know if you need more details, and please suggest some solutions.

Thanks
Jaydeep Patel
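One more thing worth checking in the function above, independent of the timeout: the __syncthreads() call sits after two early return statements. The CUDA programming guide only allows __syncthreads() in conditional code when the condition evaluates identically across the whole thread block; otherwise the code may hang or misbehave. Since each thread there appears to work only on its own ray and triangle, a barrier is probably not needed inside the device function at all. The sketch below uses hypothetical names (isHit, countHits) and a simplified test in place of the real intersection code; it shows early returns kept in the per-thread function and barriers kept in the kernel body, where every thread of the block reaches them.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-thread test standing in for the ray/triangle check.
// Early returns are fine here because no barrier follows them.
__device__ bool isHit(float t)
{
    if (t < 0.0f) return false;
    return t <= 1.0f;
}

// Counts hits per block. Both barriers are outside any divergent branch,
// so all threads of the block reach them, including threads with i >= n.
__global__ void countHits(const float *t, int n, int *blockCounts)
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && isHit(t[i])) atomicAdd(&count, 1);
    __syncthreads();

    if (threadIdx.x == 0) blockCounts[blockIdx.x] = count;
}
```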

When you remove that piece of code, the compiler observes that the rest of the function doesn’t modify any global state, and so can optimize it out (it may even optimize out other code dependent on this). This may substantially reduce the running time of the kernel, which again supports the idea of this simply being a kernel timeout.

This sort of “commenting out” debugging is prone to misinterpretation because of compiler optimization. It is easy to misread performance behavior using this method.
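The effect can be illustrated with two small kernels. This is a sketch with hypothetical names (mix, discardResult, storeResult), not the code from this thread: in the first kernel nothing the computation produces is ever observable, so the compiler is free to remove it; in the second the result is written to global memory, so the work has to stay.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for an expensive per-element computation.
__host__ __device__ unsigned mix(unsigned x)
{
    for (int k = 0; k < 1000; ++k) x = x * 1664525u + 1013904223u;
    return x;
}

// The result is never stored, so the compiler may remove the call to
// mix() entirely; timing this kernel says little about the real cost.
__global__ void discardResult(const unsigned *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned r = mix(in[i]);
        (void)r;
    }
}

// Writing the result to global memory makes the work observable, so the
// compiler has to keep it.
__global__ void storeResult(const unsigned *in, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mix(in[i]);
}
```

When comparing variants of a kernel, it is safer to make sure each variant still writes its results somewhere, so that a faster run reflects less work per thread rather than work the compiler has removed.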

So, if I understand correctly, there are two possible solutions:

  1. Increase the timeout. Since I cannot see any effect of changing the tdrDelay Windows registry value on this issue, please share a step-by-step procedure to change the timeout effectively.
  2. Split this function across multiple kernels to reduce the run time per kernel, since it is currently called from a single kernel. How can I ensure that a split that solves the issue on one test data set will not fail again on other data? Which parameter should be checked to ensure this?

Of these two options, which do you think is better? Please also suggest any other option for this issue.

Hello,

Please guide us with a detailed procedure to increase the timeout.

As I have already mentioned, I cannot see any effect of the tdrDelay registry value on my crash issue.

Please help.

Thanks & Regards,
Jaydeep Patel

The exact method to modify the Windows registry may vary by Windows version, and may even vary for a particular Windows version (e.g. Windows 7) depending on your exact Windows install. In general, instructions for an end user to modify the registry directly with a registry editor are quite fragile for this reason.

Therefore the suggestion I offer (which I’ve offered already) is to use the Nsight VSE tool to do this. It is documented. That tool is available as a free download from NVIDIA, as part of the Windows CUDA toolkit installer. If you don’t wish to do that, I don’t have alternate suggestions. There are many examples on the web of modification instructions you can try.

Hey,

Is the issue resolved?
IMO it really is a WDDM TDR timeout problem, but do you have more information about the freezes and crashes?
For example, can you send us the information from the Windows Event Viewer? There should be critical errors from the crashes.

Yes, thanks for your help.

I am now facing a new issue related to CUDA toolkit installation on a new system.

I have already created a thread: https://devtalk.nvidia.com/default/topic/1028411/cuda-setup-and-installation/installtion-error/

Please guide.

Thanks,
Jaydeep Patel