What are the most probable causes of a kernel that starts and then just hangs? I have been debugging it for more than 5 hours… it just hangs…
I am not using any communication via shared memory… all threads work independently, so I don't know why this is happening.
I am using ~350 MB of device memory.
Any probable reason? I would greatly appreciate your input after 5+ hours of debugging :(
I can paste the kernel code here or attach it if someone wants to see it…
Thanks,
NA
Code review is the panacea for all software bugs. Take a printout and review it slowly over a cup of coffee.
It is so tempting to blame the hardware… I do it all the time, only to find that it's a software bug… :-)
Did you use __syncthreads() in your kernel? If so, make sure that every thread in a block calls __syncthreads() exactly the same number of times…
Another cause could be an infinite loop inside your kernel, so remember to check the termination conditions of your loops.
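To illustrate the barrier advice, here is a minimal sketch, assuming a current CUDA toolkit. The kernel name `neighborSum`, the block size `BLOCK`, and the data are made up for illustration and are not from the original poster's code. The point is that the bounds check guards the memory access, while the barrier stays outside any branch so every thread of the block reaches it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size for this sketch

// Hypothetical kernel: each thread adds its own element and its right-hand
// neighbor's element, both read through a shared-memory tile.
__global__ void neighborSum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the memory access, not the barrier: out-of-range threads store 0
    // instead of returning early.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    // Every thread of the block reaches this barrier. Placing it inside
    // "if (i < n)" would let some threads skip it, which can hang the block.
    __syncthreads();

    if (i < n)
        out[i] = tile[threadIdx.x] + tile[(threadIdx.x + 1) % BLOCK];
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of BLOCK
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    neighborSum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);

    // Launches are asynchronous; execution errors surface at the next
    // synchronizing call.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The same check on the return value of the synchronizing call is also a cheap way to notice a kernel that aborted rather than hung.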
As Sarnath suggested, you should review your code. You could also try posting it here.
A fast way, btw, would be to comment out your entire kernel code, check that it runs OK, and then re-enable the kernel lines a few at a time until
you reach the offending code/loop. Probably something is causing a deadlock or an infinite loop.
While doing this, make sure your kernel doesn't get optimized away by the dead-code optimizer.
eyal
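A rough sketch of that bisection idea, assuming a current CUDA toolkit. Instead of commenting lines in and out, this variant uses a hypothetical runtime switch `stages` to enable the kernel body piece by piece; the kernel `bisectKernel` and its placeholder arithmetic are invented for illustration. The final store to `out` is what keeps the compiler from discarding the enabled stages as dead code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used while bisecting: stages are enabled one at a time,
// and the result is always stored so the work cannot be optimized away.
__global__ void bisectKernel(const double *in, double *out, int n, int stages)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double acc = in[i];
    if (stages >= 1)                  // stage 1: placeholder arithmetic
        acc = acc * acc + 1.0;
    if (stages >= 2)                  // stage 2: placeholder bounded loop
        for (int k = 0; k < 13; ++k)
            acc += 0.5 * k;

    out[i] = acc;  // the store keeps the stages above from being eliminated
}

int main()
{
    const int n = 1024, threads = 128;
    double *in, *out;
    cudaMalloc(&in, n * sizeof(double));
    cudaMalloc(&out, n * sizeof(double));
    cudaMemset(in, 0, n * sizeof(double));

    // Enable one more stage per launch; if a stage hangs or faults, the last
    // line printed shows how many stages still completed cleanly.
    for (int stages = 0; stages <= 2; ++stages) {
        bisectKernel<<<n / threads, threads>>>(in, out, n, stages);
        cudaError_t err = cudaDeviceSynchronize();
        printf("stages=%d: %s\n", stages, cudaGetErrorString(err));
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```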
Thanks… for the advice :)… found the bug
Yes, it is finally working, though not very well… I just achieved 12 GFLOPS in double precision, a 6x speedup.
:(
No __syncthreads() in my kernel: there is no inter-thread data communication, so I don't need it. Thanks for the input :)
Yup, that actually helped… the bug was in my code: an access to unallocated shared memory inside a loop. Lame of me… I was hoping for more dramatic performance. I found that I am losing a lot of performance while accessing (reading and writing) device memory. My current algorithm has to access 546 + 42 elements from device memory 12 times PER THREAD :( I guess that is what is killing my application's speed, even though I have (PER THREAD):
40*13 (function evaluations)
+
12*2*42*13 (42-by-13 mat-vec product, done one column after another, 12 times)
+
11*6*6*6 (6-by-6 mat-mul, 11 times)
flops / thread…
THAT'S A LOT OF FLOPS, I KNOW, so I thought I should get a large speedup over the CPU, but it also requires a lot of data transfer per thread.
I guess I have to make the parallelism more fine-grained so that each thread transfers less data.
How much is the kernel launch overhead?
I will try multiple kernel launches to achieve this somehow.
Thanks all for listening to me
NA
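For readers who hit the same kind of bug: the original code was not posted, so the following is only one plausible way an "unallocated shared memory" access can arise, sketched under that assumption. With a dynamically sized `extern __shared__` array, the size comes from the third launch parameter; if that parameter is left out, the array has zero bytes and every access to it is out of bounds. The kernel `scratchSum` and the sizes are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread uses its own private slice of a
// dynamically sized shared-memory array inside a loop.
__global__ void scratchSum(double *out, int n, int perThread)
{
    extern __shared__ double scratch[];   // size is set at launch time
    double *mine = scratch + threadIdx.x * perThread;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double sum = 0.0;
    for (int k = 0; k < perThread; ++k) {
        mine[k] = i + k;                  // stays inside this thread's slice
        sum += mine[k];
    }
    if (i < n) out[i] = sum;
}

int main()
{
    const int n = 256, threads = 64, perThread = 6;
    // Bytes of shared memory per block: one slice per thread.
    size_t shmemBytes = threads * perThread * sizeof(double);

    double *out;
    cudaMalloc(&out, n * sizeof(double));

    // The third launch parameter allocates the dynamic shared memory.
    // Omitting it leaves scratch[] with zero bytes.
    scratchSum<<<n / threads, threads, shmemBytes>>>(out, n, perThread);
    cudaError_t err = cudaDeviceSynchronize();

    double first = 0.0;
    cudaMemcpy(&first, out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("status: %s, out[0] = %f\n", cudaGetErrorString(err), first);

    cudaFree(out);
    return 0;
}
```

No barrier is needed here because no thread reads another thread's slice, which matches the "all threads work independently" setup described in this thread.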
Also, I am using NVCC version 2.0 and the 177.67 driver…
Is there any major benefit in upgrading, in terms of speed and so on (I am aware of the new features but am not using any of them for now)?
It will be some hassle, as I am accessing the machine via a remote connection (for 5 days, until I am back in the US).
Thanks,
NA
Kernel launch overhead is very, very minimal.
Consider queuing kernel launches to reduce this overhead even further.
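A minimal sketch of what queuing launches can look like, assuming a current CUDA toolkit. Kernel launches return to the host immediately, and launches issued to the same stream run in order, so several launches can be queued back to back with a single synchronization at the end. The kernel `addOne` and the count of 12 launches are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fine-grained step: one small piece of work per launch.
__global__ void addOne(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0;
}

int main()
{
    const int n = 1 << 20, threads = 256, launches = 12;
    double *x;
    cudaMalloc(&x, n * sizeof(double));
    cudaMemset(x, 0, n * sizeof(double));

    // Queue all launches without waiting in between. Launches on the same
    // stream execute in issue order, so each step sees the previous result.
    for (int s = 0; s < launches; ++s)
        addOne<<<(n + threads - 1) / threads, threads>>>(x, n);

    // Synchronize once, after the whole queue has been submitted.
    cudaError_t err = cudaDeviceSynchronize();

    double first = 0.0;
    cudaMemcpy(&first, x, sizeof(double), cudaMemcpyDeviceToHost);
    printf("status: %s, x[0] = %f\n", cudaGetErrorString(err), first);

    cudaFree(x);
    return 0;
}
```

Synchronizing after every launch would instead make the host wait for each kernel before submitting the next one.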
Okay, thanks, I will surely do that then…
What about the driver version?
"Also, I am using NVCC version 2.0 and the 177.67 driver…
Is there any major benefit in upgrading, in terms of speed and so on (I am aware of the new features but am not using any of them for now)?"
Thanks…
NA