emu vs debug, different values

I have the following after my kernel call. Is this what you mean? I do not get an error on kernel launch.

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

It’s my mistake: I was trying to track errors in the Release build. If I run the Debug one, I get: “the launch timed out and was terminated”.

One thing I noticed while debugging is that if I change the function code to:

__device__ void GPU_CIRCUIT_KaiStoreDelay()
{
	unsigned int index = blockIdx.x*blockDim.x + threadIdx.x;
	OutputResult[index] = 123456789; // For debug only
	//OutputResult[index] = TD[threadIdx.x]; // TODO: Uncomment
}

I get 6.37592e-315 as a result from the GPU.

When I assigned a much smaller value to OutputResult[index] (around 17.9, which is roughly what TD[threadIdx.x] held), I got zero as the result.

If I further change the value assigned to OutputResult[index] from 123456789 to 999999999999999999, the output from the GPU is 7.73926e-315.

kyprizel - I also tried adjusting the block size and number of blocks, but no luck there either.

I am running in debug mode and I do not get an error on the kernel launch, so I am still kind of stuck here. One thing I am going to try next is to reorder the inputs to the kernel… it shouldn’t matter, but maybe putting the output memory first will have some impact? I am just shooting in the dark at this point.

It shouldn’t matter that I allocate the memory on the device but never copy anything from the host to the device in the first place right? The values are only set on the GPU.

@Pittsburgh:
Since you are using doubles in your code, are you running on hardware that supports double precision? Are you compiling your code with -arch sm_13?

mfatica,

Thanks for the reply. How can I know if I am running on hardware that supports double precision? I see the -arch option specifies the GPU architecture, but what is sm_13?

deviceQuery from the SDK will tell you the compute capability of your card:

Device 0: “GeForce 8600M GT”
Major revision number: 1
Minor revision number: 1

The minor revision number needs to be 3 (i.e. compute capability 1.3) to support double precision.
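If deviceQuery is missing, the same information can be read programmatically. A minimal sketch using the CUDA runtime API (assuming it is compiled with nvcc):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Doubles need compute capability 1.3 or higher.
        printf("Device %d: \"%s\", compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```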

32 threads is the warp size. Threads within a warp are implicitly synchronized. This could suggest that your code lacks an explicit __syncthreads() somewhere that would also synchronize across multiple warps.

I browsed my SDK directory, but I do not see deviceQuery. How do I access this?

Thanks for the help.

I searched the posts and the NVCC guide, and I still do not know what sm_13 is. Can someone please explain how it relates to double precision? Is -arch sm_13 a command-line option I need if I have an 8800 GTS?

Thanks.

I just changed all doubles to float in my device code and it is now returning results close to what I would expect. It looks like the 8800GTS doesn’t support double precision.

The card itself (and all 8800s) doesn’t have double-precision support at the hardware level. You can’t use sm_13.

Thank you very much for your help.
Fixed my problems and ran the code: 4 blocks of 32 threads per block on an NVIDIA 8600 GTS.
I get “too many resources requested for launch” if I try to increase these values.

P.S. The DES implementation I posted here uses >50 registers, so I can’t increase the number of blocks or threads. Now that I know the code works correctly, I can begin optimisations. Thank you again :)

CUT_CHECK_ERROR has GOT to be changed in the SDK to check for errors in Release builds also. It trips many, many people up. It’s a foolish optimization to not check for errors on Release builds, as a matter of principle, but especially since kernels take far longer to run than the error check.

Everyone, please do not use CUT_CHECK_ERROR or go ahead and modify cutil.h.

It really is not a foolish optimization; it just needs an extra macro: CUT_ALWAYS_CHECK_ERROR. Developing code in Release mode is what is foolish.
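A sketch of such an always-on macro, under the assumption that the runtime API of that era is used (cudaThreadSynchronize was the pre-4.0 name of cudaDeviceSynchronize); the name CUT_ALWAYS_CHECK_ERROR is hypothetical:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical CUT_ALWAYS_CHECK_ERROR: active in Release builds too.
// The synchronize forces the (asynchronous) launch to finish so that
// execution errors, not just launch errors, become visible.
#define CUT_ALWAYS_CHECK_ERROR(msg)                                     \
    do {                                                                \
        cudaError_t err = cudaGetLastError();                           \
        if (err == cudaSuccess)                                         \
            err = cudaThreadSynchronize();                              \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "%s: %s in %s, line %d\n",                  \
                    msg, cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)
```

The synchronize makes this expensive, so in production you might want it after every kernel only behind a flag, but at least the cudaGetLastError part costs next to nothing.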

You can run more blocks, but you cannot run more threads per block when you get a “too many resources” error. You can always use more blocks; they get serialized. If you calculate floor(8192/num_registers), you get the maximum number of threads per block for your kernel (or, alternatively, use the occupancy calculator).

Developing on CUDA in Debug mode has zero benefit. You can’t debug, you just slow down your unit tests.

The only thing debug mode does (as far as I know) is enable these tests (CUT_CHECK_ERROR etc.) Does it do anything else also?

What do you all consider “debug” mode? There is -DDEBUG, which enables those macros; there is (for gcc) -g, which adds debugging info and only slows down compilation; and there is the “debug” mode where you disable optimizations. That last one is, in my opinion, a mistake: binaries compiled with gcc -g3 and full optimizations can be debugged just fine in 90% of cases, while disabling optimizations hides a lot of bugs (stack trashing, undefined cases in C, compiler bugs) and generally means you are not testing the thing that will be “shipped” but something else.

8192 reg − 60 reg × 32 thr × 4 bl ≈ 512 registers left.

So I should decrease the number of registers used.