emu vs debug, different values

I have the following after my kernel call. Is this what you mean? I do not get an error on kernel launch.

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

It’s my mistake: I was trying to track errors in the Release build. If I run the Debug one, I get: “the launch timed out and was terminated”.

One thing I noticed while debugging is that if I change the function code to:

__device__ void GPU_CIRCUIT_KaiStoreDelay()
{
	unsigned int index = blockIdx.x*blockDim.x + threadIdx.x;
	OutputResult[index] = 123456789; // For debug only
	//OutputResult[index] = TD[threadIdx.x]; // TODO: Uncomment
}

I get 6.37592e-315 as a result from the GPU.

When I assigned a much smaller value to OutputResult[index] (around 17.9, which is roughly what TD[threadIdx.x] held), I got zero as the result.

If I further change the value assigned to OutputResult[index] from 123456789 to 999999999999999999, the output from the GPU is 7.73926e-315.

kyprizel - I also tried adjusting the block size and number of blocks, but no luck there either.

I am running in debug mode and I do not get an error on the kernel launch, so I am still kind of stuck here. One thing I am going to try next is to reorder the inputs to the kernel… it shouldn’t matter, but maybe putting the output memory first will have some impact? I am just shooting in the dark at this point.

It shouldn’t matter that I allocate the memory on the device but never copy anything from the host to the device in the first place right? The values are only set on the GPU.

@Pittsburgh:
Since you are using doubles in your code, are you running on hardware that supports double precision? Are you compiling your code with -arch sm_13?

mfatica,

Thanks for the reply. How can I know if I am running on hardware that supports double precision? I see the -arch option specifies the GPU architecture, but what is sm_13?

deviceQuery from the SDK will tell you the compute capability of your card:

Device 0: “GeForce 8600M GT”
Major revision number: 1
Minor revision number: 1

The minor revision number needs to be 3 (i.e. compute capability 1.3) to support double precision.
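If deviceQuery is missing, the same information can be read programmatically. A minimal sketch using the CUDA runtime API (assuming it is compiled with nvcc):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Doubles need compute capability 1.3 or higher.
        printf("Device %d: \"%s\", compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```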

32 threads is the warp size. Threads within a warp are implicitly synchronized. This could suggest that your code lacks an explicit __syncthreads() somewhere that would also synchronize across multiple warps.

I browsed my SDK directory, but I do not see deviceQuery. How do I access this?

Thanks for the help.

I searched the posts and the NVCC guide, and I still do not know what sm_13 is. Can someone please explain how it relates to double precision? Is -arch sm_13 a command-line option I need if I have an 8800 GTS?

Thanks.

I just changed all doubles to float in my device code and it is now returning results close to what I would expect. It looks like the 8800GTS doesn’t support double precision.

The card itself (and all 8800s) doesn’t have double-precision support at the hardware level. You can’t use sm_13.

Thank you very much for your help.
Fixed my problems and ran the code: 4 blocks of 32 threads per block on an NVIDIA 8600 GTS.
I get “too many resources requested for launch” if I try to increase these values.

P.S. The DES implementation I posted here uses >50 registers, so I can’t increase the number of blocks or threads. Now that I know the code works correctly, I can begin optimisations. Thank you again :)

CUT_CHECK_ERROR has GOT to be changed in the SDK to check for errors in Release builds also. It trips many, many people up. It’s a foolish optimization to not check for errors on Release builds, as a matter of principle, but especially since kernels take far longer to run than the error check.

Everyone, please do not use CUT_CHECK_ERROR or go ahead and modify cutil.h.

It really is not a foolish optimization; it just needs an extra macro: CUT_ALWAYS_CHECK_ERROR. Developing code in Release mode is what is foolish.
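A sketch of such an always-on macro, under the assumption that the runtime API of that era is used (cudaThreadSynchronize was the pre-4.0 name of cudaDeviceSynchronize); the name CUT_ALWAYS_CHECK_ERROR is hypothetical:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical CUT_ALWAYS_CHECK_ERROR: active in Release builds too.
// The synchronize forces the (asynchronous) launch to finish so that
// execution errors, not just launch errors, become visible.
#define CUT_ALWAYS_CHECK_ERROR(msg)                                     \
    do {                                                                \
        cudaError_t err = cudaGetLastError();                           \
        if (err == cudaSuccess)                                         \
            err = cudaThreadSynchronize();                              \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "%s: %s in %s, line %d\n",                  \
                    msg, cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)
```

The synchronize makes this expensive, so in production you might want it after every kernel only behind a flag, but at least the cudaGetLastError part costs next to nothing.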

You can run more blocks, but you cannot run more threads per block when you get a “too many resources” error. You can always use more blocks; they get serialized. If you calculate floor(8192/num_registers), you get the maximum number of threads per block for your kernel (or, alternatively, use the occupancy calculator).

Developing on CUDA in Debug mode has zero benefit. You can’t debug, you just slow down your unit tests.

The only thing debug mode does (as far as I know) is enable these tests (CUT_CHECK_ERROR etc.) Does it do anything else also?

What do you all consider “debug” mode? There is -DDEBUG, which enables those macros; there is (for gcc) -g, which adds debugging info and only slows down compilation; and there is the “debug” mode where you disable optimizations. That last one is, in my opinion, a mistake: binaries compiled with gcc -g3 and full optimizations can be debugged just fine in 90% of cases, while disabling optimizations hides a lot of bugs (stack trashing, undefined cases in C, compiler bugs) and generally means you are not testing the thing that will be “shipped” but something else.

8192 reg − 60 reg × 32 thr × 4 bl ≈ 512 registers left.

So I should decrease the number of registers used.