Emulation/CPU=correct,Execution/GPU=incorrect emulation

chrismc · September 1, 2008, 10:31am

Not necessarily. At the moment I have only one thread perform one function while the rest wait (I haven’t yet thought about how that function can be parallelised). I thought the lack of syncthreads would show up in the emulation anyway.

__syncthreads synchronises threads in a block, not in a warp, is that correct? But threads are executed in warps, i.e. groups of 32?

So when working with 400 threads/particles in 1 block, if I place a __syncthreads call after a function call in a kernel, no thread should pass that __syncthreads until all other 399 have reached it? So with a maximum of 128 threads executing concurrently on a C870, i.e. 4 warps, not all of the 400 threads are being computed concurrently.
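
The pattern described here (one thread does a serial step, the rest wait at a block-wide barrier) might look roughly like the sketch below. The kernel name `serialThenParallel`, the host helper `runBarrierDemo` and the normalisation step are hypothetical stand-ins, not the SPH code from this thread, and `cudaDeviceSynchronize` assumes a toolkit newer than the one in use in 2008.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 performs a serial step, all other
// threads of the block wait for it at the barrier.
__global__ void serialThenParallel(const float *in, float *out, int n)
{
    __shared__ float total;            // visible to every thread in the block

    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += in[i];
        total = s;
    }

    // Block-wide barrier: no thread continues until every thread of the
    // block has arrived here, however many warps the block contains.
    // It sits outside the if so that all threads reach it.
    __syncthreads();

    int i = threadIdx.x;
    if (i < n) out[i] = in[i] / total; // safe: total is written by now
}

// Hypothetical host driver: one block of 400 threads, as in the post.
// Returns 0 on success, -1 on a CUDA error, 1 on a wrong result.
int runBarrierDemo()
{
    const int n = 400;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    serialThenParallel<<<1, n>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // Every input is 1, so every output should be 1/400.
    for (int i = 0; i < n; ++i)
        if (fabsf(h_out[i] - 1.0f / n) > 1e-6f) return 1;
    return 0;
}
```

The barrier is deliberately placed outside the `if`: every thread of the block has to reach the same `__syncthreads` call, otherwise the behaviour is undefined.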

I’ll try multiples of warps on both machines. I expect something interesting to happen with 5 warps.

Results from both the parallel machine and the C870 agree up to and including 6 warps, but from 7 warps onwards the results from the C870 show unexpected behaviour while the parallel machine results are as expected.

Plus there is an unexpected performance gain at warps>=7, in that the execution time drops dramatically from seconds to fractions of a millisecond.

Any suggestions?

E.D_Riedijk · September 1, 2008, 12:49pm

Check for errors after your kernel launch; you probably have too many registers for 7 warps per block. I think you are hitting a "too many resources requested" error.
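
One way to test this suspicion is to ask the runtime how many registers the kernel uses and how large a block it can launch with. The sketch below assumes the runtime call `cudaFuncGetAttributes`; `placeholderKernel` and `blockSizeFits` are hypothetical names standing in for the real SPH kernel and its launch code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void placeholderKernel(float *data)
{
    data[threadIdx.x] *= 2.0f;
}

// Returns true if a block of threadsPerBlock threads is within the limit
// the runtime reports for this kernel on the current device.
bool blockSizeFits(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, placeholderKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // numRegs is the register count per thread; maxThreadsPerBlock is the
    // largest block this particular kernel can be launched with.
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    return threadsPerBlock <= attr.maxThreadsPerBlock;
}
```

For the case in this thread, `blockSizeFits(7 * 32)` would check the 224-thread configuration. Compiling with `nvcc --ptxas-options=-v` also prints the per-thread register count at build time.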

chrismc · September 1, 2008, 12:58pm

I’m beginning to suspect it could be something like that.

How do I check for the error you mention? From the example projects, I have this in my code:

////////////////////////////////////
// Launch the device computation
SPH<<<dimGrid, dimBlock>>>(/* variables */);
CUT_CHECK_ERROR("kernel error");
////////////////////////////////////

Do I need to write something else?

theMarix · September 1, 2008, 2:37pm

Try calling cudaThreadSynchronize before checking for the error value. (Actually, cudaThreadSynchronize itself will report the error if there is one.)
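
Without the cutil macros, the check suggested here can be sketched as follows. `dummyKernel` and `launchAndCheck` are hypothetical names, and the sketch assumes a newer toolkit in which `cudaDeviceSynchronize` has replaced the deprecated `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void dummyKernel(int *flag)
{
    if (threadIdx.x == 0) *flag = 1;
}

// Launches one block of threadsPerBlock threads and returns the first
// error encountered, or cudaSuccess.
cudaError_t launchAndCheck(int threadsPerBlock)
{
    int *d_flag = NULL;
    cudaError_t err = cudaMalloc(&d_flag, sizeof(int));
    if (err != cudaSuccess) return err;

    dummyKernel<<<1, threadsPerBlock>>>(d_flag);

    // Launch-time failures (invalid configuration, too many resources
    // requested) are reported here, right after the launch.
    err = cudaGetLastError();

    // Kernel launches are asynchronous, so failures during execution
    // only surface once the host waits for the device.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        printf("kernel error: %s\n", cudaGetErrorString(err));

    cudaFree(d_flag);
    return err;
}
```

A launch that fails this way never runs the kernel at all, which would be consistent with the execution time dropping to a fraction of a millisecond.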

cbuchner1 · September 1, 2008, 9:57pm

I just tripped over the fact that __sinf only works within [-pi, pi] whereas in Emu mode it works correctly at all times. Similarly with __cosf, I suppose. This screwed over my Hough Transform. ;)

alex_dubinsky · September 1, 2008, 10:14pm

Don’t forget to run in Debug mode. Check the custom build step of the sample in case anything doesn’t work.

Sylvain_Collange · September 2, 2008, 5:01pm

More precisely, the absolute error of __sinf and __cosf is only guaranteed within [-pi, pi]. Outside this range, it will be less accurate, but still usable for most purposes (well, maybe not yours!)

You can think of __sinf(x) as behaving much like sinf(fmodf(x, (float)TWO_PI)). Only for very large inputs will it give a completely wrong answer in absolute terms.

But even inside [-pi, pi], CPU implementations can be much more accurate than __sinf and __cosf, if accurate means a smaller relative error (think __sinf(x) when x is very close to 0).
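
To see the difference on a given card, the two functions can be evaluated side by side. This is a minimal sketch; the kernel `compareSin`, the helper `runSinComparison` and the four sample inputs are illustrative assumptions, and no particular output is implied.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Evaluates the fast intrinsic and the regular single-precision sine
// on the same inputs.
__global__ void compareSin(const float *x, float *fast, float *accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __sinf(x[i]);   // hardware intrinsic
        accurate[i] = sinf(x[i]);     // library implementation
    }
}

// Hypothetical host driver. Returns 0 on success, -1 on a CUDA error.
int runSinComparison()
{
    const int n = 4;
    // Illustrative inputs: near zero, inside [-pi, pi], and far outside it.
    const float h_x[n] = {1.0e-3f, 1.0f, 100.0f, 1.0e6f};
    float h_fast[n], h_accurate[n];
    const size_t bytes = n * sizeof(float);

    float *d_x = NULL, *d_fast = NULL, *d_accurate = NULL;
    cudaError_t err = cudaMalloc(&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_fast, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_accurate, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        compareSin<<<1, n>>>(d_x, d_fast, d_accurate, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_fast, d_fast, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_accurate, d_accurate, bytes, cudaMemcpyDeviceToHost);

    // cudaFree accepts NULL, so this is safe after a partial failure.
    cudaFree(d_x);
    cudaFree(d_fast);
    cudaFree(d_accurate);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    for (int i = 0; i < n; ++i)
        printf("x = %g  __sinf = %.8f  sinf = %.8f  |diff| = %g\n",
               h_x[i], h_fast[i], h_accurate[i],
               fabsf(h_fast[i] - h_accurate[i]));
    return 0;
}
```

Note that building with nvcc's `-use_fast_math` option maps `sinf` to `__sinf`, so the comparison is only meaningful without that flag.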