cudaThreadSynchronize vs. cudaDeviceSynchronize what is the difference?

What is the difference between cudaThreadSynchronize and cudaDeviceSynchronize? It seems like a lot of example programs use cudaThreadSynchronize. But recent NVIDIA documentation says

So, I used cudaDeviceSynchronize in my program. After seeing that it crashes, I switched to cudaThreadSynchronize and now it doesn’t crash.

Is anybody here less befuddled than I?

post some source?

Don’t know if this much of a snippet will shed much light, but here goes:

float maxVal = 0;
size_t index = cublasIsamax( arrayLen*nMagArrays, amps, 1 ) - 1;
GpuToHost( amps+index, &maxVal, 1 );
cudaDeviceSynchronize(); // <- the questionable line

GpuToHost is a function I wrote, which goes like this:

inline void GpuToHost( const float * devPtr, float * hostPtr, size_t len )
{
	cudaError_t errCode = cudaMemcpy( hostPtr, devPtr, len*sizeof(float), cudaMemcpyDeviceToHost );
	if( errCode == cudaErrorInvalidDevicePointer )
		throw GpuException( "cudaErrorInvalidDevicePointer error in GpuToHost" );
	else if( errCode != cudaSuccess )
		throw GpuException( "error in GpuToHost" );
}
Anyway, with no explicit synchronization line, I get occasional bad results. With cudaThreadSynchronize, I get reliable results. With cudaDeviceSynchronize, it crashes later (eventually I get an error from GpuToHost()).
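One thing worth checking when the failure moves around like this: kernel launches are asynchronous, so an error from an earlier kernel often only surfaces at the next synchronization point, which can make the sync call (or a later cudaMemcpy) look like the culprit. A minimal sketch of checking the error code returned by the sync call itself; the helper name syncAndCheck is made up for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: synchronize and report any error that a preceding
// asynchronous operation (e.g. a kernel launch) deferred until now.
inline void syncAndCheck( const char * where )
{
    cudaError_t errCode = cudaDeviceSynchronize();
    if( errCode != cudaSuccess )
    {
        fprintf( stderr, "CUDA error at %s: %s\n",
                 where, cudaGetErrorString( errCode ) );
        // ... throw or abort here, as appropriate ...
    }
}
```

Dropping a call like syncAndCheck("after kernel X") between steps can narrow down which operation actually produced the error, rather than which call happened to report it.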

This is all with toolkit 4.0 and a recent driver installed, using a GTX 560 Ti. On other machines, with older cards, toolkits, and drivers, I get along fine without any explicit sync.

I guess this was a tougher question than I thought. Setting aside the issue of why my program crashed, does anybody know what the intended difference between these two CUDA calls is? The documentation is not clear on this point.

Also, I have seen pages that say that cudaMemcpy from device to host is always synchronous. Yet, I think I’ve seen other pages that say it may be asynchronous below a certain transfer size. Which one is true?

Doesn’t the programming guide say that cudaMemcpy of sizes below 64K are asynchronous by default?

DtoH cudaMemcpys are always synchronous. HtoD cudaMemcpys will return once the source buffer can be modified without impacting the copy.
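To illustrate the flip side: if you actually want a device-to-host copy that does not block the host, the usual pattern is cudaMemcpyAsync with a page-locked (pinned) host buffer and a stream. A sketch under those assumptions (function name and omission of error checks are for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical example: overlap a DtoH copy with host work.
void asyncCopyExample( const float * devPtr, size_t len )
{
    float * pinned = 0;
    cudaStream_t stream;
    cudaStreamCreate( &stream );
    // cudaMemcpyAsync is only truly asynchronous with pinned host memory
    cudaMallocHost( (void**)&pinned, len * sizeof(float) );
    cudaMemcpyAsync( pinned, devPtr, len * sizeof(float),
                     cudaMemcpyDeviceToHost, stream ); // returns immediately
    // ... do unrelated host work here while the copy proceeds ...
    cudaStreamSynchronize( stream ); // now the data in 'pinned' is valid
    cudaFreeHost( pinned );
    cudaStreamDestroy( stream );
}
```

With a pageable host buffer, plain cudaMemcpy in the DtoH direction blocks until the copy completes, consistent with the behavior described above.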

Could NVIDIA make this clearer in the next revision?

The documentation has a “host <-> device” line, and in many common contexts the “<->” sign means “both directions”.

If NVIDIA can’t use words instead of signs, could it at least use the “->” sign?