emu vs debug, different values

I’ve ported an algorithm to CUDA. In DeviceEmu mode it works fine, but when I compile the Release version the results from the kernel are completely different, even in CPU emulation.

I’ve tried writing a simple test program and the result is the same: the value coming back from the kernel is always 0, as if it never changes.

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

#define THREADS_PER_BLOCK 64

typedef unsigned long long long64;
typedef unsigned char u8;
typedef unsigned long u32;

#pragma pack(push, 1)
struct __align__(16) fake_u64
{
	u32 hi;
	u32 lo;
};
#pragma pack(pop)

typedef fake_u64 u64;

__global__ void CUDA_unit(u64 *shedule, u64 *d_result)
{
	u32 idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx == 0) {
		d_result->hi = 1;
		d_result->lo = 2;
	}
	return;
}

int main(void)
{
	u64 *d_result;
	u64 *h_result;
	u64 *d_shedule;
	u64 *h_shedule;
	u64 in;
	u32 shedule_size, num_blocks, i;

	CUT_DEVICE_INIT(NULL, NULL);

	in.hi = 1;
	in.lo = 2;
	shedule_size = 2;

	h_shedule = (u64 *)malloc(sizeof(u64)*shedule_size);
	if (!h_shedule) {
		printf("No memory on host for shedule array\n");
		return 0;
	}

	h_result = (u64 *)malloc(sizeof(u64));
	if (!h_result) {
		printf("No memory on host for result\n");
		return 0;
	}
	h_result->hi = 0;
	h_result->lo = 0;

	CUDA_SAFE_CALL( cudaMalloc((void**) &d_result, sizeof(u64)) );
	CUDA_SAFE_CALL( cudaMemcpy(d_result, h_result, sizeof(u64), cudaMemcpyHostToDevice) );

	for (i=0; i<shedule_size; i++) {
		h_shedule[i] = in;
	}

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_shedule, shedule_size*sizeof(u64)) );
	cudaMemcpy(d_shedule, h_shedule, shedule_size*sizeof(u64), cudaMemcpyHostToDevice);
	free(h_shedule);

	num_blocks = shedule_size / THREADS_PER_BLOCK + (shedule_size % THREADS_PER_BLOCK == 0 ? 0 : 1);

	CUDA_unit<<<num_blocks, shedule_size>>>(d_shedule, (u64 *)d_result);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUT_CHECK_ERROR("CUDA_unit");

	CUDA_SAFE_CALL( cudaMemcpy(h_result, d_result, sizeof(u64), cudaMemcpyDeviceToHost) );

	printf("%08lX:%08lX\n", h_result->hi, h_result->lo);

	CUDA_SAFE_CALL( cudaFree(d_result) );
	CUDA_SAFE_CALL( cudaFree(d_shedule) );
	free(h_result);

	CUT_EXIT(NULL, NULL);
	return 0;
}

In my algorithm only one thread writes to d_result, so after calling cudaThreadSynchronize() d_result should hold either 0 or the written value.

When I copy it to h_result, it’s always 0 in Release mode and correct in DebugEmu mode.

I’ve tried disabling optimisation, disabling loop unrolling, and changing the memory alignment, with no result. Where is my mistake?

P.S. env = VC2005, CUDA 2.0

What does __align__(16) do? You’re malloc’ing left and right, so that alignment requirement can’t possibly be upheld. If the compiler does anything but ignore it, it will cause wrong results.

Forget the pack push/pop nonsense also. Use this:
union u64 {
	unsigned long long whole;
	struct
	{
		u32 lo; // little-endian?
		u32 hi;
	};
};

Also, CUT_CHECK_ERROR should be right after the kernel call, not after synchronize.

Thank you very much for your answer.

__align__ is from the CUDA Programming Guide … my mistake was the alignment size :(

I’ve put the error checking right after the kernel execution; anyway, no errors, but also no result received from the kernel.

h_result is always equal to 0 :(

P.S. Oh, I think I’m quite stupid, because I tried to run the kernel on a box without a CUDA device and expected to get a result from it :(

Anyway, when I run my algorithm on an actual CUDA device, my results are different from the emulator’s.

Is this difference in behaviour documented anywhere?

CUT_CHECK_ERROR only works in Debug builds. It would have reported the lack of a CUDA device. I suggest taking a look at the code that macro calls and making your own macro that doesn’t depend on Debug.
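Something along these lines should work (a rough sketch, not the SDK macro itself; the name MY_CHECK_ERROR is made up):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Works in Release too: asks the runtime directly instead of relying on _DEBUG.
#define MY_CHECK_ERROR(msg)                                              \
	do {                                                                 \
		cudaError_t err = cudaGetLastError();                            \
		if (err != cudaSuccess) {                                        \
			fprintf(stderr, "%s: %s\n", (msg), cudaGetErrorString(err)); \
			exit(EXIT_FAILURE);                                          \
		}                                                                \
	} while (0)

Call it right after the kernel launch, e.g. MY_CHECK_ERROR("CUDA_unit");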

There are many little differences between emulator and device. I don’t think they are documented anywhere, except in discussions on this forum.

After some experiments I found that the problem is in this code:

http://pastebin.com/m46c18a01

It’s the bitslice implementation of the DES algorithm from the distributed.net client.

Nothing special, but it seems some GPU operations don’t give the same results as the CPU ones.

Can you tell me what could go wrong? There are only simple logical operations.

I know that for ints, right-shift can behave differently between the device and the emulator. For floats, all the single-cycle trig etc. functions have nuances that don’t get emulated. XOR, OR, and INV I’d think would be OK, but I never really tested them. I would take out the early returns so the compiler has less on its mind, but they shouldn’t break anything.
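To illustrate the right-shift point (my own example, not from the DES code): right-shifting a negative signed int is implementation-defined in C, so the device and the emulator’s host compiler are not required to agree, whereas unsigned shifts are always well defined:

int s = -8;
unsigned int u = 0xFFFFFFF8u;
int a = s >> 1;           // implementation-defined: arithmetic or logical shift
unsigned int b = u >> 1;  // always a logical shift: 0x7FFFFFFC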

I can’t really see anything wrong with your code, but there’s also parts of it missing (like the calling function).

Anyway, there is a straightforward procedure for debugging this. Comment out most of your code; hopefully the results will become the same. Uncomment until you see something diverge. Keep in mind that uncommented code might still be removed by the compiler if its results aren’t used. I’d start by commenting out lines 698-872 and comparing. Then uncomment line 871, then 698.
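As a sketch of the “keep it used” point (the kernel and names here are made up): store whatever intermediate you are currently testing into the output buffer, so that dead-code elimination cannot silently drop the code that computes it:

__global__ void probe_kernel(unsigned int *out, unsigned int x)
{
	unsigned int t = x ^ 0x5A5A5A5Au;   // intermediate under test
	// ... the still-commented-out part of the real kernel would go here ...
	out[blockIdx.x*blockDim.x + threadIdx.x] = t;   // forces the compiler to keep t
}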

In about 99% of cases using long is a very stupid idea: it can be either 32 or 64 bits, and I am not really sure the NVIDIA compiler and the host compiler always agree on which it is.

I’d suggest including stdint.h and using either uint32_t or uint64_t, depending on which is actually the correct one.

To be honest, I think that code is a particularly horrible implementation even for DES, but that is just an opinion - has anyone actually tested that this is faster on the GPU than the lookup-table based approach? Note that at least before the GTX2xx generation, GPUs were 32-bit-only systems, and even the new ones have only a handful of 64-bit units.

Hmm, that’s something I didn’t know. On Windows “long int” is always 32 bits, but on Linux it seems it can be 64 bits on x64. However, I’m not sure including stdint.h is the solution, since that targets the host platform, not the GPU. What is nvcc’s behavior in all this?

To get a consistent 64-bit int on all platforms, use ‘long long’. To get a consistent 32-bit int, ‘int’ will work on all the ILP32, LP64, and LLP64 conventions (researching all this, it seems no one uses 64-bit ints, i.e. ILP64, except a few old Crays). Again, I have a feeling this is a better solution than surrendering yourself to the whims of stdint.h (which of course uses some unknown permutation of ‘long’ and ‘int’ internally).
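A minimal sketch of that convention (the typedef names here are mine, not from the code above):

// 'int' is 32 bits and 'long long' is 64 bits under ILP32, LP64, and LLP64 alike,
// so typedefs built on them mean the same thing to the host compiler and to nvcc.
typedef unsigned int       uint32;
typedef unsigned long long uint64;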

NVIDIA should really think through these host/device platform differences and provide a full solution. Stuff now is either broken or hacked. E.g., the GPU uses 64 bits for all its pointers on 64-bit systems for compatibility, needlessly wasting memory and registers. There should be GPU datatypes and a GPU sizeof() etc. that can be used in host code (which has to allocate the memory).

Btw, pre-GT200 chips can do 64-bit ints just fine (it’s part of the PTX spec; I think the chips just issue multiple ops). GT200’s 64-bit floats are not relevant to a discussion of integers (and I think GT200 still uses tricks).

In CUDA, we generally guarantee* that for any basic type like that host sizeof = device sizeof (which is why you can move pointers from host to device and vice-versa). I’m not really sure why you think this is a problem, Alex, considering it lets you marshal data for copying to the GPU in a very straightforward fashion (which lets you handle irregular datasets). Using GPU datatypes is almost certainly a worse hack; your code probably won’t run all that much faster, and the time to develop it will increase significantly (and debugging it in places will become awful–“oops, I used sizeof() instead of gpu_sizeof()” would take an insane amount of time to debug).

kyprizel, can you give me some input that shows different results on device versus emu (as well as what device you’re testing it on)?

  • there is at least one caveat, see below

Alright, I can see the benefit of such a simplification if you can guarantee that it’ll always just work 100%. But there are a lot of nuances with alignment, double/float, 64-bit longs, etc., that I’m not yet convinced about. Anyway, it’d be very interesting to see if kyprizel’s bug is in any way related.

As I often do, I spoke slightly too soon! Double/float on non-SM1.3 hardware is the only problem I know of besides structs and potentially different alignment between nvcc and the host compiler (though I haven’t run into that myself and don’t know much about the problem). So, longs are not a problem, and if you don’t try to use doubles except on GT200, you won’t have a problem there either. We are working on the float/double problem to make it behave more sensibly in the future, though.

(Double/float problem is: arrays of doubles behave strangely on non-DP hardware.)

What’s the problem with structs?

There have been claims on here that there have been problems with alignment of structs. However, I’ve used structs extensively and never encountered a problem, I’ve never seen a bug repro case with a struct alignment problem, and as far as I know all of the problems reported on the boards have been due to other issues in the code. So I don’t think the claims are accurate in general; I just didn’t want to swear up and down that sizeof(struct) on the host is absolutely always equal to sizeof(struct) on the device when the compiler team has not explicitly told me that there isn’t some bizarre corner case involving gcc attributes that doesn’t work, or something equally ridiculous.

(I was explicitly told about sizeof(long), though, and I know pointers work as well. So, with the one exception with doubles/floats on non-SM1.3 hardware, I honestly don’t know of any other issues. If you find any, let us know.)

Is CUDA’s own align attribute supported?

Supported by what? nvcc? Yes.

By sizeof(), by the host, by cudaMemcpy.

E.g.:

struct __align__(16) point { int a[3]; };
struct myData { point hostArray[100]; };
int a = offsetof( struct myData, hostArray[1].a[0] );

EDIT: yes, that works. cool.
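For reference, a self-contained version of that check might look like this (just a sketch; it has to go through nvcc so that __align__ expands correctly, and the expected values are what I’d anticipate from __align__(16), not a guarantee):

#include <cstdio>
#include <cstddef>

struct __align__(16) point { int a[3]; };   // 12 bytes of data, padded out to 16
struct myData { point hostArray[100]; };

int main()
{
	printf("sizeof(point) = %u\n", (unsigned)sizeof(point));                        // expect 16
	printf("offset        = %u\n", (unsigned)offsetof(myData, hostArray[1].a[0]));  // expect 16
	return 0;
}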

Yes, that is exactly why I am very suspicious about the code; it seems to me that it will not work exactly as expected on at least some platform (personally, I would assume it does not work on x64 Windows).

It is also why I would start with a non-optimized DES: the algorithm is obfuscated enough without optimization, the optimizations are not very likely to work well on the GPU anyway, and it is easier to start from a simple, working implementation.

Also, some people have claimed that the compiler might produce wrong code with 64-bit instructions, but we’re missing information about which compiler is used and whether 32- or 64-bit code is generated, so we can’t tell if this can even be the reason.

kyprizel,

Did you by any chance ever figure out the cause of the problem? I am having the exact same problem as you: EmuDebug shows correct results for my output, but actually running in Debug/Release gives all zeros.

Here are portions of my code where the kernel call occurs:

.
.
unsigned int NUM_THREADS = BLOCK_SIZE*NUM_BLOCKS;
.
.
// Allocate memory on the host to store results from the GPU
unsigned int OutputResult_Size = NUM_THREADS*sizeof(double);
double* h_OutputResult = (double*) malloc(OutputResult_Size);

// Allocate memory on the device to store delay results
double* d_OutputResult;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_OutputResult, OutputResult_Size));
.
.
DefectSim_kernel<<< NUM_BLOCKS, BLOCK_SIZE >>> (d_SeedArray, d_Circuit, d_Circuit_Netlist, d_POlist, d_Thread_Netlist, d_Circuit_DelayLibrary, d_DelayVariable, d_CurrentSample, d_InterFlag, d_InterVariable, d_IDs_Array, d_Queue, d_QueuePtrs, d__KaiPreSampledDelay, d_num_double_values, d_Pattern, d__KaiPttns, d_Thread_inlist, d_OutputResult);
.
.
// Copy the results from the device to the host
CUDA_SAFE_CALL(cudaMemcpy(h_OutputResult, d_OutputResult, OutputResult_Size, cudaMemcpyDeviceToHost));

// Print the delay results
for(int i=0; i < NUM_THREADS; i++)
{
	cout << "# " << h_OutputResult[i] << " " << i << endl;
}
.
.

Here are portions of the kernel code:

__shared__ double TD[BLOCK_SIZE];
__device__ double* OutputResult;

__device__ void GPU_CIRCUIT_KaiStoreDelay()
{
	unsigned int index = blockIdx.x*blockDim.x + threadIdx.x;
	OutputResult[index] = TD[threadIdx.x];
}

__global__ void DefectSim_kernel(unsigned short* g_SeedArray, GPU_CIRCUIT* g_Circuit, GPU_GATE* g_Circuit_Netlist, GPU_GATE** g_POlist, GPU_GATE* g_Thread_Netlist, double* g_Circuit_DelayLibrary, double* g_DelayVariable, double* g_CurrentSample, bool* g_InterFlag, GPU_NRANDOM* g_InterVariable, unsigned int* g_IDs_Array, GPU_GATE** g_Queue, GPU_GATE*** g_QueuePtrs, double* g__KaiPreSampledDelay, unsigned int* g_num_double_values, GPU_PATTERN* g_Pattern, int* g__KaiPttns, GPU_GATE** g_Thread_inlist, double* g_OutputResult)
{
.
.
	// Set OutputResult pointer
	OutputResult = g_OutputResult;
.
.
	// TD[threadIdx.x] is set somewhere in the device code prior to the function call below
.
.
	GPU_CIRCUIT_KaiStoreDelay();
.
.
}

Are you checking for errors after kernel calls, like in the SDK samples?
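For example, something like this (a rough sketch of the SDK-style checks, using the names from your snippet): if the larger configuration runs out of registers or shared memory per block, the launch itself fails, the output stays zeroed, and only an explicit check will tell you:

DefectSim_kernel<<< NUM_BLOCKS, BLOCK_SIZE >>> (/* ... same arguments as above ... */);
cudaError_t err = cudaGetLastError();    // launch/configuration errors show up here
if (err != cudaSuccess)
	fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();           // execution errors show up here
if (err != cudaSuccess)
	fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));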

I found out that the code works OK with a small number of threads (32) and blocks (1).
I thought the code would just run slowly, but if I increase the number of threads or blocks, it doesn't work at all.