Program works on Macs, not Windows XP/Vista

Hello,
I have a CUDA program that runs fine and returns correct results on a Mac and a MacBook Pro with Leopard, but when the same computers are booted into Windows (XP or Vista), it no longer returns the correct results. Emulation mode, however, works in all cases, making it difficult to track down the problem. On Windows, I am using CUDA v2.1 with Visual Studio 2008 and version 181.22 of the NVIDIA drivers (for an 8800 GT card); I also tried 181.20 and 180.48. There are no CUDA errors I can find: there are error checks after every CUDA-related call in the program, including memory allocation, data transfer to and from the card, and the kernel call itself. The code itself is pretty extensive, and since it IS working perfectly on the Mac I don't think uploading it would necessarily help, but I can if it would.
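To be concrete about the error checking, every CUDA call is wrapped in roughly the following pattern (this is just illustrative, not my exact code; the helper and variable names are placeholders):

[codebox]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
static void checkCuda(cudaError_t err, const char *msg)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage, e.g.:
//   checkCuda(cudaMalloc((void**)&d_data, bytes), "cudaMalloc");
//   checkCuda(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
//   my_kernel<<<grid, block>>>(d_data);              // placeholder kernel
//   checkCuda(cudaGetLastError(), "kernel launch");
//   checkCuda(cudaMemcpy(h_out, d_out, outBytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
[/codebox]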

So, any suggestions? Is there a reason device mode would behave differently on the Mac than on Windows, aside from the non-deterministic order in which threads are scheduled? The SDK tests that I tried worked, even after recompiling them, so I’m at a bit of a loss at this point.

Thanks!
Adam

What results are you expecting? What is the difference between what Mac returns and Windows returns?

What compiler are you using on Mac? What compiler on Windows? What are the build settings for each?

How many times have you repeated your test? Have you checked for a memory leak?

Well, to get into the details, the CUDA kernel more or less returns the coefficients of a filter whose frequency response is an estimate of the power in the data (this is an auto-regressive model using the Burg algorithm). All that matters is that we should be getting the coefficients; for a 16th-order model, we get back 17 coefficients, a 1.0 followed by 16 others. On the Mac in device and emulation modes, and on Windows in emulation mode, the correct coefficients are returned (e.g., 1.0 -0.78 1.95 …). In Windows device mode, it returns a 1 followed by all 0s (1.0 0.0 0.0 0.0 …).

On the Mac, I am using g++; on Windows, it is Visual Studio 2008 (C++). Neither configuration has any optimizations enabled. When I use emulation, I just add -deviceemu for VS and g++, and add -g for debugging each. (Debugging WITHOUT emulation returns bad results as well.) As far as I can tell, there are no memory leaks, and the correct amount of shared memory is being declared (as floats, not doubles), etc. The test itself is semi-random (a sine wave with added random noise), but regardless of the noise in the test, the bad coefficients are returned.
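In other words, the builds amount to roughly this (the file names are placeholders; on Windows the same flags go through the VS custom build rule):

[codebox]
# Illustrative nvcc invocations only
nvcc -o burg burg.cu                  # device build
nvcc -deviceemu -o burg burg.cu       # emulation build
nvcc -deviceemu -g -o burg burg.cu    # emulation + debug build
[/codebox]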

Thanks for the help! Hopefully someone thinks of something, since this is driving me crazy.

Adam

and what does cudaThreadSynchronize() return? Have you run it through valgrind with -deviceemu?
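i.e., check its return value right after the kernel launch, something along these lines (the kernel name, arguments, and launch configuration are placeholders):

[codebox]
// Placeholder kernel and launch configuration -- the point is what the sync call reports.
my_kernel<<<grid, block>>>(d_data, len);
cudaError_t err = cudaThreadSynchronize();   // blocks until the kernel has finished
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
[/codebox]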

valgrind? Am I missing something? I thought valgrind was a Linux-only tool, and the target operating systems here are Mac OS X and Windows.

From: http://valgrind.org/

“It runs on the following platforms: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux.”

To provide an additional data point, I would try compiling a release build on Windows, with and without emulation. Using debug builds can mask certain errors, in particular stack corruption (Coding Relic: Premature Optimization for Fun and Profit). If the code gives correct results with a release build, that would be an interesting data point.

The fact that you get 0s tends to indicate that something is fundamentally wrong. Perhaps some code that you think is executing is not actually executing, leaving the output initialized to 0.

How are you getting “semi random” data? Is this random data an input? Can you use a known input so that the answer is the same every time the code is executed?

Good luck getting to the bottom of this.

For #1 - I am not running it on Linux, so stock valgrind is not available to me… but a pretty functional Mac version is available. There seemed to be a few issues with it around the cudaMalloc function, where it reported errors (I think this is reported elsewhere on this forum as well…). Regardless, there were no memory leaks.

For #2 - I have compiled in every mode possible on Windows: release+device, release+emulation, debug+emulation, debug+device. It works whenever emulation is used; it does not when the device is used. It works regardless on the Mac.

My semi-random data is just a segment of a sine wave with a small amount of random noise added to it. The results are the same regardless of whether the noise is there or not (that is, reasonable coefficients are returned in emulation mode but not device mode, though they are slightly different each time).
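For reference, the test input is generated roughly like this (the amplitudes, frequency, and function name are placeholders, not my exact code):

[codebox]
#include <cmath>
#include <cstdlib>

// Illustrative test-signal generation: a sine-wave segment plus small random noise.
void make_test_signal(float *x, int n, bool addNoise)
{
    for (int i = 0; i < n; ++i) {
        float noise = addNoise ? 0.01f * ((float)rand() / RAND_MAX - 0.5f) : 0.0f;
        x[i] = sinf(0.1f * (float)i) + noise;
    }
}
[/codebox]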

OK, I think it is solved. I changed the way I access dynamically allocated shared memory so that it exactly mimics what is in the programming guide on page 22. Here is what I had:

[codebox]
/* shared mem for a block is allocated as 3*len*sizeof(float) */
extern __shared__ float array[];

__global__ void function_kernel(int len)
{
    int sharedOffset = 0;

    float *a = (float*)array;

    sharedOffset += len;
    float *b = (float*)&array[sharedOffset];

    sharedOffset += len;
    float *c = (float*)&array[sharedOffset];
}
[/codebox]

Here is what I changed it to, a la the guide:

[codebox]
/* shared mem for a block is allocated as 3*len*sizeof(float) */
extern __shared__ float array[];

__global__ void function_kernel(int len)
{
    float *a = (float*)array;
    float *b = (float*)&a[len];
    float *c = (float*)&b[len];
}
[/codebox]

What I don’t understand is why both versions work on the Mac, but only the second works on Windows. I’m going to investigate a little more, and maybe someone here will catch something I missed.
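For completeness, the kernel is launched with the dynamic shared-memory size from the comment in the code above, roughly like this (the grid/block dimensions are placeholders):

[codebox]
// Reserve 3*len*sizeof(float) bytes of dynamic shared memory per block.
size_t sharedBytes = 3 * len * sizeof(float);
function_kernel<<<dimGrid, dimBlock, sharedBytes>>>(len);
[/codebox]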

Thanks for the help!


Not entirely sure; I have just tested my theories on some of my code and couldn’t reproduce that error (using CUDA 2.1 now, though), but here are a couple of things that seemed to cause a similar issue in some code I wrote a while ago (CUDA 1.1):

  1. declare all variables at the start of the method.

  2. use var = var + x; instead of var += x; (no idea why, but that did seem to fix one problem I had) — something like the sketch below.
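A contrived sketch of what I mean (not taken from your code):

[codebox]
__global__ void example_kernel(float *out, int len)
{
    // 1. declare all variables at the start of the kernel
    int i;
    float acc;

    acc = 0.0f;
    for (i = 0; i < len; ++i) {
        acc = acc + out[i];   // 2. written out instead of acc += out[i]
    }
    out[0] = acc;
}
[/codebox]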

The problem you have reported is quite strange. It should ideally work the same on Windows, Linux, and Mac.

I have a similar problem: my program works on Mac but not on Linux. Quite strange, especially since I do not even use any shared memory.
[url=“http://forums.nvidia.com/index.php?showtopic=89775”]http://forums.nvidia.com/index.php?showtopic=89775[/url]

Have you been able to track down the root cause of this issue? For example, is it possible to examine the generated assembly code to see how the two versions were compiled differently? Do you think it is a limitation of NVIDIA’s compiler? Has the problem happened again since you made your code change?