Another "results different from emulation and GPU" Data-parallel, reading and writing to g

Sorry for another post about how things work in emulation (almost) and don’t work on the GPU.

I’m using Windows XP, Visual Studio 2005, CUDA SDK and CUDA Toolkit 2.1, and a Quadro FX 4600.

My kernel uses doubles, but as I understand it, they are supposed to be converted to floats when the nvcc compiler goes away to work on the source code for about 15 minutes.

These this is the nvcc compiler command given for the debug build:

(CUDA_BIN_PATH)\nvcc.exe" -Xptxas –v -c -keep -arch=sm_10 -ccbin "(VCInstallDir)bin” -DWIN32 -D_WINDOWS -D_DEBUG -D_USRDLL -D_WINDLL -D_UNICODE -DUNICODE -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd,/RTC1 -I"(CUDA_INC_PATH)" -I.\ -I"(NVSDKCUDA_ROOT)"\common\inc -o (ConfigurationName)\$(InputName).obj (InputFileName)

For each pair of double precision input points, packaged as a double2, the kernel should return from about 30 to 364 double2’s (call that M), and a single unsigned char used as a bitmap. So, for an array of N double2 inputs, I should get an output array of NxM double2’s, and an array of N unsigned chars. All memory is cudaMalloc’ed. There is no intended usage of shared memory, just one read and two writes to global memory for each input. Each input double2 should be processed on its own thread. There is no thread-to-thread communication within a block. Right now, I’m limiting my input to a single double2, and so, I hope 1 single thread is doing the computation. My kernel is being called with <<<(1,1,1),(N,1,1)>>>.

The kernel is supposed to take an array of one or more, perhaps millions, double2’s with a latitude and longitude coordinate. The output is an array of double2’s consisting of a set of latitudes and longitudes that surround the input coordinate but are offset a continuous geodesic distance. The computation is done on a spheroid not a sphere. The kernel has to do a lot of computation to find each point that is at an azimuth and distance along a geodesic away from the input point. One device function is called M times from a for-loop for each input point, and the number of times the function is called varies, and the function itself contains a while-loop. Not a good candidate for loop unrolling. There are also lots of variables allocated “on the stack” (i.e. the CUDA meaning of that). No rogue “malloc” or “calloc” is present in the device code. Much of the device code is source from an existing C (not C++) library that was cut and pasted.

The output from the -Xptxas -v option is:
1>ptxas info : Compiling entry function ‘Z18make_point_buffersdddddllP7double2PhS0
1>ptxas info : Used 58 registers, 452+392 bytes lmem, 76+72 bytes smem, 24 bytes cmem[0], 352 bytes cmem[1]

I’m thinking the smem is where the input parameters are kept, because I don’t move anything from global to shared memory.

This sequence of results for 1 input point from the output array after it was cudaMemcpy’d onto the host repeats over and over to the end of the output array. The output array size was 361. Too bad I can’t print out the output array while it is on the device, before it is cudaMemcpy’d over to the host. I find the repetition interesting - every 16 indexes it starts over.

Index: 0 X: -21990240944127.9960 Y: 0.0000
Index: 1 X: -21990240944127.9960 Y: 0.0000
Index: 2 X: -21990240944127.9960 Y: 0.0000
Index: 3 X: -21990240944127.9960 Y: 0.0005
Index: 4 X: -21990240944127.9960 Y: 107077806593965150000000000000000000000000000000000000000000
000000000000000.0000
Index: 5 X: -21990240944127.9960 Y: 248488731765822670000000000000000000000000000000000000000000
00000000000000000.0000
Index: 6 X: -21990240944127.9960 Y: 576649710375256380000000000000000000000000000000000000000000
00000000000000000000000000000.0000
Index: 7 X: -21990240944127.9960 Y: 133818331358929860000000000000000000000000000000000000000000
0000000000000000000000.0000
Index: 8 X: -21990240944127.9960 Y: -0.0000
Index: 9 X: -21990240944127.9960 Y: -0.0000
Index: 10 X: -21990240944127.9960 Y: -0.0000
Index: 11 X: -21990240944127.9960 Y: -0.1201
Index: 12 X: -21990240944127.9960 Y: -278660051125151400000000000000000000000000000000000000000000
00000000000000000.0000
Index: 13 X: -21990240944127.9960 Y: -646647081108696270000000000000000000000000000000000000000000
0000000000000000000000.0000
Index: 14 X: -21990240944127.9960 Y: -150057648353792230000000000000000000000000000000000000000000
0000000000000000000000000.0000
Index: 15 X: -21990240944127.9960 Y: -1.#QNB

Results from the emulator build are correct. I take this to mean that my indexing into the output array is correct. Unfortunately, it also seems to mean that I wrote (or copied) something that acts differently on the device than it does on the CPU.

If you ever pass doubles to a kernel, you have to compile with -arch sm_13. There is implicit conversion of doubles declared on the device to floats, but there is no conversion of doubles passed to the device (at least not to anything correct).

Well, that might take care of it. Thanks. Hard for me to pick that up from the Programming Guide, but I’ve only read it about 20 times, and it takes a long time for things to sink in.

when you need to use arch sm_13 and when you don’t is not clear at the moment, I agree. it’s something we’ve asked to be improved in 2.3.

I’ve attempted to recompile using -arch_13. This is what I get:

1>ptxas info : Compiling entry function ‘Z18make_point_buffersdddddllP7double2PhS0
1>ptxas fatal : Memory allocation failure

I also get this same error when I try to compile the same code I’ve moved over to a machine with a FX 4800.

Can you post source? I can try with 2.2 or file a bug if it doesn’t work.

I could post my code, and suffer the embarrassment of having someone else see it, but the source code I took from our code base would be another matter. I could probably get the OK to give it to someone at NVIDIA, because I think we have a business relationship with you. There was a person from NVIDIA here a couple weeks ago talking to our systems folks.