Sorry for yet another post about code that (almost) works in emulation but doesn’t work on the GPU.
I’m using Windows XP, Visual Studio 2005, CUDA SDK and CUDA Toolkit 2.1, and a Quadro FX 4600.
My kernel uses doubles, but as I understand it they are supposed to be demoted to floats by nvcc, since I’m compiling for sm_10 (the compile itself takes about 15 minutes).
This is the nvcc compiler command used for the debug build:
"$(CUDA_BIN_PATH)\nvcc.exe" -Xptxas -v -c -keep -arch=sm_10 -ccbin "$(VCInstallDir)bin" -DWIN32 -D_WINDOWS -D_DEBUG -D_USRDLL -D_WINDLL -D_UNICODE -DUNICODE -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd,/RTC1 -I"$(CUDA_INC_PATH)" -I.\ -I"$(NVSDKCUDA_ROOT)"\common\inc -o $(ConfigurationName)$(InputName).obj $(InputFileName)
For each double-precision input point, packaged as a double2, the kernel should return roughly 30 to 364 double2’s (call that M), plus a single unsigned char used as a bitmap. So, for an array of N double2 inputs, I should get an output array of N x M double2’s and an array of N unsigned chars. All memory is cudaMalloc’ed. There is no intended use of shared memory; each input involves just one read from and two writes to global memory. Each input double2 should be processed by its own thread, and there is no thread-to-thread communication within a block. Right now I’m limiting my input to a single double2, so I hope a single thread is doing the computation. My kernel is being called with <<<(1,1,1),(N,1,1)>>>, roughly as in the sketch below.
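To make the setup concrete, here is a minimal sketch of the host side, assuming a simplified kernel signature and made-up names (this is not my exact code; the real kernel takes several more scalar geodesic parameters):

#include <cuda_runtime.h>
#include <cstdio>

// Stub kernel with a simplified signature, standing in for the real one so
// the sketch compiles; the real kernel computes geodesic points.
__global__ void make_point_buffers(double distance, long m,
                                   const double2 *in, unsigned char *bits,
                                   double2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (long k = 0; k < m; ++k)
        out[i * m + k] = in[i];   // stub: just copies the input point
    bits[i] = 0;
    (void)distance;
}

int main()
{
    const int N = 1;    // one input point for now
    const int M = 361;  // double2 outputs per input point

    double2 h_in[N];
    h_in[0] = make_double2(45.0, -93.0);   // example latitude, longitude

    double2 *d_in, *d_out;
    unsigned char *d_bits;
    cudaMalloc((void **)&d_in,   N * sizeof(double2));
    cudaMalloc((void **)&d_out,  N * M * sizeof(double2));
    cudaMalloc((void **)&d_bits, N * sizeof(unsigned char));
    cudaMemcpy(d_in, h_in, N * sizeof(double2), cudaMemcpyHostToDevice);

    dim3 grid(1, 1, 1);
    dim3 block(N, 1, 1);   // one thread per input point
    make_point_buffers<<<grid, block>>>(1000.0, M, d_in, d_bits, d_out);
    cudaThreadSynchronize();   // CUDA 2.1-era sync call
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}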
The kernel is supposed to take an array of anywhere from one to perhaps millions of double2’s, each holding a latitude and longitude coordinate. The output is an array of double2’s consisting of a set of latitudes and longitudes that surround the input coordinate at a constant geodesic distance. The computation is done on a spheroid, not a sphere, so the kernel has to do a lot of computation to find each point that lies at a given azimuth and distance along a geodesic from the input point. For each input point, one device function is called M times from a for-loop; the number of calls varies, and the function itself contains a while-loop, so it is not a good candidate for loop unrolling. There are also lots of variables allocated “on the stack” (i.e., local memory, in the CUDA sense). No rogue “malloc” or “calloc” is present in the device code. Much of the device code was cut and pasted from an existing C (not C++) library. A skeleton of the kernel follows.
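The kernel is structured roughly like this (again a heavily simplified sketch with made-up names; the dummy iteration stands in for the real cut-and-pasted geodesic math):

// Simplified skeleton of the kernel; the real point_at_azimuth is geodesic
// code from a C library and iterates until the solution converges.
__device__ double2 point_at_azimuth(double2 start, double azimuth, double distance)
{
    double2 p = start;
    double err = 1.0;
    while (err > 1e-12) {      // the real while-loop runs a varying number of times
        // ... iterative geodesic solution on the spheroid ...
        err *= 0.5;            // placeholder update so this sketch terminates
    }
    p.x += distance * 0.0 + azimuth * 0.0;   // placeholder result
    return p;
}

__global__ void make_point_buffers(double distance, long m,
                                   const double2 *in, unsigned char *bits,
                                   double2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per input point
    double2 p = in[i];                              // single global-memory read
    for (long k = 0; k < m; ++k) {                  // M calls to the device function
        double az = k * (360.0 / (double)m);
        out[i * m + k] = point_at_azimuth(p, az, distance);  // global-memory write
    }
    bits[i] = 0;                                    // per-point bitmap byte
}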
The output from the -Xptxas -v option is:
1>ptxas info : Compiling entry function ‘_Z18make_point_buffersdddddllP7double2PhS0_’
1>ptxas info : Used 58 registers, 452+392 bytes lmem, 76+72 bytes smem, 24 bytes cmem[0], 352 bytes cmem[1]
I’m thinking the smem is where the input parameters are kept, because I don’t move anything from global to shared memory.
Below is the sequence of results for one input point, taken from the output array after it was cudaMemcpy’d back to the host; it repeats over and over to the end of the output array (size 361). Too bad I can’t print the output array while it is still on the device, before it is copied to the host. I find the repetition interesting: every 16 indexes it starts over.
Index: 0 X: -21990240944127.9960 Y: 0.0000
Index: 1 X: -21990240944127.9960 Y: 0.0000
Index: 2 X: -21990240944127.9960 Y: 0.0000
Index: 3 X: -21990240944127.9960 Y: 0.0005
Index: 4 X: -21990240944127.9960 Y: 107077806593965150000000000000000000000000000000000000000000000000000000000.0000
Index: 5 X: -21990240944127.9960 Y: 24848873176582267000000000000000000000000000000000000000000000000000000000000.0000
Index: 6 X: -21990240944127.9960 Y: 57664971037525638000000000000000000000000000000000000000000000000000000000000000000000000.0000
Index: 7 X: -21990240944127.9960 Y: 1338183313589298600000000000000000000000000000000000000000000000000000000000000000.0000
Index: 8 X: -21990240944127.9960 Y: -0.0000
Index: 9 X: -21990240944127.9960 Y: -0.0000
Index: 10 X: -21990240944127.9960 Y: -0.0000
Index: 11 X: -21990240944127.9960 Y: -0.1201
Index: 12 X: -21990240944127.9960 Y: -27866005112515140000000000000000000000000000000000000000000000000000000000000.0000
Index: 13 X: -21990240944127.9960 Y: -6466470811086962700000000000000000000000000000000000000000000000000000000000000000.0000
Index: 14 X: -21990240944127.9960 Y: -150057648353792230000000000000000000000000000000000000000000000000000000000000000000.0000
Index: 15 X: -21990240944127.9960 Y: -1.#QNB
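(For reference, the listing above comes from a host-side loop along these lines, after copying the output array back; the helper name and arguments are just illustrative, not my exact code:)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Copy 'count' double2 results back from the device and print them in the
// same "Index / X / Y" format as the listing above.
void dump_results(const double2 *d_out, int count)
{
    double2 *h_out = (double2 *)malloc(count * sizeof(double2));
    cudaMemcpy(h_out, d_out, count * sizeof(double2), cudaMemcpyDeviceToHost);
    for (int i = 0; i < count; ++i)
        printf("Index: %d X: %.4f Y: %.4f\n", i, h_out[i].x, h_out[i].y);
    free(h_out);
}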
Results from the emulator build are correct. I take this to mean that my indexing into the output array is correct. Unfortunately, it also seems to mean that I wrote (or copied) something that acts differently on the device than it does on the CPU.