inconsistent results on every run

Hey guys!

I'm fairly new to GPGPU, but I've been learning a lot over the past 2 months. Now I need a little help with the project I'm currently working on.

Basically, I've written some simple kernels that act as "map" functions: they take a matrix of data (my DB) and a target array, compare the two values at the same positions, and write out values depending on that comparison. Now here's the deal: I've been running it on different amounts of data (1st try: the first 100000 entries in the DB, 2nd try: 500000 entries), and my program executes one of the kernels correctly, but the others produce different results from run to run.
I'm synchronizing almost everywhere in my kernels to avoid unexpected race conditions, and I've also adopted shared memory in my kernels, but the results produced are not the ones I'm looking for.
Every kernel runs with its own dimensions and is synchronized after each invocation.

I'm really hoping that someone who has run into the same issue (because my code seems correct to me) could explain it, or at least point me in the right direction for resolving it.

Thanks a bunch!

After running tests in cuda-gdb, it turns out that one of the kernels just doesn't run when called. I can't figure out why, though. Any ideas, community?!

Are you getting any error codes?

printf("\n Error msg: %s", cudaGetErrorString(cudaGetLastError()));

Commonly you get a launch failure if you have configured your kernel launch incorrectly, perhaps with too many threads or blocks, or with each block consuming too many resources. If you post some code snippets I'm sure someone could give you some helpful hints!
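
For example, here's a minimal sketch (standard CUDA runtime API; the kernel and buffer are just placeholders) that checks both the launch error and the execution error of a kernel call:

#include <cstdio>
#include <cuda_runtime.h>

// dummy_kernel and d_out are placeholders, only here to make the snippet self-contained
__global__ void dummy_kernel(float *out) { out[threadIdx.x] = 1.0f; }

int main()
{
    float *d_out;
    cudaMalloc((void**)&d_out, 32 * sizeof(float));

    dummy_kernel<<<1, 32>>>(d_out);

    // Catches configuration/launch errors such as "invalid configuration argument"
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(launch_err));

    // Catches errors that only show up while the kernel is actually running
    cudaError_t exec_err = cudaDeviceSynchronize();
    if (exec_err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(exec_err));

    cudaFree(d_out);
    return 0;
}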

Regards,

Ahh, thanks for the prompt reply!

Indeed, after I checked for the error I got "Error msg: invalid configuration argument" right after the kernel invocation.

I assume the problem is in the dimensions with which I'm calling my kernel? Correct me if I'm wrong.

__global__ void some_action(const int *test,
                            const int *tar,
                            float *out)
{
    unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int elem   = threadIdx.x;
    unsigned int row    = threadIdx.y;
    unsigned int din    = yindex * M + xindex;   // M = width of matrix

    __shared__ int shared_data[M][M+1];          // tile M by M
    __shared__ int shared_tar[M];

    shared_data[row][elem] = test[din];
    __syncthreads();

    shared_tar[elem] = tar[elem];
    __syncthreads();

    if ((elem % M == 0)  ||                      // thread control
        (elem % M == 2)  ||
        (elem % M == 4)  ||
        (elem % M == 6)  ||
        (elem % M == 8)  ||
        (elem % M == 10) ||
        (elem % M == 12) ||
        (elem % M == 14) ||
        (elem % M == 16) ||
        (elem % M == 18) ||
        (elem % M == 20) ||
        (elem % M == 22))
    {
        if ((shared_tar[elem+1] == shared_data[row][elem+1]) && ((shared_tar[elem+1] != 0) || (shared_data[row][elem+1] != 0)))
        {
            // max_c is a __device__ function returning the larger of its two arguments
            out[din] = 400.0f / max_c(shared_tar[elem], shared_data[row][elem]);
        }
        else if (((shared_tar[elem+1] == 20) && (shared_data[row][elem+1] == 30)) ||
                 ((shared_tar[elem+1] == 30) && (shared_data[row][elem+1] == 20)))
        {
            out[din] = 300.0f / max_c(shared_tar[elem], shared_data[row][elem]);
        }
        else if (((shared_tar[elem+1] == 30) && (shared_data[row][elem+1] == 10)) ||
                 ((shared_tar[elem+1] == 10) && (shared_data[row][elem+1] == 30)))
        {
            out[din] = 200.0f / max_c(shared_tar[elem], shared_data[row][elem]);
        }
        else if (((shared_tar[elem+1] == 20) && (shared_data[row][elem+1] == 10)) ||
                 ((shared_tar[elem+1] == 10) && (shared_data[row][elem+1] == 20)))
        {
            out[din] = 100.0f / max_c(shared_tar[elem], shared_data[row][elem]);
        }
        else
        {
            out[din] = 0.0f;
        }
        __syncthreads();
    }
    else
    {
        out[din] = 0.0f;
        __syncthreads();
    }
}

and I'm calling this kernel from the CPU with the following dimensions:

dim3 threads_map(M, M);
dim3 blocks_map(1, rows/threads_map.y);   // rows is a local variable counting the number of rows in the matrix
some_action<<<blocks_map, threads_map>>>(dev_a, dev_b, dev_c);

Hope that information helps somehow.

Thanks one more time, mates!

What are the values of 'M' and rows?

M = 32

rows is dynamic; at the moment it goes up to 572387 entries

OK, what is your GPU model? (I need to know the compute capability.) 1024 threads/block is not supported by older GPUs.

Could you also compile the kernel with the verbose flag on, so we can see the resource usage of that particular kernel? If your register usage is high, that might also explain it.
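
A quick way to see what your card actually allows is to query the device properties. Here is a minimal sketch (standard runtime API, device 0 assumed); a launch whose block or grid exceeds any of these limits comes back as "invalid configuration argument":

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0 assumed

    printf("compute capability  : %d.%d\n", prop.major, prop.minor);
    printf("max threads/block   : %d\n", prop.maxThreadsPerBlock);
    printf("max block dims      : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims       : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared mem per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);

    return 0;
}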

I'm working with a Quadro 600 (supports compute capability 2.1) but compiling everything with -arch=sm_20, since I will use a Tesla once I'm done with this project.

[root@serv-dev02a work]# nvcc -arch=sm_20 -v main_beta4_1.cu -o main

#$ _SPACE_=

#$ _CUDART_=cudart

#$ _HERE_=/opt/cuda/bin

#$ _THERE_=/opt/cuda/bin

#$ _TARGET_SIZE_=64

#$ TOP=/opt/cuda/bin/..

#$ LD_LIBRARY_PATH=/opt/cuda/bin/../lib:/opt/cuda/bin/../extools/lib:

#$ PATH=/opt/cuda/bin/../open64/bin:/opt/cuda/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/cuda/bin:/root/bin

#$ INCLUDES="-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"

#$ LIBRARIES=  "-L/opt/cuda/bin/../lib64" -lcudart

#$ CUDAFE_FLAGS=

#$ OPENCC_FLAGS=

#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=200 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS  -D__CUDACC__ -C  "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -include "cuda_runtime.h" -m64 -o "/tmp/tmpxft_0000384a_00000000-4_main_beta4_1.cpp1.ii" "main_beta4_1.cu"

#$ cudafe --m64 --gnu_version=40102 -tused --no_remove_unneeded_entities  --gen_c_file_name "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.gpu" --include_file_name "/tmp/tmpxft_0000384a_00000000-3_main_beta4_1.fatbin.c" "/tmp/tmpxft_0000384a_00000000-4_main_beta4_1.cpp1.ii"

#$ gcc -D__CUDA_ARCH__=200 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS  -D__CUDACC__ -C  -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -m64 -o "/tmp/tmpxft_0000384a_00000000-5_main_beta4_1.cpp2.i" "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.gpu"

#$ cudafe --m64 --gnu_version=40102 --c  --gen_c_file_name "/tmp/tmpxft_0000384a_00000000-6_main_beta4_1.cudafe2.c" --stub_file_name "/tmp/tmpxft_0000384a_00000000-6_main_beta4_1.cudafe2.stub.c" --gen_device_file_name "/tmp/tmpxft_0000384a_00000000-6_main_beta4_1.cudafe2.gpu" --include_file_name "/tmp/tmpxft_0000384a_00000000-3_main_beta4_1.fatbin.c" "/tmp/tmpxft_0000384a_00000000-5_main_beta4_1.cpp2.i"

#$ gcc -D__CUDA_ARCH__=200 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS  -D__CUDABE__  -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -m64 -o "/tmp/tmpxft_0000384a_00000000-7_main_beta4_1.cpp3.i" "/tmp/tmpxft_0000384a_00000000-6_main_beta4_1.cudafe2.gpu"

#$ filehash -s " " "/tmp/tmpxft_0000384a_00000000-7_main_beta4_1.cpp3.i" > "/tmp/tmpxft_0000384a_00000000-8_main_beta4_1.hash"

#$ gcc -E -x c++ -D__CUDACC__ -C  "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -include "cuda_runtime.h" -m64 -o "/tmp/tmpxft_0000384a_00000000-9_main_beta4_1.cpp4.ii" "main_beta4_1.cu"

#$ cudafe++ --m64 --gnu_version=40102 --parse_templates  --gen_c_file_name "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.cpp" --stub_file_name "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.stub.c" "/tmp/tmpxft_0000384a_00000000-9_main_beta4_1.cpp4.ii"

#$ nvopencc  -TARG:compute_20 -m64 -OPT:ftz=0 -CG:ftz=0 -CG:prec_div=1 -CG:prec_sqrt=1  "/tmp/tmpxft_0000384a_00000000-10_main_beta4_1" "/tmp/tmpxft_0000384a_00000000-7_main_beta4_1.cpp3.i"  -o "/tmp/tmpxft_0000384a_00000000-2_main_beta4_1.ptx"

#$ ptxas  -arch=sm_20 -m64  "/tmp/tmpxft_0000384a_00000000-2_main_beta4_1.ptx"  -o "/tmp/tmpxft_0000384a_00000000-11_main_beta4_1.sm_20.cubin"

#$ fatbinary --create="/tmp/tmpxft_0000384a_00000000-3_main_beta4_1.fatbin" --key="7d134be3e3c8cf0e" --ident="main_beta4_1.cu" -cuda "--image=profile=sm_20,file=/tmp/tmpxft_0000384a_00000000-11_main_beta4_1.sm_20.cubin" "--image=profile=compute_20,file=/tmp/tmpxft_0000384a_00000000-2_main_beta4_1.ptx" --embedded-fatbin="/tmp/tmpxft_0000384a_00000000-3_main_beta4_1.fatbin.c"

#$ rm /tmp/tmpxft_0000384a_00000000-3_main_beta4_1.fatbin

#$ gcc -D__CUDA_ARCH__=200 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS   -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -m64 -o "/tmp/tmpxft_0000384a_00000000-12_main_beta4_1.ii" "/tmp/tmpxft_0000384a_00000000-1_main_beta4_1.cudafe1.cpp"

#$ gcc -c -x c++ "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -fpreprocessed -m64 -o "/tmp/tmpxft_0000384a_00000000-13_main_beta4_1.o" "/tmp/tmpxft_0000384a_00000000-12_main_beta4_1.ii"

#$ g++ -m64 -o "main" -Wl,--start-group "/tmp/tmpxft_0000384a_00000000-13_main_beta4_1.o"   "-L/opt/cuda/bin/../lib64" -lcudart -Wl,--end-group

Here's the verbose output of nvcc at compile time.

One more weird fact that bothers me: no errors pop up while running on a smaller set of data, and the correct result is produced; but when I pass a larger set (matrix), the "invalid configuration argument" error pops up… I want my kernels to run on a data set of arbitrary size (well, at least as large as the compute capability allows).

I tweaked the dimensions so that no error is produced, but the output is still buggy! I'm lost.

Sorry, I meant for you to set the PTXAS_FLAGS option (--ptxas-options=-v), so we can see the ptxas resource-usage info for each kernel.

From your description it sounds like you are invoking your kernel with a grid that is too large… Is the maximum number of Y blocks only int(572387/32) = 17887, or do you in fact invoke it with larger sets?
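
As a side note: if rows is not a multiple of 32, an integer division like rows/32 silently drops the last partial block of rows. Here is a minimal sketch (rows hard-coded just for illustration) of sizing the grid with a ceiling division instead:

#include <cuda_runtime.h>
#define M 32

int main()
{
    int rows = 572387;                                            // example row count
    dim3 threads_map(M, M);
    // Ceiling division: 17888 blocks instead of 17887, so the last 3 rows are covered too.
    dim3 blocks_map(1, (rows + threads_map.y - 1) / threads_map.y);
    // The kernel then needs a guard, e.g. if (yindex >= rows) return;
    // with rows passed in as an extra kernel argument.
    return 0;
}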

OK! So I did pass the --ptxas-options="-v" flag to nvcc, and here's what I got:

[root@serv-dev02a work]# nvcc --ptxas-options="-v" main_beta4_1.cu -o main

ptxas /tmp/tmpxft_000043d9_00000000-2_main_beta4_1.ptx, line 6938; warning : Double is not supported. Demoting to float

ptxas info    : Compiling entry function '_Z10final_stepPKfS0_S0_S0_Pfl' for 'sm_10'

ptxas info    : Used 6 registers, 48+16 bytes smem, 576 bytes cmem[0]

ptxas info    : Compiling entry function '_Z9first_countPKfS0_Pf' for 'sm_10'

ptxas info    : Used 5 registers, 1176+16 bytes smem, 576 bytes cmem[0], 68 bytes cmem[1]

ptxas info    : Compiling entry function '_Z10second_countPKiS0_Pf' for 'sm_10'

ptxas info    : Used 6 registers, 632+16 bytes smem, 576 bytes cmem[0], 16 bytes cmem[1]

ptxas info    : Compiling entry function '_Z20some_action_secondPKiS0_Pf' for 'sm_10'

ptxas info    : Used 5 registers, 1176+16 bytes smem, 576 bytes cmem[0], 40 bytes cmem[1]

ptxas info    : Compiling entry function '_Z20some_actionPKiS0_Pf' for 'sm_10'

ptxas info    : Used 5 registers, 2264+16 bytes smem, 576 bytes cmem[0], 88 bytes cmem[1]

ptxas info    : Compiling entry function '_Z6reducePKfPfj' for 'sm_10'

ptxas info    : Used 6 registers, 4116+16 bytes smem, 576 bytes cmem[0], 4 bytes cmem[1]

So for the incorrectly working kernel, ptxas reports:

Used 5 registers, 2264+16 bytes smem, 576 bytes cmem[0], 88 bytes cmem[1]

I tweaked the block dimensions for the some_action kernel to (32, 16, 1), so my grid size is now (1, 35774, 1). That should fit the compute capability, since it supports up to 65535 Y thread blocks, so I guess it runs. I've also checked all the intermediate steps, such as what data I'm passing, and it seems I'm doing everything right (i.e. passing the right data).

Ahh, I eventually solved the mind-teasing problem by breaking the original matrix passed to the kernel into two smaller matrices (at most 16 elements wide, run with 16 x-threads per block) and passing them to my kernel. I haven't lost any computing power, since the difference in run times is small, and I'm now getting the correct results.
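
For anyone reading this later, here is a rough sketch of what such a split launch might look like (not my exact code; some_action_half and the dev_* names are placeholders, and I'm assuming 16 x 16 blocks):

#include <cuda_runtime.h>
#define HALF 16                                   // width of each half-matrix

// Hypothetical half-width variant of the original kernel; the comparison
// logic would be the same as before, just limited to HALF columns.
__global__ void some_action_half(const int *test, const int *tar, float *out)
{
    // ... same body as some_action, with M replaced by HALF ...
}

void launch_halves(const int *dev_left, const int *dev_right, const int *dev_tar,
                   float *dev_out_left, float *dev_out_right, int rows)
{
    dim3 threads_half(HALF, HALF);                // 16 x 16 = 256 threads per block
    dim3 blocks_half(1, (rows + threads_half.y - 1) / threads_half.y);

    some_action_half<<<blocks_half, threads_half>>>(dev_left,  dev_tar, dev_out_left);
    some_action_half<<<blocks_half, threads_half>>>(dev_right, dev_tar, dev_out_right);
}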

Anyway, thanks to Jimmy Pettersson for his support in solving this issue!
Stay tuned, CUDA developers! :)

Ah, great! I'm still curious as to what the real error was, since you didn't seem to be exceeding the max grid dimensions. I'm sure it's right under our noses :)

Jimmy, I wish I knew. I was looking for an explanation but couldn't find one. Well, it's working and producing the same results on the CPU and GPU… good enough for me :)
Anyway, thanks!!