Non-deterministic CUDA code in release mode

Hello,
I am getting non-deterministic output from my CUDA code.

I have written some CUDA code and tested it in debug mode (with the -g -G flags). Running the same code many times in a loop, I verified that the output is deterministic: it is always the same.
So I assume the code works correctly, even if it is naturally a bit slow.
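For reference, a minimal sketch of this kind of repeat-and-compare check, done from the shell rather than inside the program. It assumes the program prints its results to standard output; the function name check_determinism and the executable name ./cuda_test are made up for illustration.

```shell
# Run a command repeatedly and report whether its output is identical every time.
# Usage: check_determinism RUNS COMMAND [ARGS...]
check_determinism() {
    runs="$1"; shift
    reference=""
    i=1
    while [ "$i" -le "$runs" ]; do
        # Checksum the standard output of this run (cksum is a POSIX utility).
        current=$("$@" | cksum)
        if [ -z "$reference" ]; then
            # First run: remember its checksum as the reference.
            reference="$current"
        elif [ "$current" != "$reference" ]; then
            echo "run $i differs from run 1"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs runs identical"
}

# Example (./cuda_test is a hypothetical executable that prints its results):
# check_determinism 1000 ./cuda_test
```

Comparing checksums only says that two runs differ, not where. Dumping the raw output of a differing run to a file is the natural next step.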

To speed up the code I removed the -g -G flags (so -O3 is used by default, right?). The code is noticeably faster, but it is no longer deterministic!

Using the NVCC flags:

NVCCFLAGS = --compiler-options -fno-strict-aliasing --ptxas-options=-v -use_fast_math

the problem occurs very often (in about 10% of the runs in the loop).

Using these flags instead:

NVCCFLAGS = --compiler-options -fno-strict-aliasing --ptxas-options=-v -use_fast_math -prec-div=true -ftz=false -prec-sqrt=true -fmad=false

the problem occurs less often (in about 0.5% of the runs or fewer), but it still occurs!

Any idea?
Thanks for the help

EDIT: for completeness, here is the part of my .pro file that compiles the CUDA code (I am working in Qt):

CUDA_SOURCES += cuda_test.cu
CUDA_DIR = /usr/local/cuda-7.5/
CUDA_ARCH = sm_52
NVCCFLAGS = --compiler-options -fno-strict-aliasing --ptxas-options=-v -use_fast_math -prec-div=true -ftz=false -prec-sqrt=true -fmad=false 

INCLUDEPATH += $$CUDA_DIR/include
INCLUDEPATH += $$CUDA_DIR/samples/common/inc

QMAKE_LIBDIR += $$CUDA_DIR/lib64

LIBS += -L/usr/local/cuda-7.5/lib64/ \
        -lcuda \
        -lcudart

CUDA_INC = $$join(INCLUDEPATH,' -I','-I',' ')

cuda.input = CUDA_SOURCES
cuda.output = ${OBJECTS_DIR}${QMAKE_FILE_BASE}_cuda.o
cuda.commands = $$CUDA_DIR/bin/nvcc -m64 -arch=$$CUDA_ARCH -c $$NVCCFLAGS $$CUDA_INC $$LIBS  ${QMAKE_FILE_NAME} -o ${QMAKE_FILE_OUT}
cuda.dependency_type = TYPE_C
cuda.depend_command = $$CUDA_DIR/bin/nvcc -M $$CUDA_INC $$NVCCFLAGS   ${QMAKE_FILE_NAME}
QMAKE_EXTRA_COMPILERS += cuda

My crystal ball is currently malfunctioning, so I am afraid I have to see some cuda_test.cu source code before I can offer any assistance.

run cuda-memcheck, with both the memcheck and the racecheck tool
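
A rough sketch of what that could look like on the command line. The executable name ./cuda_test and the wrapper function run_cuda_checks are assumptions for illustration; the MEMCHECK variable is only there so the tool path can be overridden.

```shell
# Tool name can be overridden; cuda-memcheck ships with the CUDA toolkit.
: "${MEMCHECK:=cuda-memcheck}"

# Run the application under both tools. $1 is the executable to test.
run_cuda_checks() {
    app="$1"
    # memcheck: reports out-of-bounds and misaligned memory accesses
    $MEMCHECK --tool memcheck "$app"
    # racecheck: reports shared-memory data hazards between threads
    $MEMCHECK --tool racecheck "$app"
}

# Example (./cuda_test is a hypothetical executable name):
# run_cuda_checks ./cuda_test
```

Note that racecheck only covers shared memory, so a clean racecheck report does not rule out a race on global memory.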

if that does not help, toggle the flags one at a time, not both at the same time
first build with only -G removed (device debug off), then with only -g removed (host debug off)
this should help clarify whether the problem is more likely on the host side or the device side
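
As a sketch, the two builds could be produced like this. The -m64 and -arch=sm_52 options and the file name cuda_test.cu are taken from the .pro file above; the function name build_variant and the output file names are made up, and the other NVCCFLAGS are left out for brevity.

```shell
# Compiler can be overridden, e.g. NVCC=/usr/local/cuda-7.5/bin/nvcc
: "${NVCC:=nvcc}"

# Compile cuda_test.cu with a given set of debug flags.
# $1: label used in the object file name, $2: debug flags (may be empty)
build_variant() {
    # $2 is intentionally unquoted so that an empty value adds no argument.
    $NVCC -m64 -arch=sm_52 -c $2 cuda_test.cu -o "cuda_test_$1.o"
}

# -G removed: host debug info kept, device code optimized
# build_variant host_debug "-g"
# -g removed: device debug info kept, which also turns off device optimizations
# build_variant device_debug "-G"
```

If the output only becomes non-deterministic in the build without -G, that points at the device code.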

if the problem persists, a dirty solution is to split the program into 2 libraries
and then set the debug flags on one library, but not the other
this way you may be able to spot the functions/kernels that break once the flags are removed
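
A sketch of the split, assuming the kernels have been divided between two source files. The file names kernels_a.cu and kernels_b.cu and the function build_split are hypothetical; archiving the objects into libraries and linking them is left to the existing build.

```shell
# Compiler can be overridden, e.g. NVCC=/usr/local/cuda-7.5/bin/nvcc
: "${NVCC:=nvcc}"

# Compile the two halves of the program with different flags.
# $1: flags for the first half, $2: flags for the second half (either may be empty)
build_split() {
    # Flags are intentionally unquoted so that an empty value adds no argument.
    $NVCC -m64 -arch=sm_52 -c $1 kernels_a.cu -o kernels_a.o
    $NVCC -m64 -arch=sm_52 -c $2 kernels_b.cu -o kernels_b.o
}

# First half keeps device debug, second half is optimized:
# build_split "-G" ""
# Then swap and compare:
# build_split "" "-G"
```

Whichever half misbehaves once -G is removed can be split again in the same way until the offending kernel is isolated.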

to my knowledge, it is not possible to set these flags per function/kernel, although that would be a sensible feature from a debugging perspective