You don’t always need to allocate memory with cudaMalloc (though it is certainly the most straightforward way to manage it). But an array declared __device__ cannot be accessed directly on the host: you need cudaGetSymbolAddress followed by a cudaMemcpy. Are you just accessing g_array in the host code? That is likely the cause of your crash, because the host would be dereferencing an invalid pointer. In emulation mode, __device__ arrays actually live in host memory, so the same code works without any warnings.
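To make that concrete, here is a minimal sketch of reading a __device__ array back on the host. Only g_array comes from your post; the array size N, the fill kernel and the run_demo helper are made up for illustration, and the sketch assumes a reasonably recent toolkit (older ones use cudaThreadSynchronize instead of cudaDeviceSynchronize).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 8  // hypothetical size, not taken from the original code

// Statically allocated device array; the host must not index it directly.
__device__ int g_array[N];

// Hypothetical kernel: each thread writes the square of its index.
__global__ void fill(void)
{
    int i = threadIdx.x;
    if (i < N) g_array[i] = i * i;
}

// Returns 0 if the host read back the expected values, nonzero otherwise.
int run_demo(void)
{
    int host_copy[N];
    int *d_ptr = NULL;

    fill<<<1, N>>>();
    if (cudaGetLastError() != cudaSuccess) return 1;
    if (cudaDeviceSynchronize() != cudaSuccess) return 2;

    // Look up the device address of the symbol, then copy as usual.
    if (cudaGetSymbolAddress((void **)&d_ptr, g_array) != cudaSuccess) return 3;
    if (cudaMemcpy(host_copy, d_ptr, sizeof(host_copy),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 4;

    for (int i = 0; i < N; ++i)
        if (host_copy[i] != i * i) return 5;
    return 0;
}

int main(void)
{
    int rc = run_demo();
    printf(rc == 0 ? "readback ok\n" : "readback failed (step %d)\n", rc);
    return rc;
}
```

If you only need the copy and not the pointer, cudaMemcpyFromSymbol(host_copy, g_array, sizeof(host_copy)) should do the lookup and the transfer in one call.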
Ok, it seems like you are doing everything correctly then. Perhaps the best thing you can do at this point is to create a minimal test case file that reproduces your problem (preferably one that can be directly compiled with nvcc -o exec file.cu) and post it here. There has to be some little detail you missed somewhere.