Program crash - cudaMemcpy

When executing this code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

int main()
{
	const int ARRAY_SIZE = 1000000;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

	float h_in[ARRAY_SIZE];
	for (size_t i = 0; i < ARRAY_SIZE; i++)
		h_in[i] = 0;

	float *d_in, *d_out;

	cudaMalloc((void **)&d_in, ARRAY_BYTES);
	cudaMalloc((void **)&d_out, sizeof(float));

	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	return 0;
}

My program crashes, and I can’t seem to get any debugging output.

It’s a simplified bit from a tutorial I’m following (Udacity, chapter 3: reduce using global and shared memory). The tutorial code wouldn’t compile, so I narrowed it down to this.
It happens in both VS2015 and VS2017 with CUDA 9.1 and its latest three patches. With the code above, it only happens when I use an array size above 966144 bytes, and even then only occasionally.

I’m running this on an i7-6800K with a GTX 1080.

Modify the line that declares h_in to:

static float h_in[ARRAY_SIZE];

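In general, you can’t allocate that much data on the program stack, and this isn’t CUDA-specific. A local array of 1,000,000 floats needs about 4 MB of stack, which is more than typical default stack sizes allow (for example, 1 MB by default with the MSVC linker). Declaring the array static gives it static storage, so it no longer lives on the stack.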
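
Another common option is to put the host buffer on the heap instead. The sketch below is only an illustration, not code from the tutorial: make_host_input and byte_size are hypothetical helper names, and it assumes std::vector is acceptable in your host code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the tutorial): builds the host input buffer.
// std::vector keeps its elements in dynamically allocated (heap) memory,
// so the buffer size is not limited by the thread's stack.
std::vector<float> make_host_input(std::size_t count)
{
    return std::vector<float>(count, 0.0f);  // `count` elements, all zero
}

// Hypothetical helper: size of the buffer in bytes, i.e. the value that
// would be passed as the byte count to cudaMalloc / cudaMemcpy.
std::size_t byte_size(const std::vector<float>& v)
{
    return v.size() * sizeof(float);
}
```

In the program above this would replace the h_in array, and the copy would become something like cudaMemcpy(d_in, h_in.data(), byte_size(h_in), cudaMemcpyHostToDevice), where h_in is the vector returned by make_host_input(ARRAY_SIZE).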

Ok thanks!

I have much to learn about memory allocation and programming in general.