Writing to shared memory from different blocks?

I’m not sure whether I should use shared memory, but global memory doesn’t work, or at least I don’t know how to make it work…

I have a use case where I want to build a frequency table from a dataset.
The dataset can contain millions of numbers, so I’ll end up with a lot of blocks. The frequency table is a small array, maybe 10 elements.

This code shows what I want to do, but it obviously doesn’t work because all the threads update the output concurrently…

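__global__ void device_frequency(long* input, int* output) {
	// Linear index for a 2D grid of 1D blocks; N (the element count) is defined elsewhere.
	long index = ((long)blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	if (index < N) {
		int number = input[index];  // read only after the bounds check
		// Data race: many threads read-modify-write the same counter, so increments get lost.
		output[number] = output[number] + 1;
	}
}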

I somehow need to coordinate the threads so that they can all safely update the output array that holds the frequencies.
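
One common way to get this kind of coordination is atomic increments. Shared memory is private to each block, so blocks cannot write into one another's shared memory; the usual pattern is a small per-block table in shared memory that is merged into the global table at the end. Below is a minimal sketch of that idea, not a definitive implementation. The names device_frequency_atomic and NUM_BINS, the extra n parameter, the 1D launch, and the assumption that every value lies in [0, NUM_BINS) are all illustrative and not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_BINS 10  // assumption: the table has 10 entries and values lie in [0, NUM_BINS)

// Hypothetical kernel: builds a per-block histogram in shared memory,
// then merges it into the global table with atomic adds.
__global__ void device_frequency_atomic(const long* input, int* output, long n) {
	__shared__ int local[NUM_BINS];  // visible only to threads of this block

	// Zero the per-block table.
	for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) local[i] = 0;
	__syncthreads();

	long index = (long)blockIdx.x * blockDim.x + threadIdx.x;
	if (index < n) {
		long number = input[index];
		// atomicAdd makes the read-modify-write indivisible, so no increment is lost.
		if (number >= 0 && number < NUM_BINS) atomicAdd(&local[number], 1);
	}
	__syncthreads();

	// Merge this block's counts into the global table.
	for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) atomicAdd(&output[i], local[i]);
}

int main() {
	const long n = 1 << 20;  // made-up test size
	long* input = nullptr;
	int* output = nullptr;
	cudaMallocManaged(&input, n * sizeof(long));
	cudaMallocManaged(&output, NUM_BINS * sizeof(int));

	for (long i = 0; i < n; ++i) input[i] = i % NUM_BINS;  // synthetic data
	for (int i = 0; i < NUM_BINS; ++i) output[i] = 0;      // the table must start at zero

	const int threads = 256;
	const int blocks = (int)((n + threads - 1) / threads);
	device_frequency_atomic<<<blocks, threads>>>(input, output, n);
	cudaDeviceSynchronize();

	long total = 0;
	for (int i = 0; i < NUM_BINS; ++i) {
		printf("bin %d: %d\n", i, output[i]);
		total += output[i];
	}
	printf("total counted: %ld of %ld\n", total, n);

	cudaFree(input);
	cudaFree(output);
	return 0;
}
```

With a table this small, calling atomicAdd directly on the global output array would also give correct counts; the shared-memory step only reduces how many threads contend for the same global addresses.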