Number of threads for kernel calls

I have a problem with my neural net calculation. The number of threads should be equal to the number of neurons in a layer.

My problem: as soon as I have more than 512 neurons, the calculation fails.

Because the CPU code works with more than 512 neurons, and the GPU code also works up to 512 neurons, I think the problem lies in the way I call my kernels.

If I have 1024 neurons/threads to handle, do I have to use <<<2,512>>> below, or should <<<1,1024>>> work too?

And what exactly happens when I request more threads per block than my device supports?
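For reference, a minimal sketch of how the device limit and the launch status can be queried (the dummy kernel here is just a placeholder, not one of my kernels):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only used to provoke the launch error below.
__global__ void dummyKernel(float *p)
{
    if (p) p[threadIdx.x] = 0.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 512 on compute capability 1.x devices, 1024 on 2.x and newer.
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

    // Asking for more threads per block than the device supports does not crash the host;
    // the kernel simply never runs and the runtime records an error.
    dummyKernel<<<1, prop.maxThreadsPerBlock + 1>>>(NULL);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}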

for(int i = 0; i < iRuns; i++) {
    std::cout << "Training run: " << i << std::endl;

    // forward pass, layer by layer
    for(int y = 0; y < Weights.GetH(); y++) {
        devRunFW <<< 1, Neurons.GetW() >>>
                (pNet[y].pNeurons,
                 pNet[y].pWeights,
                 pNet[y+1].pNeurons,
                 Weights.GetD() );
    }

    // delta of the output layer
    devUpdateOutpDelta <<< 1, iInpS >>>
            (pNet[Neurons.GetH()-1].pNeurons,
             pNet[Neurons.GetH()-1].pErrors,
             pOut_dev);

    // backpropagate the error
    for(int y = Weights.GetH()-1; y >= 0; y--) {
        devCalcErrorDelta <<< 1, Neurons.GetW() >>>
                (pNet[y].pNeurons,
                 pNet[y].pWeights,
                 pNet[y].pErrors,
                 pNet[y+1].pErrors,
                 Weights.GetD() );
    }

    // adapt the weights
    for(int y = Weights.GetH()-1; y >= 0; y--) {
        devAdaptWeights <<< 1, Neurons.GetW() >>>
                (pNet[y].pNeurons,
                 pNet[y].pWeights,
                 pNet[y].pErrors,
                 pNet[y+1].pErrors,
                 Weights.GetD(), fLearningRate);
    }
}

Hmm, I found the problem, and it was as expected: 1024 threads must be split into 1024/512 = 2 blocks.
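A sketch of what I mean for the forward-pass loop, reusing the names from the code above (BLOCK_SIZE = 512 is just an assumption for my device):

const int BLOCK_SIZE = 512;                               // assumed per-block limit of the device
int nNeurons = Neurons.GetW();                            // one thread per neuron
int nBlocks  = (nNeurons + BLOCK_SIZE - 1) / BLOCK_SIZE;  // round up: 1024 -> 2, 768 -> 2

for(int y = 0; y < Weights.GetH(); y++) {
    devRunFW <<< nBlocks, BLOCK_SIZE >>>
            (pNet[y].pNeurons,
             pNet[y].pWeights,
             pNet[y+1].pNeurons,
             Weights.GetD() );
}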

But if I use 768 neurons in 2 blocks with 512 threads, the program doesn't crash. Why?
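With 768 neurons and <<<2,512>>> there are 1024 threads, so 256 of them have no neuron to work on. A hypothetical guard inside a kernel like devRunFW (the parameter types and the extra nNeurons argument are assumptions, just to illustrate the idea):

__global__ void devRunFW(float *pNeuronsIn, float *pWeights,
                         float *pNeuronsOut, int iWidth, int nNeurons)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if (idx >= nNeurons)   // surplus thread in the last block: do nothing
        return;

    // ... forward-pass work for neuron 'idx' ...
}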