Summing up the per-block results

Using the reduction sample code, I was able to sum a large array, but the result comes out as one partial sum per block.
Is there an easy way to add up these per-block sums on the device side?

I presume you are asking how to complete the final stage of the optimized multi-block parallel reduction example in the SDK on the device side?

The way to do it is to launch a second reduction kernel with one block after the first reduction has finished. The standard CUDA execution model provides no barrier across blocks within a single kernel launch, so a running block cannot know whether all the other blocks have run and completed their reduction operations, and therefore cannot safely perform the last reduction stage itself.
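To make the two-launch approach concrete, here is a minimal sketch. It is not the SDK's optimized reduction kernel: `blockSum` and `sumOnDevice` are hypothetical names made up for this example, and the launch configuration (64 blocks of 256 threads) is just an assumption. The kernel also assumes the block size is a power of two. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not the SDK version): each block sums its share of
// `in` in shared memory and writes ONE partial sum to out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates several elements,
    // so any n works with any grid size.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums n floats using two kernel launches.
float sumOnDevice(const float *h_in, int n)
{
    const int threads = 256;  // assumed block size (power of two)
    const int blocks  = 64;   // assumed grid size for the first pass

    float *d_in, *d_partial, *d_total;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: many blocks, one partial sum per block.
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);

    // Pass 2: a single block reduces the per-block partial sums.
    // Kernels issued to the same stream run in order, so this launch
    // only starts after every block of pass 1 has finished.
    blockSum<<<1, threads, threads * sizeof(float)>>>(d_partial, d_total, blocks);

    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

The kernel launch boundary acts as the global synchronization point here: the partial sums never leave the device, and only the final single value is copied back to the host.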