There are tensor cores, INT32 cores, FP32 cores, and FP64 cores in a Volta GPU. When a program is using the tensor cores to perform training, it seems that all the other cores are idle. Is there a way to fully utilize all the cores so that all of them can process something at the same time?

It is almost certainly not the case that the other cores sit idle. All operations on a GPU are pipelined. The compiler will seek to schedule independent instructions in such a way that the different compute pipelines have overlapping work.

For example, integer indexing calculations may be flowing through the INT32 pipes in a Volta GPU at the same time that the tensor cores are busy computing a particular matrix-matrix multiply result.

The integer calculations are needed so that each thread can determine the next set of inputs for the next matrix-multiply step.
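
To make that concrete, here is a minimal sketch (not taken from any particular training framework) of a single warp computing one 16x16 output tile with the CUDA WMMA API. The kernel name `tile_matmul`, the helper `count_mismatches`, the sizes, and the test data are all made up for illustration. Inside the loop, the pointer offsets for the next pair of input tiles are plain integer arithmetic, while `wmma::mma_sync` is the tensor core work. Whether and how much the two actually overlap is up to the compiler and the hardware scheduler; the code only shows that both kinds of work live in the same instruction stream. It assumes a GPU with tensor cores and a build along the lines of `nvcc -arch=sm_70`.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int TILE = 16;  // WMMA tile edge (16x16x16 shape)
constexpr int K = 64;     // shared dimension, a multiple of TILE

// One warp computes C (TILE x TILE) = A (TILE x K) * B (K x TILE), all row-major.
__global__ void tile_matmul(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += TILE) {
        // Integer work: locate the next input tiles for this k step.
        const half* a_tile = a + k;         // columns k..k+15 of A
        const half* b_tile = b + k * TILE;  // rows k..k+15 of B
        wmma::load_matrix_sync(a_frag, a_tile, K);
        wmma::load_matrix_sync(b_frag, b_tile, TILE);
        // Tensor core work: acc += a_frag * b_frag.
        wmma::mma_sync(acc, a_frag, b_frag, acc);
    }
    wmma::store_matrix_sync(c, acc, TILE, wmma::mem_row_major);
}

// Runs the kernel on small integer-valued data and compares against a CPU
// reference. Returns the number of mismatching output elements.
int count_mismatches() {
    std::vector<half> ha(TILE * K), hb(K * TILE);
    std::vector<float> ref(TILE * TILE, 0.0f), hc(TILE * TILE, -1.0f);
    for (int i = 0; i < TILE; ++i)
        for (int k = 0; k < K; ++k)
            ha[i * K + k] = __float2half(float((i + k) % 3));
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < TILE; ++j)
            hb[k * TILE + j] = __float2half(float((k + 2 * j) % 5));
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
            for (int k = 0; k < K; ++k)
                ref[i * TILE + j] += float((i + k) % 3) * float((k + 2 * j) % 5);

    half *da = nullptr, *db = nullptr;
    float* dc = nullptr;
    cudaMalloc(&da, ha.size() * sizeof(half));
    cudaMalloc(&db, hb.size() * sizeof(half));
    cudaMalloc(&dc, hc.size() * sizeof(float));
    cudaMemcpy(da, ha.data(), ha.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), hb.size() * sizeof(half), cudaMemcpyHostToDevice);

    tile_matmul<<<1, 32>>>(da, db, dc);  // exactly one warp
    cudaMemcpy(hc.data(), dc, hc.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    int bad = 0;
    for (int i = 0; i < TILE * TILE; ++i)
        if (std::fabs(hc[i] - ref[i]) > 1e-3f) ++bad;
    return bad;
}

int main() {
    std::printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

In a real kernel the index math is usually heavier than this (block and warp offsets, bounds checks, shared-memory staging), which gives the scheduler more independent integer instructions to issue while earlier matrix-multiply instructions are still in flight. A profiler such as Nsight Compute is the way to see how busy each pipe actually is for a given kernel.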

Getting every single functional unit to be fully saturated with activity on every clock cycle, forever, would certainly be impossible.

At this level of generality, I doubt any of this is different from how a modern CPU behaves.