A Volta GPU has tensor cores, INT32 cores, FP32 cores, and FP64 cores. When a program is using the tensor cores to perform training, it seems that all the other cores are idle. Is there a way to fully utilize all the cores so that all of them are processing something at the same time?
How do you know that?
It is almost certainly not the case. All operations on a GPU are pipelined. The compiler will seek to schedule independent instructions in such a way that compute pipelines can have overlapping work.
For example, integer indexing calculations may be flowing through the INT32 pipes in a Volta at the same time that the tensor cores are busy computing a particular matrix-matrix multiply result.
The integer calculations are needed so that each thread can determine the next set of inputs for the next matrix-multiply step.
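To make that concrete, here is a rough sketch of my own (not taken from any particular library) using the CUDA WMMA API. The kernel name `wmma_gemm`, the matrix sizes, the one-warp-per-tile layout, and the all-ones test data are all assumptions chosen for illustration. Inside the loop, the pointer arithmetic for the next pair of tiles is integer work that is independent of the `mma_sync` on the current tiles, so the compiler is free to schedule it alongside the tensor core instruction. Whether it actually does so for a given build is something you would have to confirm by inspecting the generated machine code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Tile edge for the 16x16x16 half-precision WMMA shape.
constexpr int TILE = 16;

// Illustrative kernel: one warp computes one 16x16 tile of C = A * B.
// A is M x K row-major, B is K x N column-major, C is M x N row-major.
// Assumes M, N and K are multiples of 16 and each block is a single warp.
__global__ void wmma_gemm(const half* A, const half* B, float* C, int N, int K)
{
    // Integer work: which output tile this warp owns.
    int tile_row = blockIdx.y;
    int tile_col = blockIdx.x;

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += TILE) {
        // Integer address calculation for the next pair of input tiles.
        const half* a_tile = A + tile_row * TILE * K + k;
        const half* b_tile = B + tile_col * TILE * K + k;

        wmma::load_matrix_sync(a_frag, a_tile, K);
        wmma::load_matrix_sync(b_frag, b_tile, K);

        // Tensor core work: acc += a_frag * b_frag.
        wmma::mma_sync(acc, a_frag, b_frag, acc);
    }

    float* c_tile = C + tile_row * TILE * N + tile_col * TILE;
    wmma::store_matrix_sync(c_tile, acc, N, wmma::mem_row_major);
}

int main()
{
    const int M = 32, N = 32, K = 64;  // arbitrary sizes, multiples of 16

    // All-ones inputs, so every element of C should equal K.
    std::vector<half> hA(M * K, __float2half(1.0f));
    std::vector<half> hB(K * N, __float2half(1.0f));
    std::vector<float> hC(M * N, 0.0f);

    half *dA = nullptr, *dB = nullptr;
    float *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(half));
    cudaMalloc(&dB, hB.size() * sizeof(half));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(half), cudaMemcpyHostToDevice);

    dim3 grid(N / TILE, M / TILE);      // one block per output tile
    wmma_gemm<<<grid, 32>>>(dA, dB, dC, N, K);  // 32 threads = one warp

    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    bool ok = true;
    for (float v : hC) {
        if (v != static_cast<float>(K)) ok = false;
    }
    printf("%s\n", ok ? "result matches" : "result mismatch");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok ? 0 : 1;
}
```

This needs a GPU with tensor cores and should be built for compute capability 7.0 or higher (for example `nvcc -arch=sm_70`). Dumping the SASS with `cuobjdump -sass` on the resulting binary lets you see how the integer instructions and the tensor core instructions end up interleaved.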
Getting every single functional unit to be fully saturated with activity on every clock cycle, forever, would certainly be impossible.
In general terms, I doubt any of this is different from how a modern CPU behaves.