I’m new to CUDA, and this is my first post.
I’m using a Tesla V100 for my project.
I know that the Volta architecture can execute INT32 and FP32 instructions concurrently, but I don’t know about the INT32 and FP64 pair.
I want to do polynomial multiplications in parallel, so I want to run an FFT and an NTT concurrently to use the GPU’s resources as much as possible.
More precisely, I’m using INT64 operations in the NTT, but, as far as I know, INT64 operations are executed as combinations of INT32 instructions.
Yes. INT32 can be executed concurrently with basically any other non-INT instruction.
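As a rough sketch of that idea, a single kernel can interleave FP32 butterfly arithmetic with INT64 modular arithmetic, giving the warp scheduler independent instructions for the different pipes. All names, parameters, and the modulus handling below are hypothetical, not a definitive implementation:

```cuda
// Hypothetical sketch: interleaving FP32 FFT work (FMA pipe) with INT64
// NTT work (lowered to INT32-class instructions). Illustrative only.
__global__ void fft_ntt_mixed(const float2 *fft_in, float2 *fft_out,
                              const unsigned long long *ntt_in,
                              unsigned long long *ntt_out,
                              float2 twiddle, unsigned long long q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // FP32 complex multiply by a twiddle factor (FP32 FMA instructions).
    float2 a = fft_in[i];
    float2 t;
    t.x = a.x * twiddle.x - a.y * twiddle.y;
    t.y = a.x * twiddle.y + a.y * twiddle.x;
    fft_out[i] = t;

    // INT64 modular addition for the NTT; the 64-bit add is compiled to
    // 32-bit integer instructions, which can overlap with the FP32 work.
    unsigned long long s = ntt_in[i] + ntt_in[(i + 1) % n];
    ntt_out[i] = (s >= q) ? s - q : s;
}
```

Whether the two instruction streams actually overlap depends on the compiled instruction mix and occupancy, so it is worth profiling rather than assuming.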
Thanks a lot for your quick reply.
I’ll try the idea.
GV100 has several math pipelines:
FMA pipe - executes FP32 instructions and IMAD (integer multiply-add)
ALU pipe - executes INT32 (not IMAD), logical operations, binary operations, and data movement operations
FP64 pipe - executes FP64 instructions
FP16 pipe - executes FP16x2 instructions
Tensor pipe - executes matrix multiply and accumulate instructions
The FP64, FP16, and Tensor pipes share the same dispatch port, so you cannot dispatch to more than one of these pipes at the same time.
The FMA and ALU pipes each have a separate dispatch port. It takes 2 cycles to dispatch a warp to each of these pipes (each pipe is 16 cores wide).
Concurrent execution is done by alternating instruction dispatch to different pipes.
On GV100, INT64 math is implemented by various units, including:
- FMA pipe - IMAD instruction
- ALU pipe - LEA instruction
The answer to your question is not straightforward.
Pipeline utilization metrics for GV100 are available in the CUDA 10.1 (and later) tools. Nsight Compute 2019.1 adds pipeline utilization to the Compute Workload Analysis section.