Instruction-level parallelism and maximum instructions per clock

Hi. My device is a GeForce 940MX, which is compute capability 5.0.
According to the CUDA programming guide, it can perform 128 FP32 additions per clock per SM, which means that at each instruction-issue clock cycle, each of the 4 schedulers should issue 32 operations to the 32 threads of a warp. From that, the device should be able to reach 4 instructions per clock in the Issue Efficiency section of the Nsight profiler if a kernel consists entirely of FP32 additions, or maybe a little less, say 2 instructions per clock, because the profiler also counts the kernel's other operations.
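To spell out that arithmetic (a quick sketch, using the programming guide's figure of 128 FP32 ops per SM per clock, 32-thread warps, and 4 schedulers per SM):

```python
# Peak FP32 issue-rate arithmetic for a CC 5.0 SM (GeForce 940MX).
fp32_ops_per_sm_per_clock = 128   # from the programming guide's throughput table
threads_per_warp = 32
schedulers_per_sm = 4

# One issued FADD produces one result per thread of the warp, so the
# SM-level instruction rate is ops divided by warp width:
sm_ipc_limit = fp32_ops_per_sm_per_clock / threads_per_warp
print(sm_ipc_limit)        # 4.0 instructions per clock per SM

# Spread evenly across the schedulers, each one can issue at most:
smsp_ipc_limit = sm_ipc_limit / schedulers_per_sm
print(smsp_ipc_limit)      # 1.0 instruction per clock per scheduler
```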

But regardless of the code I write and run, I can never achieve higher than 0.95 IPC or about 0.9 eligible warps per SM.
Is there any code that you have tested on your device that gives you higher than 1 in those two categories, so I can run and test it? Even the simplest code, like the one below, won't get the device to reach 1 eligible warp/SM or 1 IPC.
Here is the simplest one I tried:

__global__ void mytest() {
	float a = 1.23f;
	float b = 3.32f;
	float c;
//#pragma unroll 100
	for (int i = 0; i < 1000; i++)
	{
		c = a + b;

	}

}

I know the for loop adds some extra integer addition and condition checking, but does that affect performance enough to pull IPC and eligible warps below 1?

And about occupancy: since the code is simple and not bound by anything, I launch it with <<<96, 64>>> to use all 6144 possible threads across the three SMs.
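Checking that launch arithmetic (a sketch, assuming the Maxwell limit of 2048 resident threads per SM and the 940MX's 3 SMs):

```python
# <<<96, 64>>> on a 3-SM Maxwell GPU.
blocks = 96
threads_per_block = 64
total_threads = blocks * threads_per_block
print(total_threads)       # 6144 threads in the grid

sms = 3
threads_per_sm = total_threads // sms
print(threads_per_sm)      # 2048 -> the Maxwell per-SM resident-thread limit

warps_per_sm = threads_per_sm // 32
print(warps_per_sm)        # 64 warps per SM, i.e. 16 per scheduler
```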

You probably want to learn how to use the CUDA binary utilities when doing this type of work:

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

When I compile your code, I get an “empty” kernel. The compiler recognizes that there isn’t any global state being modified, and so generates a kernel that does nothing.

Here’s a modification I wrote that gives an issued_ipc of about 3.6 on a GTX 960 with CUDA 10.1:

$ cat t22.cu
#include <stdio.h>
__global__ void mytest(float c, float a) {
        float a1 = a;
        float c1 = c;
        float a2 = a+1;
        float c2 = c+1;
        float a3 = a+2;
        float c3 = c+2;
        float a4 = a+3;
        float c4 = c+3;
#pragma unroll 256
        for (int i = 0; i < 256; i++)
        {
                c1 += a1;
                c2 += a2;
                c3 += a3;
                c4 += a4;
        }
        if (c1 == 0) printf("?");
        if (c2 == 0) printf("!");
        if (c3 == 0) printf(".");
        if (c4 == 0) printf("*");
}

int main(){

        mytest<<<256,256>>>(0.1f, 0.2f);
        cudaDeviceSynchronize();
}
$ nvcc -arch=sm_52 -o t22 t22.cu
$ nvprof --metrics issued_ipc ./t22
==9514== NVPROF is profiling process 9514, command: ./t22
==9514== Profiling application: ./t22
==9514== Profiling result:
==9514== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 960 (0)"
    Kernel: mytest(float, float)
          1                                issued_ipc                                Issued IPC    3.612739    3.612739    3.612739
$

Take the above, compile it (with -arch=sm_50 to match your GPU), and then use:

cuobjdump -sass t22

Do the same thing with your code, and compare the difference in the generated machine code.

Note that the above level of complexity in the kernel code is probably not necessary. I was able to get 3.4 IPC with just a single line in the loop body:

c1 += a1;

but I’ll leave the further experimentation and study up to you. Increasing the block count also increases the IPC, according to my testing.

Thank you for your answer.
Even the code you sent caps at 0.99 for both IPC and eligible warps.
Is there any way to figure out why this happens?
For example, by comparing the cuobjdump output of this code on your machine and mine? Maybe they compile differently?
If so, could you please send the -sass output?
And if not, any suggestions?

And about the link you sent: it's about cuobjdump's options rather than understanding its output, so I think I'll have to read the full PTX ISA at https://docs.nvidia.com/cuda/parallel-thread-execution/index.html for that.

I simplified the code:

$ cat t22.cu
#include <stdio.h>
__global__ void mytest(float c, float a) {
#pragma unroll 256
        for (int i = 0; i < 256; i++)
                c += a;
        if (c == 0) printf("?");
}

int main(){

        mytest<<<8192,256>>>(0.1f, 0.2f);
        cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t22 t22.cu
$ nvprof --metrics issued_ipc,ipc,eligible_warps_per_cycle ./t22
==10405== NVPROF is profiling process 10405, command: ./t22
==10405== Profiling application: ./t22
==10405== Profiling result:
==10405== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 960 (0)"
    Kernel: mytest(float, float)
          1                                issued_ipc                                Issued IPC    3.855329    3.855329    3.855329
          1                                       ipc                              Executed IPC    3.855097    3.855097    3.855097
          1                  eligible_warps_per_cycle           Eligible Warps Per Active Cycle    4.242007    4.242007    4.242007
$ cuobjdump -sass ./t22 

Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_50

Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_50
                Function : _Z6mytestff
        .headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                       /* 0x001fd800fec007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;              /* 0x4c98078000870001 */
        /*0010*/                   MOV R0, c[0x0][0x144] ;             /* 0x4c98078005170000 */
        /*0018*/                   FADD R0, R0, c[0x0][0x140] ;        /* 0x4c58000005070000 */
                                                                       /* 0x001fd800fec007f6 */
        /*0028*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0030*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0038*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007e6 */
        /*0048*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0050*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0058*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007f6 */
        /*0068*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0070*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0078*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800ffa007e6 */
        /*0088*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0090*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0098*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007f6 */
        /*00a8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00b0*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00b8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fcc007f6 */
        /*00c8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00d0*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00d8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007f6 */
        /*00e8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00f0*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*00f8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001f9800ffa007f6 */
        /*0108*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0110*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0118*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007f6 */
        /*0128*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */

The above pattern repeats until the final output:

        /*0a48*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a50*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a58*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007f6 */
        /*0a68*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a70*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a78*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800ffa007f6 */
        /*0a88*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a90*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0a98*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fd800fec007e6 */
        /*0aa8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0ab0*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
        /*0ab8*/                   FADD R0, R0, c[0x0][0x144] ;        /* 0x4c58000005170000 */
                                                                       /* 0x001fc400ffa007ed */
        /*0ac8*/                   FSETP.NEU.AND P0, PT, R0, RZ, PT ;  /* 0x5bbd03800ff70007 */
        /*0ad0*/               @P0 EXIT ;                              /* 0xe30000000000000f */
        /*0ad8*/                   MOV32I R4, 0x0 ;                    /* 0x010000000007f004 */
                                                                       /* 0x001fd800fe2007f1 */
        /*0ae8*/                   MOV32I R5, 0x0 ;                    /* 0x010000000007f005 */
        /*0af0*/                   MOV R6, RZ ;                        /* 0x5c9807800ff70006 */
        /*0af8*/                   MOV R7, RZ ;                        /* 0x5c9807800ff70007 */
                                                                       /* 0x001fbc00fde007fd */
        /*0b08*/                   JCAL 0x0 ;                          /* 0xe220000000000040 */
        /*0b10*/                   NOP ;                               /* 0x50b0000000070f00 */
        /*0b18*/                   NOP ;                               /* 0x50b0000000070f00 */
                                                                       /* 0x001ffc00ffe007e4 */
        /*0b28*/                   NOP ;                               /* 0x50b0000000070f00 */
        /*0b30*/                   EXIT ;                              /* 0xe30000000007000f */
        /*0b38*/                   BRA 0xb38 ;                         /* 0xe2400fffff87000f */
                ......................

Fatbin ptx code:
================
arch = sm_50
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

and again thank you a lot for your answer.

Still no luck; 0.96 it is.
I will try these on a 1060 soon to check whether it's a problem with my device.

@amintahertalari Please specify what tool and metrics you are using. Ideally, execute the same commands as Robert. Nsight Compute and Perfworks-based tools can show IPC at the SM level or the SM sub-partition (warp scheduler) level. The maximum IPC at each level for CC 5.x–6.x is:

SM

  • 6.0 instructions/cycle for mixed instructions
  • 4.0 instructions/cycle for FADD only

SMSP (sub-partition, 4 per SM)

  • 1.5 instructions/cycle for mixed instructions
  • 1.0 instructions/cycle for FADD only
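The FADD-only ceilings follow directly from the 128-ops/SM/clock throughput figure, and the mixed-instruction ceilings from the per-scheduler dual-issue factor quoted above (a sketch of the arithmetic):

```python
# FADD-only: 128 FP32 results per SM per clock, 32-thread warps, 4 schedulers.
fadd_sm = 128 / 32            # 4.0 instructions/cycle at the SM level
fadd_smsp = fadd_sm / 4       # 1.0 instruction/cycle per sub-partition
print(fadd_sm, fadd_smsp)

# Mixed instructions: each scheduler can dual-issue certain independent
# instruction pairs, giving the 1.5/cycle per-scheduler ceiling above.
mixed_smsp = 1.5
mixed_sm = mixed_smsp * 4
print(mixed_sm)               # 6.0 instructions/cycle at the SM level
```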

If you are not achieving this and the tool is reporting 0.96, then please review the SASS assembly code. You may be executing a debug build (device debug builds compile with nvcc -G, which disables device-side optimization).

Robert’s results match my expectations on a CC 5.2 device. On a CC 5.0 device I expect eligible warps to be ~2.6, which will limit the maximum IPC to 2.6 (instead of near 4.0). To improve this, the kernel needs instruction-level parallelism. Robert’s first kernel has instruction-level parallelism; your kernel and the one Robert measured have zero instruction-level parallelism in the unrolled loop, since each FADD depends on the previous result.
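The eligible-warps limit can be pictured with a Little's-law sketch. The ~6-cycle dependent FADD latency below is my assumption for Maxwell, not an official figure:

```python
# Why a dependent FADD chain starves the schedulers (Little's-law sketch).
latency_cycles = 6     # assumed dependent-issue latency of FADD on Maxwell
issue_rate = 1.0       # FADD-only ceiling per scheduler (instructions/cycle)

# Warps each scheduler must keep eligible to hide that latency:
warps_needed = latency_cycles * issue_rate
print(warps_needed)                   # ~6 eligible warps per scheduler

# With zero ILP, a warp offers one ready instruction every latency_cycles:
per_warp_ipc = 1 / latency_cycles
print(round(per_warp_ipc, 3))         # ~0.167 instructions/cycle per warp

# Four independent accumulator chains (as in Robert's first kernel)
# quadruple the supply of ready instructions per warp:
print(round(4 / latency_cycles, 3))   # ~0.667 instructions/cycle per warp
```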

Launching larger thread blocks is likely to improve your results slightly: thread-block launch and completion have much higher overhead than warp launch and completion.

Thanks, Greg.
This is unbelievable.
Just as you said, the problem was that the project was set to Debug; setting it to Release increased the numbers to exactly what you said to expect.

Thank you both. This was really helpful.