==PROF== Connected to process 767673 (/usr/bin/python3.8)
[==PROF== per-launch progress: 300 kernel launches profiled, 9 passes each ("0%....50%....100% - 9 passes" per launch). Kernel names seen, in launch order: vectorized_elementwise_kernel, distribution_elementwise_grid..., indexSelectLargeIndex, fused_dropout_kernel_vec, transpose_readWrite_alignment..., Kernel, GRU_elementWise_fp, CatArrayBatchedCopy, unrolled_elementwise_kernel, ampere_sgemm_32x32_sliced1x4_tn, reduce_kernel, softmax_warp_forward, gemv2N_kernel]
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

[767673] python3.8@127.0.0.1
  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:13, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.71
    SM Frequency                        cycle/nsecond            1.20
    Elapsed Cycles                      cycle                   2,990
    Memory [%]                          %                        4.82
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.50
    L1/TEX Cache Throughput             %                        8.10
    L2 Cache Throughput                 %                        4.82
    SM Active Cycles                    cycle                  878.72
    Compute (SM) [%]                    %                        0.82
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     147
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  9,408
    Waves Per SM                                                 0.14
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        8.37
    Achieved Active Warps Per SM        warp                     4.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:13, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.56
    SM Frequency                        cycle/nsecond            1.18
    Elapsed Cycles                      cycle                   3,018
    Memory [%]                          %                        5.43
    DRAM Throughput                     %                        0.02
    Duration                            usecond                  2.56
    L1/TEX Cache Throughput             %                        8.97
    L2 Cache Throughput                 %                        5.43
    SM Active Cycles                    cycle                  912.01
    Compute (SM) [%]                    %                        0.93
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.2 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     170
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 10,880
    Waves Per SM                                                 0.16
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        9.31
    Achieved Active Warps Per SM        warp                     4.47
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (9.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
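As a cross-check on the two FillFunctor reports above, the occupancy and wave figures follow directly from the launch statistics. The short sketch below is not ncu output; it is plain Python, and the 68-SM count and the 48-warps-per-SM baseline are assumptions read off the report's own warnings and percentages.

    # Re-derive the reported numbers from the launch statistics above. Assumes the device
    # limits implied by the report: 68 SMs, 48 resident warps per SM at 100% occupancy.
    NUM_SMS = 68
    MAX_WARPS_PER_SM = 48

    def occupancy_and_waves(grid_size, block_size, block_limits):
        warps_per_block = block_size // 32
        blocks_per_sm = min(block_limits)             # tightest of the reported "Block Limit" rows
        theoretical_warps = blocks_per_sm * warps_per_block
        theoretical_occupancy = 100.0 * theoretical_warps / MAX_WARPS_PER_SM
        waves_per_sm = grid_size / (blocks_per_sm * NUM_SMS)
        return theoretical_occupancy, waves_per_sm

    # First FillFunctor launch: grid 147, block 64, block limits 16/64/16/24
    print(occupancy_and_waves(147, 64, [16, 64, 16, 24]))   # (66.67, 0.135) -> "0.14 Waves Per SM"
    # Second FillFunctor launch: grid 170, block 64
    print(occupancy_and_waves(170, 64, [16, 64, 16, 24]))   # (66.67, 0.156) -> "0.16 Waves Per SM"

At roughly a seventh of a wave, these launches are dominated by launch latency rather than by anything in the kernel body, which is what the "grid is too small" warning is pointing at.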
  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            9.07
    SM Frequency                        cycle/nsecond            1.35
    Elapsed Cycles                      cycle                  11,550
    Memory [%]                          %                       17.27
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  8.51
    L1/TEX Cache Throughput             %                       19.71
    L2 Cache Throughput                 %                       17.27
    SM Active Cycles                    cycle                9,466.43
    Compute (SM) [%]                    %                       55.69
    ---------------------------------- ---------------- ------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     408
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                104,448
    Waves Per SM                                                    1
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       71.10
    Achieved Active Warps Per SM        warp                    34.13
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
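For orientation: the demangled name of this kernel shows it is instantiated through at::native::distribution_nullary_kernel and at::native::templates::cuda::normal_kernel, i.e. a normal-distribution fill, and the FillFunctor launches around it are plain constant fills. A hypothetical snippet of the kind of PyTorch code that emits exactly these two kernel families (not the actual profiled script; sizes are arbitrary):

    # Hypothetical reproduction, not the profiled script: a normal_ fill and a constant fill
    # on CUDA tensors launch the two kernel families reported above.
    import torch

    w = torch.empty(256, 408, device="cuda")
    w.normal_()                    # -> distribution_elementwise_grid_stride_kernel (normal_kernel path)
    b = torch.empty(4096, device="cuda")
    b.fill_(0.0)                   # -> vectorized_elementwise_kernel<FillFunctor>
    torch.cuda.synchronize()

If this initialization traffic is not of interest, ncu's launch skip/count filters can restrict profiling to the later, model-related launches.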
  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.77
    SM Frequency                        cycle/nsecond            1.36
    Elapsed Cycles                      cycle                   4,057
    Memory [%]                          %                        1.20
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.98
    L1/TEX Cache Throughput             %                        3.75
    L2 Cache Throughput                 %                        1.20
    SM Active Cycles                    cycle                  753.69
    Compute (SM) [%]                    %                        5.77
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      24
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.46
    Achieved Active Warps Per SM        warp                     7.90
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.93
    SM Frequency                        cycle/nsecond            1.38
    Elapsed Cycles                      cycle                   4,188
    Memory [%]                          %                        1.35
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  3.04
    L1/TEX Cache Throughput             %                        3.80
    L2 Cache Throughput                 %                        1.32
    SM Active Cycles                    cycle                1,486.41
    Compute (SM) [%]                    %                       11.17
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      48
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 12,288
    Waves Per SM                                                 0.12
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.63
    Achieved Active Warps Per SM        warp                     7.98
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.42
    SM Frequency                        cycle/nsecond            1.15
    Elapsed Cycles                      cycle                   2,762
    Memory [%]                          %                        0.68
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.40
    L1/TEX Cache Throughput             %                       66.16
    L2 Cache Throughput                 %                        0.68
    SM Active Cycles                    cycle                   13.60
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                       1
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                     64
    Waves Per SM                                                 0.00
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        4.21
    Achieved Active Warps Per SM        warp                     2.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.24
    SM Frequency                        cycle/nsecond            1.14
    Elapsed Cycles                      cycle                   2,765
    Memory [%]                          %                        0.69
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.43
    L1/TEX Cache Throughput             %                       65.38
    L2 Cache Throughput                 %                        0.69
    SM Active Cycles                    cycle                   13.76
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                       1
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                     64
    Waves Per SM                                                 0.00
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        4.12
    Achieved Active Warps Per SM        warp                     1.98
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.74
    SM Frequency                        cycle/nsecond            1.36
    Elapsed Cycles                      cycle                   4,039
    Memory [%]                          %                        1.20
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.98
    L1/TEX Cache Throughput             %                        3.80
    L2 Cache Throughput                 %                        1.20
    SM Active Cycles                    cycle                  743.13
    Compute (SM) [%]                    %                        5.79
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      24
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors.
          This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.72
    Achieved Active Warps Per SM        warp                     8.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.63
    SM Frequency                        cycle/nsecond            1.33
    Elapsed Cycles                      cycle                   4,087
    Memory [%]                          %                        1.38
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  3.07
    L1/TEX Cache Throughput             %                        3.78
    L2 Cache Throughput                 %                        1.35
    SM Active Cycles                    cycle                1,495.85
    Compute (SM) [%]                    %                       11.44
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      48
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 12,288
    Waves Per SM                                                 0.12
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.50
    Achieved Active Warps Per SM        warp                     7.92
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.48
    SM Frequency                        cycle/nsecond            1.18
    Elapsed Cycles                      cycle                   2,787
    Memory [%]                          %                        0.68
    DRAM Throughput                     %                        0.02
    Duration                            usecond                  2.37
    L1/TEX Cache Throughput             %                       66.09
    L2 Cache Throughput                 %                        0.68
    SM Active Cycles                    cycle                   13.62
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.52
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,775
    Memory [%]                  %               0.68
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               66.16
    L2 Cache Throughput         %               0.68
    SM Active Cycles            cycle           13.60
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
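The occupancy numbers in the 64-thread fill reports above can be reproduced by hand. A small CUDA C++ host-code sketch, assuming the device allows at most 48 resident warps per SM (implied by 32 theoretical active warps corresponding to 66.67%); every other input is taken directly from the report:

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Values from the Occupancy section above (64-thread FillFunctor launch).
        const int blockLimitSM        = 16;
        const int blockLimitRegisters = 64;
        const int blockLimitSharedMem = 16;
        const int blockLimitWarps     = 24;
        const int warpsPerBlock       = 64 / 32;   // Block Size 64 -> 2 warps
        const int maxWarpsPerSM       = 48;        // assumed device limit (32 / 0.6667)

        int residentBlocks = std::min({blockLimitSM, blockLimitRegisters,
                                       blockLimitSharedMem, blockLimitWarps});   // 16
        int theoreticalWarps  = residentBlocks * warpsPerBlock;                  // 32
        double theoreticalOcc = 100.0 * theoreticalWarps / maxWarpsPerSM;        // 66.67 %
        double achievedOcc    = 100.0 * 2.00 / maxWarpsPerSM;                    // ~4.17 % (report: 4.16 %)

        printf("theoretical %.2f %%, achieved %.2f %%\n", theoreticalOcc, achievedOcc);
        return 0;
    }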
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.77
    SM Frequency                cycle/nsecond   1.36
    Elapsed Cycles              cycle           4,057
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               3.77
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           997.71
    Compute (SM) [%]            %               7.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           32
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            8,192
    Waves Per SM                                        0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.60
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.30
    SM Frequency                cycle/nsecond   1.14
    Elapsed Cycles              cycle           2,738
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.40
    L1/TEX Cache Throughput     %               67.92
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.25
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.12
    Achieved Active Warps Per SM      warp    1.98
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   9.04
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           11,581
    Memory [%]                  %               17.25
    DRAM Throughput             %               0.01
    Duration                    usecond         8.54
    L1/TEX Cache Throughput     %               19.64
    L2 Cache Throughput         %               17.25
    SM Active Cycles            cycle           9,456.69
    Compute (SM) [%]            %               55.49
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           408
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            104,448
    Waves Per SM                                        1

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       70.97
    Achieved Active Warps Per SM      warp    34.07
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.74
    SM Frequency                cycle/nsecond   1.36
    Elapsed Cycles              cycle           4,046
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               3.79
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           744.32
    Compute (SM) [%]            %               5.79
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           24
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            6,144
    Waves Per SM                                        0.06
    WRN  The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
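For the 256-thread launches, the per-resource block limits shown in the Occupancy sections (SM 16, Registers 8, Shared Mem 8, Warps 6) can be approximated from the launch statistics. A rough CUDA C++ sketch; the register file size, register allocation granularity, maximum resident blocks, and maximum warps per SM used below are typical values assumed for this class of GPU and are not stated anywhere in this report:

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Launch statistics from the 256-thread reports.
        const int blockSize       = 256;
        const int regsPerThread   = 27;
        const int smemConfigBytes = 8192;   // "Shared Memory Configuration Size 8.19 Kbyte"
        const int smemPerBlock    = 1024;   // "Driver Shared Memory Per Block 1.02 Kbyte/block"

        // Assumed hardware limits (not taken from the report).
        const int maxBlocksPerSM  = 16;
        const int maxWarpsPerSM   = 48;
        const int regsPerSM       = 65536;
        const int regAllocGranule = 256;    // registers assumed allocated per warp in chunks of 256

        int warpsPerBlock  = blockSize / 32;                                   // 8
        int regsPerWarp    = ((regsPerThread * 32 + regAllocGranule - 1)
                              / regAllocGranule) * regAllocGranule;            // 864 -> 1024
        int limitRegisters = regsPerSM / (regsPerWarp * warpsPerBlock);        // 8
        int limitSharedMem = smemConfigBytes / smemPerBlock;                   // 8
        int limitWarps     = maxWarpsPerSM / warpsPerBlock;                    // 6
        int limitSM        = maxBlocksPerSM;                                   // 16

        int residentBlocks = std::min({limitSM, limitRegisters, limitSharedMem, limitWarps});
        // 6 resident blocks * 8 warps = 48 warps, i.e. 100 % theoretical occupancy, as reported.
        printf("block limits: SM %d, registers %d, shared mem %d, warps %d -> %d resident blocks\n",
               limitSM, limitRegisters, limitSharedMem, limitWarps, residentBlocks);
        return 0;
    }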
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.58
    Achieved Active Warps Per SM      warp    7.96
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.72
    SM Frequency                cycle/nsecond   1.34
    Elapsed Cycles              cycle           4,090
    Memory [%]                  %               1.38
    DRAM Throughput             %               0.01
    Duration                    usecond         3.04
    L1/TEX Cache Throughput     %               3.80
    L2 Cache Throughput         %               1.35
    SM Active Cycles            cycle           1,487.81
    Compute (SM) [%]            %               11.45
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           48
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            12,288
    Waves Per SM                                        0.12
    WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.58
    Achieved Active Warps Per SM      warp    7.96
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.50
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,746
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               66.09
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.62
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
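The warning above repeatedly suggests giving every multiprocessor at least one block. A minimal CUDA C++ sketch of sizing a grid to one full wave with the occupancy API; dummy_kernel is a hypothetical stand-in, not one of the PyTorch kernels profiled here:

    #include <cstdio>

    __global__ void dummy_kernel(float *out, int n) {
        // Placeholder body; stands in for whatever work the real kernel does.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.f;
    }

    int main() {
        int device = 0, numSMs = 0, blocksPerSM = 0;
        const int blockSize = 256;

        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);   // 68 on this GPU
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel,
                                                      blockSize, /*dynamicSmem=*/0);

        // One full wave: every SM gets as many blocks as can be resident at once.
        int grid = numSMs * blocksPerSM;
        printf("launching %d blocks (%d SMs x %d blocks/SM)\n", grid, numSMs, blocksPerSM);

        float *out = nullptr;
        int n = grid * blockSize;
        cudaMalloc(&out, n * sizeof(float));
        dummy_kernel<<<grid, blockSize>>>(out, n);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }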
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.15
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.44
    SM Frequency                cycle/nsecond   1.16
    Elapsed Cycles              cycle           2,748
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               65.45
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.75
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.51
    Achieved Active Warps Per SM      warp    2.17
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.68
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           4,048
    Memory [%]                  %               1.20
    DRAM Throughput             %               0
    Duration                    usecond         3.01
    L1/TEX Cache Throughput     %               3.79
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           744.22
    Compute (SM) [%]            %               5.77
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           24
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            6,144
    Waves Per SM                                        0.06
    WRN  The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors.
         This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.68
    Achieved Active Warps Per SM      warp    8.01
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.60
    SM Frequency                cycle/nsecond   1.33
    Elapsed Cycles              cycle           4,083
    Memory [%]                  %               1.39
    DRAM Throughput             %               0.01
    Duration                    usecond         3.07
    L1/TEX Cache Throughput     %               3.77
    L2 Cache Throughput         %               1.35
    SM Active Cycles            cycle           1,495.96
    Compute (SM) [%]            %               11.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
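The name distribution_elementwise_grid_stride_kernel indicates a grid-stride loop, which is why a launch of only 24 or 48 blocks still covers every element of the tensor while leaving most of the 68 SMs idle. A generic sketch of the pattern, not PyTorch's actual implementation:

    #include <cstdio>

    // Generic grid-stride pattern: each thread handles indices i, i + stride, i + 2*stride, ...
    // so correctness does not depend on the grid being large enough for one thread per element.
    __global__ void scale_grid_stride(float *data, float factor, int n) {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] *= factor;
        }
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;
        cudaMalloc(&data, n * sizeof(float));
        cudaMemset(data, 0, n * sizeof(float));

        // A deliberately small grid (48 blocks of 256 threads, as in the reports above)
        // still processes all n elements; it just underutilizes the 68 SMs.
        scale_grid_stride<<<48, 256>>>(data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }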
  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           48
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            12,288
    Waves Per SM                                        0.12
    WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.61
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.44
    SM Frequency                cycle/nsecond   1.16
    Elapsed Cycles              cycle           2,755
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               65.38
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.76
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.63
    SM Frequency                cycle/nsecond   1.18
    Elapsed Cycles              cycle           2,754
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               66.23
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.59
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
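For the single-block, 64-thread fill launches above, the block size itself caps theoretical occupancy at 66.7%. A CUDA C++ sketch that lets the runtime suggest a block size via cudaOccupancyMaxPotentialBlockSize; fill_value is a hypothetical stand-in for at::native::FillFunctor:

    #include <cstdio>

    __global__ void fill_value(float *out, float value, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = value;
    }

    int main() {
        int minGridSize = 0, blockSize = 0;

        // Ask the runtime which block size maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, fill_value, 0, 0);
        printf("suggested block size %d, minimum grid for full occupancy %d\n",
               blockSize, minGridSize);

        const int n = 1 << 20;
        float *out = nullptr;
        cudaMalloc(&out, n * sizeof(float));

        int grid = (n + blockSize - 1) / blockSize;
        fill_value<<<grid, blockSize>>>(out, 1.0f, n);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }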
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.59
    SM Frequency                cycle/nsecond   1.34
    Elapsed Cycles              cycle           4,076
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         3.04
    L1/TEX Cache Throughput     %               3.80
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           992
    Compute (SM) [%]            %               7.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           32
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            8,192
    Waves Per SM                                        0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.81
    Achieved Active Warps Per SM      warp    8.07
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.45
    SM Frequency                cycle/nsecond   1.15
    Elapsed Cycles              cycle           2,699
    Memory [%]                  %               0.70
    DRAM Throughput             %               0.01
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               67.18
    L2 Cache Throughput         %               0.70
    SM Active Cycles            cycle           13.40
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.74
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           4,034
    Memory [%]                  %               1.19
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               4.89
    L2 Cache Throughput         %               1.19
    SM Active Cycles            cycle           186
    Compute (SM) [%]            %               1.45
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           6
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            1,536
    Waves Per SM                                        0.01
    WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.61
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.45
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,737
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               67.25
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.38
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.04
    Achieved Active Warps Per SM        warp                 1.94
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.74
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,029
    Memory [%]                          %                    1.19
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.98
    L1/TEX Cache Throughput             %                    4.86
    L2 Cache Throughput                 %                    1.19
    SM Active Cycles                    cycle                187.18
    Compute (SM) [%]                    %                    1.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                6
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,536
    Waves Per SM                                             0.01
  WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.91
    Achieved Active Warps Per SM        warp                 8.12
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.36
    SM Frequency                        cycle/nsecond        1.16
    Elapsed Cycles                      cycle                2,703
    Memory [%]                          %                    0.82
    DRAM Throughput                     %                    0.02
    Duration                            usecond              2.34
    L1/TEX Cache Throughput             %                    67.48
    L2 Cache Throughput                 %                    0.82
    SM Active Cycles                    cycle                13.34
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.06
    Achieved Active Warps Per SM        warp                 1.95
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.04
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                8,385
    Memory [%]                          %                    13.66
    DRAM Throughput                     %                    0.00
    Duration                            usecond              6.18
    L1/TEX Cache Throughput             %                    15.34
    L2 Cache Throughput                 %                    13.66
    SM Active Cycles                    cycle                6,309.54
    Compute (SM) [%]                    %                    48.26
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    71.84
    Achieved Active Warps Per SM        warp                 34.48
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.80
    SM Frequency                        cycle/nsecond        1.50
    Elapsed Cycles                      cycle                5,344
    Memory [%]                          %                    2.65
    DRAM Throughput                     %                    0.00
    Duration                            usecond              3.55
    L1/TEX Cache Throughput             %                    5.35
    L2 Cache Throughput                 %                    2.20
    SM Active Cycles                    cycle                2,638.31
    Compute (SM) [%]                    %                    21.78
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                120
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               30,720
    Waves Per SM                                             0.29
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 block per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    28.34
    Achieved Active Warps Per SM        warp                 13.60
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.63
    SM Frequency                        cycle/nsecond        1.33
    Elapsed Cycles                      cycle                4,091
    Memory [%]                          %                    1.38
    DRAM Throughput                     %                    0.01
    Duration                            usecond              3.07
    L1/TEX Cache Throughput             %                    3.77
    L2 Cache Throughput                 %                    1.34
    SM Active Cycles                    cycle                1,499.13
    Compute (SM) [%]                    %                    11.44
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
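Note: the __syncthreads() advisory above applies to barrier-heavy kernels. The sketch below is a generic shared-memory block reduction, not the profiled PyTorch kernel, launched with at least two resident blocks per multiprocessor so that blocks not waiting at a barrier can keep each SM busy; block_sum and launch_block_sum are hypothetical names.

    // Sketch only: barrier-synchronized block reduction with >= 2 blocks per SM.
    #include <cuda_runtime.h>
    #include <algorithm>

    __global__ void block_sum(const float *in, float *block_out, int n) {
        extern __shared__ float smem[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        smem[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // every thread in the block waits here

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) smem[tid] += smem[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_out[blockIdx.x] = smem[0];   // one partial sum per block
    }

    void launch_block_sum(const float *in, float *block_out, int n) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        const int block = 256;
        int grid = (n + block - 1) / block;
        // Aim for more than one block per multiprocessor, as the warning suggests;
        // block_out must have room for `grid` partial sums.
        grid = std::max(grid, 2 * prop.multiProcessorCount);
        block_sum<<<grid, block, block * sizeof(float)>>>(in, block_out, n);
    }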
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                48
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               12,288
    Waves Per SM                                             0.12
  WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.47
    Achieved Active Warps Per SM        warp                 7.91
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.39
    SM Frequency                        cycle/nsecond        1.16
    Elapsed Cycles                      cycle                2,751
    Memory [%]                          %                    0.81
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    64.90
    L2 Cache Throughput                 %                    0.81
    SM Active Cycles                    cycle                13.87
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.13
    Achieved Active Warps Per SM        warp                 1.98
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.52
    SM Frequency                        cycle/nsecond        1.17
    Elapsed Cycles                      cycle                2,773
    Memory [%]                          %                    0.68
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    66.16
    L2 Cache Throughput                 %                    0.68
    SM Active Cycles                    cycle                13.60
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.21
    Achieved Active Warps Per SM        warp                 2.02
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
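Note: the 66.7% theoretical occupancy reported for these 64-thread fill launches follows from the block limits in the Occupancy sections: at most 16 resident blocks per SM x 2 warps per block = 32 of the 48 schedulable warps. The CUDA occupancy API can reproduce this arithmetic; the sketch below uses a hypothetical stand-in kernel (fill64), not PyTorch's FillFunctor kernel.

    // Sketch only: query resident blocks per SM and derive theoretical occupancy
    // for a few block sizes, using the CUDA occupancy API.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void fill64(float *out, float value, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = value;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 48 on this GPU

        const int sizes[] = {64, 128, 256};
        for (int block : sizes) {
            int blocksPerSM = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fill64, block, 0);
            int warps = blocksPerSM * block / prop.warpSize;
            printf("block=%3d -> %2d resident blocks/SM, theoretical occupancy %.1f%%\n",
                   block, blocksPerSM, 100.0f * warps / maxWarpsPerSM);
        }

        // The runtime can also suggest a block size that maximizes occupancy.
        int minGrid = 0, bestBlock = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &bestBlock, fill64, 0, 0);
        printf("suggested block size %d (minimum grid for a full device: %d)\n",
               bestBlock, minGrid);
        return 0;
    }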

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.18
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                25,032
    Memory [%]                          %                    52.35
    DRAM Throughput                     %                    52.35
    Duration                            usecond              18.34
    L1/TEX Cache Throughput             %                    35.82
    L2 Cache Throughput                 %                    30.73
    SM Active Cycles                    cycle                22,853.10
    Compute (SM) [%]                    %                    65.97
  WRN  Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    69.66
    Achieved Active Warps Per SM        warp                 33.43
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (69.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
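Note: a rough way to read the "Waves Per SM" figures in these launches (assumption: waves is approximately grid size / (resident blocks per SM x number of SMs)). With 6 resident 256-thread blocks per SM and 68 SMs, a 408-block grid is exactly one full wave, which matches the reported values: 408 blocks -> 1 wave, 120 -> 0.29, 48 -> 0.12, 6 -> 0.01.

    // Sketch only: back-of-the-envelope waves-per-SM check for the grid sizes
    // seen in this profile (68 SMs, 6 resident 256-thread blocks per SM).
    #include <cstdio>

    int main() {
        const int numSMs = 68;        // multiprocessor count reported for this GPU
        const int blocksPerSM = 6;    // block limit from the Occupancy sections (48 warps / 8 warps per block)
        const int grids[] = {408, 120, 48, 6};
        for (int g : grids)
            printf("grid %3d -> %.2f waves per SM\n", g, (float)g / (blocksPerSM * numSMs));
        return 0;
    }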

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.49
    SM Frequency                        cycle/nsecond        1.17
    Elapsed Cycles                      cycle                2,847
    Memory [%]                          %                    1.78
    DRAM Throughput                     %                    0.02
    Duration                            usecond              2.43
    L1/TEX Cache Throughput             %                    3.92
    L2 Cache Throughput                 %                    1.78
    SM Active Cycles                    cycle                559.21
    Compute (SM) [%]                    %                    0.26
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                43
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,752
    Waves Per SM                                             0.04
  WRN  The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.13
    Achieved Active Warps Per SM        warp                 1.98
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.67
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,011
    Memory [%]                          %                    1.20
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.98
    L1/TEX Cache Throughput             %                    4.82
    L2 Cache Throughput                 %                    1.20
    SM Active Cycles                    cycle                188.38
    Compute (SM) [%]                    %                    1.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                6
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,536
    Waves Per SM                                             0.01
  WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.40
    Achieved Active Warps Per SM        warp                 7.87
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.45
    SM Frequency                        cycle/nsecond        1.15
    Elapsed Cycles                      cycle                2,694
    Memory [%]                          %                    0.70
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.34
    L1/TEX Cache Throughput             %                    67.25
    L2 Cache Throughput                 %                    0.70
    SM Active Cycles                    cycle                13.38
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.07
    Achieved Active Warps Per SM        warp                 1.95
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.99
    SM Frequency                        cycle/nsecond        1.34
    Elapsed Cycles                      cycle                8,375
    Memory [%]                          %                    13.68
    DRAM Throughput                     %                    0.00
    Duration                            usecond              6.21
    L1/TEX Cache Throughput             %                    15.35
    L2 Cache Throughput                 %                    13.68
    SM Active Cycles                    cycle                6,309.60
    Compute (SM) [%]                    %                    48.28
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    71.81
    Achieved Active Warps Per SM        warp                 34.47
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.86
    SM Frequency                        cycle/nsecond        1.51
    Elapsed Cycles                      cycle                5,405
    Memory [%]                          %                    2.62
    DRAM Throughput                     %                    0.00
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    5.34
    L2 Cache Throughput                 %                    2.19
    SM Active Cycles                    cycle                2,641.38
    Compute (SM) [%]                    %                    21.51
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                120
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               30,720
    Waves Per SM                                             0.29
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 block per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    28.46
    Achieved Active Warps Per SM        warp                 13.66
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.63
    SM Frequency                        cycle/nsecond        1.33
    Elapsed Cycles                      cycle                4,089
    Memory [%]                          %                    1.38
    DRAM Throughput                     %                    0.01
    Duration                            usecond              3.07
    L1/TEX Cache Throughput             %                    3.77
    L2 Cache Throughput                 %                    1.35
    SM Active Cycles                    cycle                1,496.04
    Compute (SM) [%]                    %                    11.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                48
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               12,288
    Waves Per SM                                             0.12
  WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.60
    Achieved Active Warps Per SM        warp                 7.97
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.39
    SM Frequency                        cycle/nsecond        1.15
    Elapsed Cycles                      cycle                2,726
    Memory [%]                          %                    0.69
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    63.75
    L2 Cache Throughput                 %                    0.69
    SM Active Cycles                    cycle                14.12
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.01
    Achieved Active Warps Per SM        warp                 1.92
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.30
    SM Frequency                        cycle/nsecond        1.14
    Elapsed Cycles                      cycle                2,743
    Memory [%]                          %                    0.69
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.40
    L1/TEX Cache Throughput             %                    66.09
    L2 Cache Throughput                 %                    0.69
    SM Active Cycles                    cycle                13.62
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.16
    Achieved Active Warps Per SM        warp                 1.99
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.17
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                25,082
    Memory [%]                          %                    52.30
    DRAM Throughput                     %                    52.30
    Duration                            usecond              18.37
    L1/TEX Cache Throughput             %                    35.78
    L2 Cache Throughput                 %                    30.72
    SM Active Cycles                    cycle                22,831.90
    Compute (SM) [%]                    %                    65.90
  WRN  Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.
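Note: the "moved to look-up tables" suggestion above refers to caching values of a repeated, bounded computation instead of recomputing them in every thread. The sketch below is a generic illustration (a constant-memory table indexed by a byte input), not a change to the profiled PyTorch normal_kernel; apply_exp_lut and init_exp_table are hypothetical names.

    // Sketch only: precompute a small function of a bounded integer input once
    // on the host and read it from constant memory instead of re-evaluating it
    // per thread.
    #include <cuda_runtime.h>
    #include <cmath>

    #define TABLE_SIZE 256
    __constant__ float d_expTable[TABLE_SIZE];

    __global__ void apply_exp_lut(const unsigned char *in, float *out, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            out[i] = d_expTable[in[i]];   // table read instead of expf(in[i] / 255.0f)
        }
    }

    void init_exp_table() {
        float h_table[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; ++i)
            h_table[i] = std::exp(i / 255.0f);
        cudaMemcpyToSymbol(d_expTable, h_table, sizeof(h_table));
    }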
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       70.64
    Achieved Active Warps Per SM         warp                    33.91
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (70.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.54
    SM Frequency                         cycle/nsecond           1.17
    Elapsed Cycles                       cycle                   2,855
    Memory [%]                           %                       1.77
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.43
    L1/TEX Cache Throughput              %                       4.04
    L2 Cache Throughput                  %                       1.77
    SM Active Cycles                     cycle                   542.82
    Compute (SM) [%]                     %                       0.26
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    43
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  2,752
    Waves Per SM                                                 0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.18
    Achieved Active Warps Per SM         warp                    2.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           9.00
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   11,534
    Memory [%]                           %                       17.31
    DRAM Throughput                      %                       0.18
    Duration                             usecond                 8.54
    L1/TEX Cache Throughput              %                       19.72
    L2 Cache Throughput                  %                       17.31
    SM Active Cycles                     cycle                   9,463.37
    Compute (SM) [%]                     %                       55.72
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       71.31
    Achieved Active Warps Per SM         warp                    34.23
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.77
    SM Frequency                         cycle/nsecond           1.36
    Elapsed Cycles                       cycle                   4,049
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.78
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   747.21
    Compute (SM) [%]                     %                       5.78
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    24
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.50
    Achieved Active Warps Per SM         warp                    7.92
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
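The "Waves Per SM 0.06" figure for this 24-block launch can be approximated from numbers already in the report. A minimal sketch; the formula grid_size / (SM_count * max_blocks_per_SM) is an approximation of what Nsight Compute reports rather than its documented definition, and the SM count of 68 is taken from the launch warning above:

    NUM_SMS = 68                 # "the GPU's 68 multiprocessors" from the warning above
    MAX_BLOCKS_PER_SM = 6        # "Block Limit Warps" for this 256-thread kernel
    grid_size = 24               # "Grid Size" in the Launch Statistics above

    waves_per_sm = grid_size / (NUM_SMS * MAX_BLOCKS_PER_SM)
    print(round(waves_per_sm, 2))   # ~0.06, in line with "Waves Per SM" above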
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.63
    SM Frequency                         cycle/nsecond           1.32
    Elapsed Cycles                       cycle                   4,073
    Memory [%]                           %                       1.39
    DRAM Throughput                      %                       0
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.78
    L2 Cache Throughput                  %                       1.35
    SM Active Cycles                     cycle                   1,495.24
    Compute (SM) [%]                     %                       11.50
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    48
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  12,288
    Waves Per SM                                                 0.12
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.59
    Achieved Active Warps Per SM         warp                    7.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.54
    SM Frequency                         cycle/nsecond           1.18
    Elapsed Cycles                       cycle                   2,753
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.34
    L1/TEX Cache Throughput              %                       63.22
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   14.24
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       3.97
    Achieved Active Warps Per SM         warp                    1.90
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.39
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,748
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       65.52
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.74
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.12
    Achieved Active Warps Per SM         warp                    1.98
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
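The achieved-occupancy percentages in these sections track the "Achieved Active Warps Per SM" row directly. A tiny sketch for the last fill launch above; dividing by a 48-warp-per-SM limit is an assumption consistent with the 100%-theoretical-occupancy sections in this log:

    MAX_WARPS_PER_SM = 48            # assumed device limit (48 warps = 100% elsewhere in this log)
    achieved_active_warps = 1.98     # "Achieved Active Warps Per SM" from the section above
    achieved_occupancy = 100.0 * achieved_active_warps / MAX_WARPS_PER_SM
    print(f"{achieved_occupancy:.1f}%")   # ~4.1%, in line with the 4.12% reported above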
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.67
    SM Frequency                         cycle/nsecond           1.35
    Elapsed Cycles                       cycle                   4,030
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.77
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   748.04
    Compute (SM) [%]                     %                       5.81
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    24
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.60
    Achieved Active Warps Per SM         warp                    7.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.67
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   4,117
    Memory [%]                           %                       1.37
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.57
    L2 Cache Throughput                  %                       1.34
    SM Active Cycles                     cycle                   1,583.29
    Compute (SM) [%]                     %                       11.36
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    48
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  12,288
    Waves Per SM                                                 0.12
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       15.69
    Achieved Active Warps Per SM         warp                    7.53
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.35
    SM Frequency                         cycle/nsecond           1.15
    Elapsed Cycles                       cycle                   2,729
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       66.02
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.63
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.15
    Achieved Active Warps Per SM         warp                    1.99
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.44
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,748
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.02
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       66.31
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.57
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.18
    Achieved Active Warps Per SM         warp                    2.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.77
    SM Frequency                         cycle/nsecond           1.37
    Elapsed Cycles                       cycle                   4,072
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.77
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   998.87
    Compute (SM) [%]                     %                       7.66
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    32
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  8,192
    Waves Per SM                                                 0.08
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.67
    Achieved Active Warps Per SM         warp                    8.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.16
    SM Frequency                         cycle/nsecond           1.11
    Elapsed Cycles                       cycle                   2,701
    Memory [%]                           %                       0.70
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.43
    L1/TEX Cache Throughput              %                       67.77
    L2 Cache Throughput                  %                       0.70
    SM Active Cycles                     cycle                   13.28
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
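For the recurring "1 blocks vs. 68 multiprocessors" warnings, the SM count can also be read from the profiled Python process itself. A small sketch assuming PyTorch with a visible CUDA device (the grid size of 1 is taken from the single-block fill launches above):

    import torch

    props = torch.cuda.get_device_properties(0)
    num_sms = props.multi_processor_count      # 68 on the GPU profiled in this log
    grid_size = 1                              # the single-block fill launch above
    idle_sms = max(num_sms - grid_size, 0)
    print(f"{grid_size} block(s) across {num_sms} SMs -> at least {idle_sms} SMs with no resident block")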
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.10
    Achieved Active Warps Per SM         warp                    1.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.70
    SM Frequency                         cycle/nsecond           1.35
    Elapsed Cycles                       cycle                   3,990
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0
    Duration                             usecond                 2.94
    L1/TEX Cache Throughput              %                       4.88
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   186.38
    Compute (SM) [%]                     %                       1.46
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    6
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  1,536
    Waves Per SM                                                 0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.65
    Achieved Active Warps Per SM         warp                    7.99
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.41
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,715
    Memory [%]                           %                       0.70
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.34
    L1/TEX Cache Throughput              %                       67.40
    L2 Cache Throughput                  %                       0.70
    SM Active Cycles                     cycle                   13.35
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.16
    Achieved Active Warps Per SM         warp                    2.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.96
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   8,343
    Memory [%]                           %                       13.69
    DRAM Throughput                      %                       0.00
    Duration                             usecond                 6.21
    L1/TEX Cache Throughput              %                       15.41
    L2 Cache Throughput                  %                       13.69
    SM Active Cycles                     cycle                   6,298.04
    Compute (SM) [%]                     %                       48.47
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       72.14
    Achieved Active Warps Per SM         warp                    34.63
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (72.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.80
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   4,827
    Memory [%]                           %                       2.93
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.58
    L1/TEX Cache Throughput              %                       5.30
    L2 Cache Throughput                  %                       2.44
    SM Active Cycles                     cycle                   2,665.75
    Compute (SM) [%]                     %                       24.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    120
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  30,720
    Waves Per SM                                                 0.29
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       28.27
    Achieved Active Warps Per SM         warp                    13.57
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.60
    SM Frequency                         cycle/nsecond           1.33
    Elapsed Cycles                       cycle                   4,084
    Memory [%]                           %                       1.38
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.75
    L2 Cache Throughput                  %                       1.34
    SM Active Cycles                     cycle                   1,504.85
    Compute (SM) [%]                     %                       11.45
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         48
  Registers Per Thread                      register/thread                         27
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                              12,288
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    16.41
  Achieved Active Warps Per SM              warp                                  7.88
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:25, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.54
  SM Frequency                              cycle/nsecond                         1.18
  Elapsed Cycles                            cycle                                2,746
  Memory [%]                                %                                     0.69
  DRAM Throughput                           %                                     0.02
  Duration                                  usecond                               2.34
  L1/TEX Cache Throughput                   %                                    66.02
  L2 Cache Throughput                       %                                     0.69
  SM Active Cycles                          cycle                                13.63
  Compute (SM) [%]                          %                                     0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
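
The warning's "reduce the block size" suggestion is just arithmetic on the Threads row above; halving the block size is only legal if the kernel does not assume 256 threads per block, so treat this as illustrative:

\[
\frac{12{,}288\ \text{threads}}{256\ \text{threads/block}} = 48\ \text{blocks} < 68\ \text{SMs},
\qquad
\frac{12{,}288\ \text{threads}}{128\ \text{threads/block}} = 96\ \text{blocks} \ge 68\ \text{SMs}.
\]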
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          1
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                                  64
  Waves Per SM                                                                    0.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     4.16
  Achieved Active Warps Per SM              warp                                  2.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:25, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.48
  SM Frequency                              cycle/nsecond                         1.16
  Elapsed Cycles                            cycle                                2,756
  Memory [%]                                %                                     0.69
  DRAM Throughput                           %                                     0.01
  Duration                                  usecond                               2.37
  L1/TEX Cache Throughput                   %                                    65.11
  L2 Cache Throughput                       %                                     0.69
  SM Active Cycles                          cycle                                13.82
  Compute (SM) [%]                          %                                     0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          1
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                                  64
  Waves Per SM                                                                    0.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     4.14
  Achieved Active Warps Per SM              warp                                  1.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
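
The 66.67% ceiling reported for these single-block fill launches follows from the block-limit rows above: the binding limit is 16 resident blocks per SM, each block of 64 threads contributes 2 warps, and the report measures occupancy against the 48-warp per-SM capacity it uses elsewhere:

\[
\min(16,\,64,\,16,\,24) \times \frac{64}{32} = 16 \times 2 = 32\ \text{warps},
\qquad
\frac{32}{48} \approx 66.7\%.
\]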
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         9.21
  SM Frequency                              cycle/nsecond                         1.36
  Elapsed Cycles                            cycle                               25,079
  Memory [%]                                %                                    52.12
  DRAM Throughput                           %                                    52.12
  Duration                                  usecond                              18.34
  L1/TEX Cache Throughput                   %                                    35.76
  L2 Cache Throughput                       %                                    30.61
  SM Active Cycles                          cycle                            22,854.72
  Compute (SM) [%]                          %                                    65.86
  ---------------------------------------------------- --------------- ------------------------------
  WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        408
  Registers Per Thread                      register/thread                         27
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                             104,448
  Waves Per SM                                                                       1
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    69.72
  Achieved Active Warps Per SM              warp                                 33.46
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (69.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.58
  SM Frequency                              cycle/nsecond                         1.19
  Elapsed Cycles                            cycle                                2,887
  Memory [%]                                %                                     1.77
  DRAM Throughput                           %                                     0.01
  Duration                                  usecond                               2.43
  L1/TEX Cache Throughput                   %                                     3.85
  L2 Cache Throughput                       %                                     1.77
  SM Active Cycles                          cycle                               570.18
  Compute (SM) [%]                          %                                     0.26
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         43
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,752
  Waves Per SM                                                                    0.04
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     3.97
  Achieved Active Warps Per SM              warp                                  1.90
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.95
  SM Frequency                              cycle/nsecond                         1.22
  Elapsed Cycles                            cycle                                6,768
  Memory [%]                                %                                     9.18
  DRAM Throughput                           %                                     2.83
  Duration                                  usecond                               5.54
  L1/TEX Cache Throughput                   %                                     8.93
  L2 Cache Throughput                       %                                     9.18
  SM Active Cycles                          cycle                             4,508.90
  Compute (SM) [%]                          %                                    25.46
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.7 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        544
  Registers Per Thread                      register/thread                         32
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                              69,632
  Waves Per SM                                                                    0.67
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   16
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    55.75
  Achieved Active Warps Per SM              warp                                 26.76
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (55.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.20
  SM Frequency                              cycle/nsecond                         1.25
  Elapsed Cycles                            cycle                                6,046
  Memory [%]                                %                                    15.63
  DRAM Throughput                           %                                    15.51
  Duration                                  usecond                               4.83
  L1/TEX Cache Throughput                   %                                    10.38
  L2 Cache Throughput                       %                                    15.63
  SM Active Cycles                          cycle                             3,739.81
  Compute (SM) [%]                          %                                    24.79
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        408
  Registers Per Thread                      register/thread                         28
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                             104,448
  Waves Per SM                                                                       1
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    77.63
  Achieved Active Warps Per SM              warp                                 37.26
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (77.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.62
  SM Frequency                              cycle/nsecond                         1.19
  Elapsed Cycles                            cycle                                2,894
  Memory [%]                                %                                     2.53
  DRAM Throughput                           %                                     0.02
  Duration                                  usecond                               2.43
  L1/TEX Cache Throughput                   %                                     3.80
  L2 Cache Throughput                       %                                     2.53
  SM Active Cycles                          cycle                               845.90
  Compute (SM) [%]                          %                                     0.37
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         64
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               4,096
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     3.94
  Achieved Active Warps Per SM              warp                                  1.89
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (3.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.32
  SM Frequency                              cycle/nsecond                         1.29
  Elapsed Cycles                            cycle                                5,709
  Memory [%]                                %                                     1.30
  DRAM Throughput                           %                                     0.84
  Duration                                  usecond                               4.42
  L1/TEX Cache Throughput                   %                                    31.64
  L2 Cache Throughput                       %                                     1.30
  SM Active Cycles                          cycle                               147.24
  Compute (SM) [%]                          %                                     0.82
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.11
  Achieved Active Warps Per SM              warp                                  7.73
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.16
  SM Frequency                              cycle/nsecond                         1.27
  Elapsed Cycles                            cycle                                6,062
  Memory [%]                                %                                     1.71
  DRAM Throughput                           %                                     1.71
  Duration                                  usecond                               4.77
  L1/TEX Cache Throughput                   %                                    36.39
  L2 Cache Throughput                       %                                     1.69
  SM Active Cycles                          cycle                               186.21
  Compute (SM) [%]                          %                                     1.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
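
The transpose launches in this part of the report stop at 83.3% theoretical occupancy because of registers: 48 registers/thread x 256 threads is 12,288 registers per block, which (assuming the usual 64K-register file per SM) fits only 5 blocks, i.e. 40 of 48 warps, exactly the Block Limit Registers and Theoretical Active Warps rows above. The cuBLAS kernel here is prebuilt and cannot be retuned, but for kernels compiled from source __launch_bounds__ is the standard way to ask the compiler for a register budget; a conventional tiled-transpose sketch, illustrative only:

    // Ask nvcc to fit 6 blocks of 256 threads per SM (roughly 42 registers/thread
    // on a 64K-register SM); launch with blockDim = (32, 8), gridDim covering the matrix.
    __global__ void __launch_bounds__(256, 6)
    tile_transpose(float *dst, const float *src, int rows, int cols) {
        __shared__ float tile[32][33];                 // +1 column avoids bank conflicts
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        for (int j = 0; j < 32; j += 8)                // each thread loads 4 tile rows
            if (x < cols && y + j < rows)
                tile[threadIdx.y + j][threadIdx.x] = src[(y + j) * cols + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;             // transposed block offsets
        y = blockIdx.x * 32 + threadIdx.y;
        for (int j = 0; j < 32; j += 8)
            if (x < rows && y + j < cols)
                dst[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }

Whether the extra residency helps depends on whether the kernel is actually occupancy-bound; with a 3-block grid, as here, the grid size dominates long before registers do.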
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.05
  Achieved Active Warps Per SM              warp                                  7.70
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.88
  SM Frequency                              cycle/nsecond                         1.23
  Elapsed Cycles                            cycle                                5,190
  Memory [%]                                %                                     1.43
  DRAM Throughput                           %                                     0.93
  Duration                                  usecond                               4.22
  L1/TEX Cache Throughput                   %                                    31.99
  L2 Cache Throughput                       %                                     1.43
  SM Active Cycles                          cycle                               145.62
  Compute (SM) [%]                          %                                     0.90
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.55
  Achieved Active Warps Per SM              warp                                  7.94
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.08
  SM Frequency                              cycle/nsecond                         1.26
  Elapsed Cycles                            cycle                                5,942
  Memory [%]                                %                                     1.65
  DRAM Throughput                           %                                     1.62
  Duration                                  usecond                               4.70
  L1/TEX Cache Throughput                   %                                    38.02
  L2 Cache Throughput                       %                                     1.65
  SM Active Cycles                          cycle                               178.24
  Compute (SM) [%]                          %                                     1.14
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.63
  Achieved Active Warps Per SM              warp                                  7.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.11
  SM Frequency                              cycle/nsecond                         1.26
  Elapsed Cycles                            cycle                                7,122
  Memory [%]                                %                                     4.92
  DRAM Throughput                           %                                     1.57
  Duration                                  usecond                               5.63
  L1/TEX Cache Throughput                   %                                    24.12
  L2 Cache Throughput                       %                                     4.92
  SM Active Cycles                          cycle                               935.53
  Compute (SM) [%]                          %                                     2.74
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         16
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,048
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.35
  Achieved Active Warps Per SM              warp                                  4.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.27
  SM Frequency                              cycle/nsecond                         1.29
  Elapsed Cycles                            cycle                                7,256
  Memory [%]                                %                                     4.83
  DRAM Throughput                           %                                     1.54
  Duration                                  usecond                               5.63
  L1/TEX Cache Throughput                   %                                    23.78
  L2 Cache Throughput                       %                                     4.83
  SM Active Cycles                          cycle                               950.69
  Compute (SM) [%]                          %                                     2.69
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         16
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,048
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.32
  Achieved Active Warps Per SM              warp                                  3.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
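
The 16.67% ceiling for these cutlass kernels follows from shared memory alone: each block requests 49.15 Kbyte of dynamic shared memory plus the 1.02 Kbyte driver allocation, against the 102.40 Kbyte configuration shown above, and a 128-thread block contributes 4 warps out of the 48-warp capacity the report uses:

\[
\left\lfloor \frac{102.40}{49.15 + 1.02} \right\rfloor = 2\ \text{blocks/SM},
\qquad
2 \times \frac{128}{32} = 8\ \text{warps},
\qquad
\frac{8}{48} \approx 16.7\%.
\]

For a GEMM that deliberately trades occupancy for a large shared-memory tile, this is usually intentional; the more relevant problem in these launches is the 8-to-16-block grids.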
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.51
  SM Frequency                              cycle/nsecond                         1.33
  Elapsed Cycles                            cycle                                8,773
  Memory [%]                                %                                     3.28
  DRAM Throughput                           %                                     1.83
  Duration                                  usecond                               6.59
  L1/TEX Cache Throughput                   %                                    24.97
  L2 Cache Throughput                       %                                     3.28
  SM Active Cycles                          cycle                               586.49
  Compute (SM) [%]                          %                                     1.41
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.29
  Achieved Active Warps Per SM              warp                                  3.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:29, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.87
  SM Frequency                              cycle/nsecond                         1.38
  Elapsed Cycles                            cycle                                5,056
  Memory [%]                                %                                     8.93
  DRAM Throughput                           %                                     8.93
  Duration                                  usecond                               3.65
  L1/TEX Cache Throughput                   %                                    10.45
  L2 Cache Throughput                       %                                     7.89
  SM Active Cycles                          cycle                             1,369.06
  Compute (SM) [%]                          %                                     2.91
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    15.93
  Achieved Active Warps Per SM              warp                                  7.65
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
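
For this GRU kernel the Achieved Occupancy row is simply the ratio of its neighbours: 7.65 achieved active warps against the 48-warp capacity is about 15.9%. With only 32 blocks launched on 68 multiprocessors, no SM ever holds more than one block of this kernel and more than half hold none, so the gap to the 100% theoretical figure is a launch-size effect rather than a resource limit.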
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:29, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.92
  SM Frequency                              cycle/nsecond                         1.23
  Elapsed Cycles                            cycle                                8,302
  Memory [%]                                %                                     3.47
  DRAM Throughput                           %                                     1.92
  Duration                                  usecond                               6.72
  L1/TEX Cache Throughput                   %                                    25.39
  L2 Cache Throughput                       %                                     3.47
  SM Active Cycles                          cycle                               577.35
  Compute (SM) [%]                          %                                     1.49
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.32
  Achieved Active Warps Per SM              warp                                  3.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:29, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.65
  SM Frequency                              cycle/nsecond                         1.35
  Elapsed Cycles                            cycle                                4,926
  Memory [%]                                %                                     9.16
  DRAM Throughput                           %                                     9.16
  Duration                                  usecond                               3.65
  L1/TEX Cache Throughput                   %                                    10.94
  L2 Cache Throughput                       %                                     8.10
  SM Active Cycles                          cycle                             1,307.31
  Compute (SM) [%]                          %                                     2.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    16.85
  Achieved Active Warps Per SM              warp                                  8.09
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:30, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.40
  SM Frequency                              cycle/nsecond                         1.31
  Elapsed Cycles                            cycle                                8,716
  Memory [%]                                %                                     3.30
  DRAM Throughput                           %                                     1.83
  Duration                                  usecond                               6.66
  L1/TEX Cache Throughput                   %                                    24.98
  L2 Cache Throughput                       %                                     3.30
  SM Active Cycles                          cycle                               585.04
  Compute (SM) [%]                          %                                     1.42
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.29
  Achieved Active Warps Per SM              warp                                  3.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:30, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.63
  SM Frequency                              cycle/nsecond                         1.34
  Elapsed Cycles                            cycle                                4,993
  Memory [%]                                %                                     9.01
  DRAM Throughput                           %                                     9.01
  Duration                                  usecond                               3.71
  L1/TEX Cache Throughput                   %                                    10.37
  L2 Cache Throughput                       %                                     7.97
  SM Active Cycles                          cycle                             1,380.16
  Compute (SM) [%]                          %                                     2.95
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    15.99
  Achieved Active Warps Per SM              warp                                  7.68
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:30, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.20 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,414 cycle | Memory [%]: 3.42 | DRAM Throughput: 1.90 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 25.06 % | L2 Cache Throughput: 3.42 % | SM Active Cycles: 584.99 cycle | Compute (SM) [%]: 1.47
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.31 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:30, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.91 cycle/nsecond | SM Frequency: 1.38 cycle/nsecond | Elapsed Cycles: 4,907 cycle | Memory [%]: 9.12 | DRAM Throughput: 9.12 % | Duration: 3.55 usecond | L1/TEX Cache Throughput: 10.48 % | L2 Cache Throughput: 8.10 % | SM Active Cycles: 1,365.68 cycle | Compute (SM) [%]: 3.00
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.93 % | Achieved Active Warps Per SM: 7.64
    WRN: achieved occupancy (15.9%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.18 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 7,110 cycle | Memory [%]: 4.94 | DRAM Throughput: 1.58 % | Duration: 5.57 usecond | L1/TEX Cache Throughput: 23.59 % | L2 Cache Throughput: 4.94 % | SM Active Cycles: 941.12 cycle | Compute (SM) [%]: 2.74
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.30 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.23 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 7,243 cycle | Memory [%]: 4.85 | DRAM Throughput: 1.54 % | Duration: 5.66 usecond | L1/TEX Cache Throughput: 23.24 % | L2 Cache Throughput: 4.85 % | SM Active Cycles: 964.50 cycle | Compute (SM) [%]: 2.69
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.27 % | Achieved Active Warps Per SM: 3.97
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.36 cycle/nsecond | SM Frequency: 1.30 cycle/nsecond | Elapsed Cycles: 8,584 cycle | Memory [%]: 3.35 | DRAM Throughput: 1.86 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 24.75 % | L2 Cache Throughput: 3.35 % | SM Active Cycles: 593.90 cycle | Compute (SM) [%]: 1.44
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.37 % | Achieved Active Warps Per SM: 4.02
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:31, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.20 cycle/nsecond | SM Frequency: 1.44 cycle/nsecond | Elapsed Cycles: 5,199 cycle | Memory [%]: 8.68 | DRAM Throughput: 8.68 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.50 % | L2 Cache Throughput: 7.66 % | SM Active Cycles: 1,362.65 cycle | Compute (SM) [%]: 2.83
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.31 % | Achieved Active Warps Per SM: 7.83
    WRN: achieved occupancy (16.3%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:32, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.19 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,333 cycle | Memory [%]: 3.45 | DRAM Throughput: 1.92 % | Duration: 6.53 usecond | L1/TEX Cache Throughput: 25.31 % | L2 Cache Throughput: 3.45 % | SM Active Cycles: 578.85 cycle | Compute (SM) [%]: 1.48
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:32, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.77 cycle/nsecond | SM Frequency: 1.36 cycle/nsecond | Elapsed Cycles: 5,054 cycle | Memory [%]: 8.87 | DRAM Throughput: 8.87 % | Duration: 3.71 usecond | L1/TEX Cache Throughput: 10.40 % | L2 Cache Throughput: 7.88 % | SM Active Cycles: 1,376.18 cycle | Compute (SM) [%]: 2.91
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.80 % | Achieved Active Warps Per SM: 7.58
    WRN: achieved occupancy (15.8%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:32, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.39 cycle/nsecond | SM Frequency: 1.31 cycle/nsecond | Elapsed Cycles: 8,637 cycle | Memory [%]: 3.33 | DRAM Throughput: 1.85 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 24.74 % | L2 Cache Throughput: 3.33 % | SM Active Cycles: 591.04 cycle | Compute (SM) [%]: 1.43
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:33, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.48 cycle/nsecond | SM Frequency: 1.32 cycle/nsecond | Elapsed Cycles: 5,065 cycle | Memory [%]: 8.87 | DRAM Throughput: 8.87 % | Duration: 3.84 usecond | L1/TEX Cache Throughput: 10.35 % | L2 Cache Throughput: 7.87 % | SM Active Cycles: 1,381.91 cycle | Compute (SM) [%]: 2.90
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.70 % | Achieved Active Warps Per SM: 7.54
    WRN: achieved occupancy (15.7%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:33, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.05 cycle/nsecond | SM Frequency: 1.25 cycle/nsecond | Elapsed Cycles: 8,319 cycle | Memory [%]: 3.46 | DRAM Throughput: 1.92 % | Duration: 6.62 usecond | L1/TEX Cache Throughput: 25.29 % | L2 Cache Throughput: 3.46 % | SM Active Cycles: 578.35 cycle | Compute (SM) [%]: 1.49
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 4.00
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:33, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.11 cycle/nsecond | SM Frequency: 1.42 cycle/nsecond | Elapsed Cycles: 5,062 cycle | Memory [%]: 8.93 | DRAM Throughput: 8.93 % | Duration: 3.55 usecond | L1/TEX Cache Throughput: 10.48 % | L2 Cache Throughput: 7.84 % | SM Active Cycles: 1,365.50 cycle | Compute (SM) [%]: 2.90
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.78 % | Achieved Active Warps Per SM: 7.57
    WRN: achieved occupancy (15.8%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:33, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.09 cycle/nsecond | SM Frequency: 1.26 cycle/nsecond | Elapsed Cycles: 7,080 cycle | Memory [%]: 4.96 | DRAM Throughput: 1.58 % | Duration: 5.63 usecond | L1/TEX Cache Throughput: 24.20 % | L2 Cache Throughput: 4.96 % | SM Active Cycles: 935.44 cycle | Compute (SM) [%]: 2.76
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.31 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:34, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.14 cycle/nsecond | SM Frequency: 1.27 cycle/nsecond | Elapsed Cycles: 7,190 cycle | Memory [%]: 4.88 | DRAM Throughput: 1.56 % | Duration: 5.66 usecond | L1/TEX Cache Throughput: 23.61 % | L2 Cache Throughput: 4.88 % | SM Active Cycles: 955.99 cycle | Compute (SM) [%]: 2.71
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.29 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:34, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.22 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,487 cycle | Memory [%]: 3.39 | DRAM Throughput: 1.88 % | Duration: 6.62 usecond | L1/TEX Cache Throughput: 25.02 % | L2 Cache Throughput: 3.39 % | SM Active Cycles: 585.12 cycle | Compute (SM) [%]: 1.46
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.29 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:34, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.78 cycle/nsecond | SM Frequency: 1.36 cycle/nsecond | Elapsed Cycles: 4,941 cycle | Memory [%]: 9.09 | DRAM Throughput: 9.09 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.46 % | L2 Cache Throughput: 8.05 % | SM Active Cycles: 1,367.53 cycle | Compute (SM) [%]: 2.98
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.13 % | Achieved Active Warps Per SM: 7.74
    WRN: achieved occupancy (16.1%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:35, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond | SM Frequency: 1.29 cycle/nsecond | Elapsed Cycles: 8,367 cycle | Memory [%]: 3.44 | DRAM Throughput: 1.91 % | Duration: 6.50 usecond | L1/TEX Cache Throughput: 25.27 % | L2 Cache Throughput: 3.44 % | SM Active Cycles: 580.94 cycle | Compute (SM) [%]: 1.48
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.79 % | Achieved Active Warps Per SM: 4.22
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:35, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 10.03 cycle/nsecond | SM Frequency: 1.56 cycle/nsecond | Elapsed Cycles: 5,899 cycle | Memory [%]: 7.62 | DRAM Throughput: 7.62 % | Duration: 3.78 usecond | L1/TEX Cache Throughput: 10.09 % | L2 Cache Throughput: 6.72 % | SM Active Cycles: 1,417.19 cycle | Compute (SM) [%]: 2.49
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.97 % | Achieved Active Warps Per SM: 7.67
    WRN: achieved occupancy (16.0%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:35, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,413 cycle | Memory [%]: 3.42 | DRAM Throughput: 1.89 % | Duration: 6.56 usecond | L1/TEX Cache Throughput: 25.00 % | L2 Cache Throughput: 3.42 % | SM Active Cycles: 586.60 cycle | Compute (SM) [%]: 1.47
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.25 % | Achieved Active Warps Per SM: 3.96
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:35, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.86 cycle/nsecond | SM Frequency: 1.38 cycle/nsecond | Elapsed Cycles: 4,995 cycle | Memory [%]: 9.02 | DRAM Throughput: 9.02 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.58 % | L2 Cache Throughput: 7.96 % | SM Active Cycles: 1,352.21 cycle | Compute (SM) [%]: 2.94
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.05 % | Achieved Active Warps Per SM: 7.70
    WRN: achieved occupancy (16.0%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:36, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.39
    SM Frequency               cycle/nsecond     1.31
    Elapsed Cycles             cycle             8,549
    Memory [%]                 %                 3.37
    DRAM Throughput            %                 1.87
    Duration                   usecond           6.53
    L1/TEX Cache Throughput    %                 25.18
    L2 Cache Throughput        %                 3.37
    SM Active Cycles           cycle             579.01
    Compute (SM) [%]           %                 1.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.31
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:36, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.83
    SM Frequency               cycle/nsecond     1.37
    Elapsed Cycles             cycle             4,920
    Memory [%]                 %                 9.12
    DRAM Throughput            %                 9.12
    Duration                   usecond           3.58
    L1/TEX Cache Throughput    %                 10.61
    L2 Cache Throughput        %                 8.10
    SM Active Cycles           cycle             1,348.37
    Compute (SM) [%]           %                 2.99
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.06
    Achieved Active Warps Per SM       warp      7.71
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
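For the GRU kernel the theoretical occupancy is 100% yet the achieved occupancy sits near 16%. The two measured rows are mutually consistent: achieved occupancy is just the averaged active-warp count divided by the per-SM maximum (assumed to be 48 warps here). A sketch using the values from the two reports above:

```python
MAX_WARPS_PER_SM = 48  # assumed per-SM warp limit used throughout these reports

def achieved_occupancy(avg_active_warps_per_sm: float) -> float:
    """Achieved occupancy (%) from 'Achieved Active Warps Per SM'."""
    return 100.0 * avg_active_warps_per_sm / MAX_WARPS_PER_SM

print(round(achieved_occupancy(7.71), 2))  # GRU report above     -> 16.06
print(round(achieved_occupancy(3.99), 2))  # cutlass report above -> 8.31
```

The gap to 100% for the GRU kernel therefore comes from the tiny grid (0.08 waves), not from any per-block resource limit, which matches the "not impacted by any block limit" note.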
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:36, Context 1, Stream 25

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.28
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             7,159
    Memory [%]                 %                 4.90
    DRAM Throughput            %                 1.56
    Duration                   usecond           5.57
    L1/TEX Cache Throughput    %                 23.87
    L2 Cache Throughput        %                 4.90
    SM Active Cycles           cycle             945.26
    Compute (SM) [%]           %                 2.72
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.29
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 27

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.17
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             7,221
    Memory [%]                 %                 4.87
    DRAM Throughput            %                 1.55
    Duration                   usecond           5.66
    L1/TEX Cache Throughput    %                 23.58
    L2 Cache Throughput        %                 4.87
    SM Active Cycles           cycle             955.88
    Compute (SM) [%]           %                 2.70
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.31
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.18
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             8,361
    Memory [%]                 %                 3.49
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.56
    L1/TEX Cache Throughput    %                 25.07
    L2 Cache Throughput        %                 3.49
    SM Active Cycles           cycle             584.38
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
       If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:37, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     9.40
    SM Frequency               cycle/nsecond     1.46
    Elapsed Cycles             cycle             5,188
    Memory [%]                 %                 8.65
    DRAM Throughput            %                 8.65
    Duration                   usecond           3.55
    L1/TEX Cache Throughput    %                 10.69
    L2 Cache Throughput        %                 7.65
    SM Active Cycles           cycle             1,338.46
    Compute (SM) [%]           %                 2.83
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.30
    Achieved Active Warps Per SM       warp      7.82
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.21
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             8,300
    Memory [%]                 %                 3.47
    DRAM Throughput            %                 1.92
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.27
    L2 Cache Throughput        %                 3.47
    SM Active Cycles           cycle             579.13
    Compute (SM) [%]           %                 1.49
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.30
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:38, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.65
    SM Frequency               cycle/nsecond     1.34
    Elapsed Cycles             cycle             4,902
    Memory [%]                 %                 9.15
    DRAM Throughput            %                 9.15
    Duration                   usecond           3.65
    L1/TEX Cache Throughput    %                 9.51
    L2 Cache Throughput        %                 8.12
    SM Active Cycles           cycle             1,504.65
    Compute (SM) [%]           %                 3.00
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         14.42
    Achieved Active Warps Per SM       warp      6.92
  WRN  This kernel's theoretical occupancy is not impacted by any block limit.
       The difference between calculated theoretical (100.0%) and measured achieved occupancy (14.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:38, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.05
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             8,362
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.92
    Duration                   usecond           6.66
    L1/TEX Cache Throughput    %                 24.84
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             588.41
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.26
    Achieved Active Warps Per SM       warp      3.96
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:38, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.63
    SM Frequency               cycle/nsecond     1.34
    Elapsed Cycles             cycle             4,982
    Memory [%]                 %                 9.01
    DRAM Throughput            %                 9.01
    Duration                   usecond           3.71
    L1/TEX Cache Throughput    %                 10.42
    L2 Cache Throughput        %                 8.01
    SM Active Cycles           cycle             1,372.88
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         15.96
    Achieved Active Warps Per SM       warp      7.66
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:38, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.26
    SM Frequency               cycle/nsecond     1.29
    Elapsed Cycles             cycle             8,367
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.25
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             580.40
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:39, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.88
    SM Frequency               cycle/nsecond     1.38
    Elapsed Cycles             cycle             4,921
    Memory [%]                 %                 9.15
    DRAM Throughput            %                 9.15
    Duration                   usecond           3.55
    L1/TEX Cache Throughput    %                 10.82
    L2 Cache Throughput        %                 8.10
    SM Active Cycles           cycle             1,322.51
    Compute (SM) [%]           %                 2.99
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.36
    Achieved Active Warps Per SM       warp      7.85
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
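The "Block Limit Registers" and "Block Limit Warps" rows for the two kernels that alternate through this trace can likewise be recomputed from their Launch Statistics. A sketch, again assuming a 64 K register file and 48 resident warps per SM:

```python
REGS_PER_SM      = 64 * 1024  # assumed register file per SM
MAX_WARPS_PER_SM = 48         # assumed resident-warp limit per SM

def block_limits(threads_per_block: int, regs_per_thread: int):
    """Register- and warp-based limits on resident blocks per SM."""
    limit_regs  = REGS_PER_SM // (regs_per_thread * threads_per_block)
    limit_warps = MAX_WARPS_PER_SM // (threads_per_block // 32)
    return limit_regs, limit_warps

print(block_limits(128, 96))  # cutlass::Kernel    -> (5, 12)
print(block_limits(256, 30))  # GRU_elementWise_fp -> (8, 6)
```

For the GRU launch the binding limit is warps (6 blocks × 8 warps = 48, i.e. 100% theoretical occupancy); for the cutlass launch the binding limit is the shared-memory figure of 2 blocks computed earlier.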
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:39, Context 1, Stream 25

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.16
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             7,122
    Memory [%]                 %                 4.93
    DRAM Throughput            %                 1.57
    Duration                   usecond           5.60
    L1/TEX Cache Throughput    %                 24.23
    L2 Cache Throughput        %                 4.93
    SM Active Cycles           cycle             938.99
    Compute (SM) [%]           %                 2.74
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.32
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:39, Context 1, Stream 27

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.13
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             7,208
    Memory [%]                 %                 4.87
    DRAM Throughput            %                 1.55
    Duration                   usecond           5.70
    L1/TEX Cache Throughput    %                 23.75
    L2 Cache Throughput        %                 4.87
    SM Active Cycles           cycle             950.68
    Compute (SM) [%]           %                 2.71
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:40, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.10
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             8,427
    Memory [%]                 %                 3.41
    DRAM Throughput            %                 1.89
    Duration                   usecond           6.69
    L1/TEX Cache Throughput    %                 24.96
    L2 Cache Throughput        %                 3.41
    SM Active Cycles           cycle             585.49
    Compute (SM) [%]           %                 1.47
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
       If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.30
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:40, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.94
    SM Frequency               cycle/nsecond     1.39
    Elapsed Cycles             cycle             4,983
    Memory [%]                 %                 9.01
    DRAM Throughput            %                 9.01
    Duration                   usecond           3.58
    L1/TEX Cache Throughput    %                 10.35
    L2 Cache Throughput        %                 7.99
    SM Active Cycles           cycle             1,381.81
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.01
    Achieved Active Warps Per SM       warp      7.69
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:40, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.28
    SM Frequency               cycle/nsecond     1.29
    Elapsed Cycles             cycle             8,378
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.22
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             580.43
    Compute (SM) [%]           %                 1.47
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.36
    Achieved Active Warps Per SM       warp      4.01
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:40, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.70
    SM Frequency               cycle/nsecond     1.35
    Elapsed Cycles             cycle             4,937
    Memory [%]                 %                 9.09
    DRAM Throughput            %                 9.09
    Duration                   usecond           3.65
    L1/TEX Cache Throughput    %                 10.62
    L2 Cache Throughput        %                 8.06
    SM Active Cycles           cycle             1,347.47
    Compute (SM) [%]           %                 2.98
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         15.83
    Achieved Active Warps Per SM       warp      7.60
  WRN  This kernel's theoretical occupancy is not impacted by any block limit.
       The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:41, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.23
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             8,458
    Memory [%]                 %                 3.40
    DRAM Throughput            %                 1.89
    Duration                   usecond           6.59
    L1/TEX Cache Throughput    %                 24.90
    L2 Cache Throughput        %                 3.40
    SM Active Cycles           cycle             584.41
    Compute (SM) [%]           %                 1.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.38
    Achieved Active Warps Per SM       warp      4.02
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:41, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.81
    SM Frequency               cycle/nsecond     1.38
    Elapsed Cycles             cycle             4,983
    Memory [%]                 %                 9.23
    DRAM Throughput            %                 9.23
    Duration                   usecond           3.62
    L1/TEX Cache Throughput    %                 9.33
    L2 Cache Throughput        %                 8.06
    SM Active Cycles           cycle             1,533.21
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         14.30
    Achieved Active Warps Per SM       warp      6.86
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (14.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:41, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.16
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,327
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.53
    L1/TEX Cache Throughput             %                    25.23
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                578.72
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.33
    Achieved Active Warps Per SM        warp                 4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:42, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.88
    SM Frequency                        cycle/nsecond        1.38
    Elapsed Cycles                      cycle                4,924
    Memory [%]                          %                    9.16
    DRAM Throughput                     %                    9.16
    Duration                            usecond              3.55
    L1/TEX Cache Throughput             %                    10.24
    L2 Cache Throughput                 %                    8.11
    SM Active Cycles                    cycle                1,397.03
    Compute (SM) [%]                    %                    2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.36
    Achieved Active Warps Per SM        warp                 7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
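  NOTE  The profiled script itself is not shown in this log. As a hypothetical stand-in, a cuDNN-backed GRU forward pass in PyTorch like the sketch below may dispatch to the kind of kernels reported here (GEMMs such as cutlass::Kernel plus the pointwise GRU_elementWise_fp kernel per time step); the sizes are illustrative only.

    # Hypothetical workload sketch; not the actual profiled script.
    import torch

    gru = torch.nn.GRU(input_size=256, hidden_size=256, batch_first=True).cuda().half()
    x = torch.randn(32, 16, 256, device="cuda", dtype=torch.float16)   # (batch, seq, features), illustrative
    with torch.no_grad():
        y, h = gru(x)          # cuDNN GRU forward: GEMMs + per-step elementwise kernels
    torch.cuda.synchronize()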
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:42, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.18
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                7,114
    Memory [%]                          %                    4.93
    DRAM Throughput                     %                    1.58
    Duration                            usecond              5.57
    L1/TEX Cache Throughput             %                    23.82
    L2 Cache Throughput                 %                    4.93
    SM Active Cycles                    cycle                944.28
    Compute (SM) [%]                    %                    2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.32
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:42, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.11
    SM Frequency                        cycle/nsecond        1.26
    Elapsed Cycles                      cycle                7,207
    Memory [%]                          %                    4.87
    DRAM Throughput                     %                    1.55
    Duration                            usecond              5.70
    L1/TEX Cache Throughput             %                    23.61
    L2 Cache Throughput                 %                    4.87
    SM Active Cycles                    cycle                958.75
    Compute (SM) [%]                    %                    2.71
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.43
    Achieved Active Warps Per SM        warp                 4.05
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:43, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.05
    SM Frequency                        cycle/nsecond        1.25
    Elapsed Cycles                      cycle                8,306
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.62
    L1/TEX Cache Throughput             %                    25.03
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                584.66
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.28
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:43, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.84
    SM Frequency                        cycle/nsecond        1.37
    Elapsed Cycles                      cycle                5,015
    Memory [%]                          %                    8.95
    DRAM Throughput                     %                    8.95
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.58
    L2 Cache Throughput                 %                    7.94
    SM Active Cycles                    cycle                1,352.43
    Compute (SM) [%]                    %                    2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.62
    Achieved Active Warps Per SM        warp                 7.98
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:43, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.13
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,323
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.56
    L1/TEX Cache Throughput             %                    25.18
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                579.53
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.34
    Achieved Active Warps Per SM        warp                 4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:43, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.71
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                4,877
    Memory [%]                          %                    9.24
    DRAM Throughput                     %                    9.24
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    10.90
    L2 Cache Throughput                 %                    8.16
    SM Active Cycles                    cycle                1,312.97
    Compute (SM) [%]                    %                    3.02
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.36
    Achieved Active Warps Per SM        warp                 7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:44, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.14
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,375
    Memory [%]                          %                    3.44
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.59
    L1/TEX Cache Throughput             %                    25.01
    L2 Cache Throughput                 %                    3.44
    SM Active Cycles                    cycle                585.60
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.29
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:44, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.57
    SM Frequency                        cycle/nsecond        1.49
    Elapsed Cycles                      cycle                5,389
    Memory [%]                          %                    8.35
    DRAM Throughput                     %                    8.35
    Duration                            usecond              3.62
    L1/TEX Cache Throughput             %                    10.67
    L2 Cache Throughput                 %                    7.38
    SM Active Cycles                    cycle                1,341.12
    Compute (SM) [%]                    %                    2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.23
    Achieved Active Warps Per SM        warp                 7.79
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:44, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.02
    SM Frequency                        cycle/nsecond        1.25
    Elapsed Cycles                      cycle                8,346
    Memory [%]                          %                    3.45
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.69
    L1/TEX Cache Throughput             %                    25.27
    L2 Cache Throughput                 %                    3.45
    SM Active Cycles                    cycle                579.63
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.30
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:44, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.05
    SM Frequency                        cycle/nsecond        1.41
    Elapsed Cycles                      cycle                4,964
    Memory [%]                          %                    9.07
    DRAM Throughput                     %                    9.07
    Duration                            usecond              3.52
    L1/TEX Cache Throughput             %                    10.42
    L2 Cache Throughput                 %                    8.03
    SM Active Cycles                    cycle                1,373.07
    Compute (SM) [%]                    %                    2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    15.75
    Achieved Active Warps Per SM        warp                 7.56
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
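  NOTE  A quick cross-check (illustrative only): the Duration these reports show is consistent with Elapsed Cycles divided by the reported SM Frequency, e.g. for the GRU launch directly above.

    # Duration ~= Elapsed Cycles / SM Frequency, using the values from the report above.
    elapsed_cycles = 4_964
    sm_freq_cycles_per_ns = 1.41
    duration_ns = elapsed_cycles / sm_freq_cycles_per_ns   # ~3520 ns
    print(round(duration_ns / 1_000, 2), "usecond")        # ~3.52, matching the reported Duration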
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.09
    SM Frequency                        cycle/nsecond        1.26
    Elapsed Cycles                      cycle                7,127
    Memory [%]                          %                    4.98
    DRAM Throughput                     %                    1.57
    Duration                            usecond              5.63
    L1/TEX Cache Throughput             %                    24.00
    L2 Cache Throughput                 %                    4.98
    SM Active Cycles                    cycle                944.68
    Compute (SM) [%]                    %                    2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.24
    Achieved Active Warps Per SM        warp                 3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.25
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                7,224
    Memory [%]                          %                    4.85
    DRAM Throughput                     %                    1.55
    Duration                            usecond              5.63
    L1/TEX Cache Throughput             %                    23.91
    L2 Cache Throughput                 %                    4.85
    SM Active Cycles                    cycle                946.29
    Compute (SM) [%]                    %                    2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.35
    Achieved Active Warps Per SM        warp                 4.01
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.19
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,434
    Memory [%]                          %                    3.45
    DRAM Throughput                     %                    1.89
    Duration                            usecond              6.62
    L1/TEX Cache Throughput             %                    24.43
    L2 Cache Throughput                 %                    3.45
    SM Active Cycles                    cycle                598.07
    Compute (SM) [%]                    %                    1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.10
    Achieved Active Warps Per SM        warp                 3.89
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:46, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.94
    SM Frequency                        cycle/nsecond        1.39
    Elapsed Cycles                      cycle                4,971
    Memory [%]                          %                    9.01
    DRAM Throughput                     %                    9.01
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    10.40
    L2 Cache Throughput                 %                    8.01
    SM Active Cycles                    cycle                1,375.13
    Compute (SM) [%]                    %                    2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.11
    Achieved Active Warps Per SM        warp                 7.73
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:46, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.18
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,317
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.56
    L1/TEX Cache Throughput             %                    25.34
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                577.41
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.31
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:46, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.81
    SM Frequency                        cycle/nsecond        1.37
    Elapsed Cycles                      cycle                4,989
    Memory [%]                          %                    8.98
    DRAM Throughput                     %                    8.98
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.39
    L2 Cache Throughput                 %                    7.97
    SM Active Cycles                    cycle                1,376.72
    Compute (SM) [%]                    %                    2.95
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    15.97
    Achieved Active Warps Per SM        warp                 7.67
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:46, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.19
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                8,411
    Memory [%]                          %                    3.42
    DRAM Throughput                     %                    1.90
    Duration                            usecond              6.59
    L1/TEX Cache Throughput             %                    25.17
    L2 Cache Throughput                 %                    3.42
    SM Active Cycles                    cycle                582.81
    Compute (SM) [%]                    %                    1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.32
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:47, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.70
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,925
    Memory [%]                          %                    9.09
    DRAM Throughput                     %                    9.09
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.54
    L2 Cache Throughput                 %                    8.08
    SM Active Cycles                    cycle                1,357.28
    Compute (SM) [%]                    %                    2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.15
    Achieved Active Warps Per SM        warp                 7.75
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
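  NOTE  Sketch relating the two achieved-occupancy metrics in these reports: achieved occupancy is achieved active warps per SM divided by the assumed 48-warp maximum, shown here for the GRU launch above.

    # Achieved Occupancy [%] ~= Achieved Active Warps Per SM / max warps per SM.
    WARPS_PER_SM_MAX = 48                      # assumed device limit, as elsewhere in these notes
    achieved_active_warps = 7.75               # from the report above
    achieved_occupancy = 100.0 * achieved_active_warps / WARPS_PER_SM_MAX
    print(round(achieved_occupancy, 2))        # ~16.15 %, matching the reported value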
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:47, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,337
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  23.63
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  620.59
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  7.73
    Achieved Active Warps Per SM  warp  3.71
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:47, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.96
    SM Frequency  cycle/nsecond  1.39
    Elapsed Cycles  cycle  4,913
    Memory [%]  %  9.16
    DRAM Throughput  %  9.16
    Duration  usecond  3.52
    L1/TEX Cache Throughput  %  10.75
    L2 Cache Throughput  %  8.04
    SM Active Cycles  cycle  1,330.78
    Compute (SM) [%]  %  3.00
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.43
    Achieved Active Warps Per SM  warp  7.89
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:47, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.17
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  7,130
    Memory [%]  %  4.91
    DRAM Throughput  %  1.57
    Duration  usecond  5.60
    L1/TEX Cache Throughput  %  24.06
    L2 Cache Throughput  %  4.91
    SM Active Cycles  cycle  944.53
    Compute (SM) [%]  %  2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.27
    Achieved Active Warps Per SM  warp  3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:48, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.23
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  7,252
    Memory [%]  %  4.84
    DRAM Throughput  %  1.54
    Duration  usecond  5.66
    L1/TEX Cache Throughput  %  23.71
    L2 Cache Throughput  %  4.84
    SM Active Cycles  cycle  952.57
    Compute (SM) [%]  %  2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:48, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.13
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  8,421
    Memory [%]  %  3.42
    DRAM Throughput  %  1.90
    Duration  usecond  6.62
    L1/TEX Cache Throughput  %  25.11
    L2 Cache Throughput  %  3.42
    SM Active Cycles  cycle  583.68
    Compute (SM) [%]  %  1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.32
    Achieved Active Warps Per SM  warp  3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:48, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.66
    SM Frequency  cycle/nsecond  1.50
    Elapsed Cycles  cycle  5,485
    Memory [%]  %  8.19
    DRAM Throughput  %  8.19
    Duration  usecond  3.65
    L1/TEX Cache Throughput  %  10.33
    L2 Cache Throughput  %  7.23
    SM Active Cycles  cycle  1,384.63
    Compute (SM) [%]  %  2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.65
    Achieved Active Warps Per SM  warp  7.99
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:49, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,345
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  25.33
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  578.88
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.29
    Achieved Active Warps Per SM  warp  3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:49, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.68
    SM Frequency  cycle/nsecond  1.34
    Elapsed Cycles  cycle  5,071
    Memory [%]  %  8.81
    DRAM Throughput  %  8.81
    Duration  usecond  3.78
    L1/TEX Cache Throughput  %  9.80
    L2 Cache Throughput  %  7.81
    SM Active Cycles  cycle  1,459.62
    Compute (SM) [%]  %  2.90
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  15.85
    Achieved Active Warps Per SM  warp  7.61
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:49, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.25
    SM Frequency  cycle/nsecond  1.29
    Elapsed Cycles  cycle  8,479
    Memory [%]  %  3.40
    DRAM Throughput  %  1.89
    Duration  usecond  6.59
    L1/TEX Cache Throughput  %  24.98
    L2 Cache Throughput  %  3.40
    SM Active Cycles  cycle  588.66
    Compute (SM) [%]  %  1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.74
    Achieved Active Warps Per SM  warp  4.20
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
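Note: every launch in this part of the trace runs 8-32 blocks on a 68-SM device, which is what triggers the repeated "grid too small" warning (for the 8-block cutlass launches, Waves Per SM 0.06 is roughly 8 / (68 SMs x 2 blocks per SM)). The launch-statistics warning suggests smaller blocks or a larger grid; for code under your control that choice can be automated. A minimal sketch under the assumption of a simple bounds-checked elementwise kernel (scale_kernel is hypothetical, not one of the kernels above):

    #include <cuda_runtime.h>

    __global__ void scale_kernel(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    void launch_scale(float* d_x, int n, float a, cudaStream_t stream) {
        int dev = 0, numSMs = 0;
        cudaGetDevice(&dev);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);

        // Start from a typical block size, then shrink it (down to one warp) until
        // the grid has at least one block per SM, when the problem is large enough.
        // This is the remedy the warning above proposes.
        int blockSize = 256;
        while (blockSize > 32 && (n + blockSize - 1) / blockSize < numSMs)
            blockSize /= 2;

        int grid = (n + blockSize - 1) / blockSize;
        scale_kernel<<<grid, blockSize, 0, stream>>>(d_x, n, a);
    }

For library kernels such as the cutlass GEMMs and cuDNN GRU steps above, the launch shape is chosen internally; the practical levers are usually larger batch or hidden sizes, or overlapping the small kernels across streams.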
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:49, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.71
    SM Frequency  cycle/nsecond  1.51
    Elapsed Cycles  cycle  5,522
    Memory [%]  %  8.15
    DRAM Throughput  %  8.15
    Duration  usecond  3.65
    L1/TEX Cache Throughput  %  10.60
    L2 Cache Throughput  %  7.21
    SM Active Cycles  cycle  1,349.65
    Compute (SM) [%]  %  2.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.29
    Achieved Active Warps Per SM  warp  7.82
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:50, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  7.96
    SM Frequency  cycle/nsecond  1.24
    Elapsed Cycles  cycle  8,301
    Memory [%]  %  3.47
    DRAM Throughput  %  1.93
    Duration  usecond  6.69
    L1/TEX Cache Throughput  %  25.48
    L2 Cache Throughput  %  3.47
    SM Active Cycles  cycle  575.99
    Compute (SM) [%]  %  1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.39
    Achieved Active Warps Per SM  warp  4.02
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:50, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.29
    SM Frequency  cycle/nsecond  1.44
    Elapsed Cycles  cycle  5,225
    Memory [%]  %  8.60
    DRAM Throughput  %  8.60
    Duration  usecond  3.62
    L1/TEX Cache Throughput  %  10.72
    L2 Cache Throughput  %  7.61
    SM Active Cycles  cycle  1,334.28
    Compute (SM) [%]  %  2.81
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.36
    Achieved Active Warps Per SM  warp  7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
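Note: for these GRU_elementWise_fp launches, the gap between 100% theoretical and roughly 16% achieved occupancy is consistent with the launch shape rather than with warp-scheduling overhead alone. Achieved Active Warps Per SM stays near 8, i.e. one 256-thread block per SM, which is the most a 32-block grid can place on a 68-SM device during a kernel that lasts only a few microseconds.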
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:50, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.09
    SM Frequency  cycle/nsecond  1.26
    Elapsed Cycles  cycle  7,107
    Memory [%]  %  4.94
    DRAM Throughput  %  1.58
    Duration  usecond  5.63
    L1/TEX Cache Throughput  %  24.14
    L2 Cache Throughput  %  4.94
    SM Active Cycles  cycle  935.93
    Compute (SM) [%]  %  2.75
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.30
    Achieved Active Warps Per SM  warp  3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:51, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  7,225
    Memory [%]  %  4.86
    DRAM Throughput  %  1.55
    Duration  usecond  5.66
    L1/TEX Cache Throughput  %  23.86
    L2 Cache Throughput  %  4.86
    SM Active Cycles  cycle  951.41
    Compute (SM) [%]  %  2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:51, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.22
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,364
    Memory [%]  %  3.45
    DRAM Throughput  %  1.91
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  25.07
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  584.26
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.46
    Achieved Active Warps Per SM  warp  4.06
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:51, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.86
    SM Frequency  cycle/nsecond  1.37
    Elapsed Cycles  cycle  4,925
    Memory [%]  %  9.09
    DRAM Throughput  %  9.09
    Duration  usecond  3.58
    L1/TEX Cache Throughput  %  10.59
    L2 Cache Throughput  %  8.07
    SM Active Cycles  cycle  1,351.25
    Compute (SM) [%]  %  2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.57
    Achieved Active Warps Per SM  warp  7.95
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:52, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.12
    SM Frequency  cycle/nsecond  1.26
    Elapsed Cycles  cycle  8,341
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.59
    L1/TEX Cache Throughput  %  25.32
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  578.46
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:52, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.84
    SM Frequency  cycle/nsecond  1.37
    Elapsed Cycles  cycle  4,965
    Memory [%]  %  9.03
    DRAM Throughput  %  9.03
    Duration  usecond  3.62
    L1/TEX Cache Throughput  %  10.68
    L2 Cache Throughput  %  8.01
    SM Active Cycles  cycle  1,339.46
    Compute (SM) [%]  %  2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.14
    Achieved Active Warps Per SM  warp  7.75
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:52, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.35
    SM Frequency  cycle/nsecond  1.30
    Elapsed Cycles  cycle  8,485
    Memory [%]  %  3.40
    DRAM Throughput  %  1.88
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  24.83
    L2 Cache Throughput  %  3.40
    SM Active Cycles  cycle  588.57
    Compute (SM) [%]  %  1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.48
    Achieved Active Warps Per SM  warp  4.07
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:52, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.05
    SM Frequency  cycle/nsecond  1.41
    Elapsed Cycles  cycle  5,240
    Memory [%]  %  8.59
    DRAM Throughput  %  8.59
    Duration  usecond  3.71
    L1/TEX Cache Throughput  %  10.42
    L2 Cache Throughput  %  7.60
    SM Active Cycles  cycle  1,372.66
    Compute (SM) [%]  %  2.81
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.25
    Achieved Active Warps Per SM  warp  7.80
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.31 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 8,476 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.89 %
    Duration: 6.53 usecond
    L1/TEX Cache Throughput: 24.39 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 598.53 cycle
    Compute (SM) [%]: 1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.16 %
    Achieved Active Warps Per SM: 3.91 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:53, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.97 cycle/nsecond
    SM Frequency: 1.39 cycle/nsecond
    Elapsed Cycles: 4,955 cycle
    Memory [%]: 9.06
    DRAM Throughput: 9.06 %
    Duration: 3.55 usecond
    L1/TEX Cache Throughput: 10.74 %
    L2 Cache Throughput: 8.09 %
    SM Active Cycles: 1,331.51 cycle
    Compute (SM) [%]: 2.97
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.42 %
    Achieved Active Warps Per SM: 7.88 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 7,109 cycle
    Memory [%]: 4.94
    DRAM Throughput: 1.57 %
    Duration: 5.60 usecond
    L1/TEX Cache Throughput: 24.07 %
    L2 Cache Throughput: 4.94 %
    SM Active Cycles: 938.87 cycle
    Compute (SM) [%]: 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.19 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 7,249 cycle
    Memory [%]: 4.85
    DRAM Throughput: 1.55 %
    Duration: 5.66 usecond
    L1/TEX Cache Throughput: 23.40 %
    L2 Cache Throughput: 4.85 %
    SM Active Cycles: 955.46 cycle
    Compute (SM) [%]: 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:54, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.46 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 8,731 cycle
    Memory [%]: 3.34
    DRAM Throughput: 1.92 %
    Duration: 6.62 usecond
    L1/TEX Cache Throughput: 24.69 %
    L2 Cache Throughput: 3.34 %
    SM Active Cycles: 591.66 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.15 %
    Achieved Active Warps Per SM: 3.91 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:54, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.91 cycle/nsecond
    SM Frequency: 1.54 cycle/nsecond
    Elapsed Cycles: 5,538 cycle
    Memory [%]: 8.12
    DRAM Throughput: 8.12 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.62 %
    L2 Cache Throughput: 7.20 %
    SM Active Cycles: 1,346.88 cycle
    Compute (SM) [%]: 2.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.46 %
    Achieved Active Warps Per SM: 7.90 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:54, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.59 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 8,690 cycle
    Memory [%]: 3.31
    DRAM Throughput: 1.84 %
    Duration: 6.50 usecond
    L1/TEX Cache Throughput: 24.52 %
    L2 Cache Throughput: 3.31 %
    SM Active Cycles: 596.35 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.26 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:55, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.40 cycle/nsecond
    SM Frequency: 1.47 cycle/nsecond
    Elapsed Cycles: 5,360 cycle
    Memory [%]: 8.42
    DRAM Throughput: 8.42 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.58 %
    L2 Cache Throughput: 7.44 %
    SM Active Cycles: 1,352.68 cycle
    Compute (SM) [%]: 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.92 %
    Achieved Active Warps Per SM: 7.64 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:55, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 8,388 cycle
    Memory [%]: 3.43
    DRAM Throughput: 1.91 %
    Duration: 6.59 usecond
    L1/TEX Cache Throughput: 25.10 %
    L2 Cache Throughput: 3.43 %
    SM Active Cycles: 585.76 cycle
    Compute (SM) [%]: 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.28 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:55, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.68 cycle/nsecond
    SM Frequency: 1.50 cycle/nsecond
    Elapsed Cycles: 5,480 cycle
    Memory [%]: 8.31
    DRAM Throughput: 8.31 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.52 %
    L2 Cache Throughput: 7.26 %
    SM Active Cycles: 1,359.32 cycle
    Compute (SM) [%]: 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:55, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.67 cycle/nsecond
    SM Frequency: 1.35 cycle/nsecond
    Elapsed Cycles: 8,717 cycle
    Memory [%]: 3.30
    DRAM Throughput: 1.83 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 24.98 %
    L2 Cache Throughput: 3.30 %
    SM Active Cycles: 585.26 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.28 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:56, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.99 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 4,926 cycle
    Memory [%]: 9.13
    DRAM Throughput: 9.13 %
    Duration: 3.52 usecond
    L1/TEX Cache Throughput: 10.75 %
    L2 Cache Throughput: 8.07 %
    SM Active Cycles: 1,331.12 cycle
    Compute (SM) [%]: 2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 18.08 %
    Achieved Active Warps Per SM: 8.68 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (18.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:56, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 7,143 cycle
    Memory [%]: 4.91
    DRAM Throughput: 1.56 %
    Duration: 5.63 usecond
    L1/TEX Cache Throughput: 24.06 %
    L2 Cache Throughput: 4.91 %
    SM Active Cycles: 941.01 cycle
    Compute (SM) [%]: 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.27 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:56, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.23 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 7,268 cycle
    Memory [%]: 4.84
    DRAM Throughput: 1.54 %
    Duration: 5.66 usecond
    L1/TEX Cache Throughput: 23.81 %
    L2 Cache Throughput: 4.84 %
    SM Active Cycles: 949.31 cycle
    Compute (SM) [%]: 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.30 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:57, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.20 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 8,385 cycle
    Memory [%]: 3.43
    DRAM Throughput: 1.91 %
    Duration: 6.56 usecond
    L1/TEX Cache Throughput: 24.81 %
    L2 Cache Throughput: 3.43 %
    SM Active Cycles: 591.34 cycle
    Compute (SM) [%]: 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.25 %
    Achieved Active Warps Per SM: 3.96 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:57, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.46 cycle/nsecond
    SM Frequency: 1.47 cycle/nsecond
    Elapsed Cycles: 5,375 cycle
    Memory [%]: 8.37
    DRAM Throughput: 8.37 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.43 %
    L2 Cache Throughput: 7.38 %
    SM Active Cycles: 1,372.13 cycle
    Compute (SM) [%]: 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.28 %
    Achieved Active Warps Per SM: 7.81 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:57, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.12 cycle/nsecond
    SM Frequency: 1.26 cycle/nsecond
    Elapsed Cycles: 8,358 cycle
    Memory [%]: 3.44
    DRAM Throughput: 1.91 %
    Duration: 6.62 usecond
    L1/TEX Cache Throughput: 25.34 %
    L2 Cache Throughput: 3.44 %
    SM Active Cycles: 578.15 cycle
    Compute (SM) [%]: 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.31 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:57, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.11 cycle/nsecond
    SM Frequency: 1.42 cycle/nsecond
    Elapsed Cycles: 5,085 cycle
    Memory [%]: 8.84
    DRAM Throughput: 8.84 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.78 %
    L2 Cache Throughput: 7.82 %
    SM Active Cycles: 1,327.40 cycle
    Compute (SM) [%]: 2.89
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.71 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:58, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.27 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 8,462 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.89 %
    Duration: 6.56 usecond
    L1/TEX Cache Throughput: 23.48 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 622.18 cycle
    Compute (SM) [%]: 1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 7.79 %
    Achieved Active Warps Per SM: 3.74 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:58, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.69 cycle/nsecond
    SM Frequency: 1.35 cycle/nsecond
    Elapsed Cycles: 5,020 cycle
    Memory [%]: 8.95
    DRAM Throughput: 8.95 %
    Duration: 3.71 usecond
    L1/TEX Cache Throughput: 10.47 %
    L2 Cache Throughput: 7.95 %
    SM Active Cycles: 1,366.59 cycle
    Compute (SM) [%]: 2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:58, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.19
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,308
    Memory [%]                         %                 3.46
    DRAM Throughput                    %                 1.92
    Duration                           usecond           6.53
    L1/TEX Cache Throughput            %                 25.23
    L2 Cache Throughput                %                 3.46
    SM Active Cycles                   cycle             579.07
    Compute (SM) [%]                   %                 1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.39
    Achieved Active Warps Per SM       warp              4.03
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
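The Duration and Elapsed Cycles rows in these reports are consistent with each other through the SM frequency. A small check, using the cutlass report above; the displayed frequency is rounded to two decimals, so the reconstruction is only approximate:

# Duration is roughly Elapsed Cycles / SM Frequency (sketch; rounded inputs).
elapsed_cycles = 8_308
sm_freq_cycles_per_ns = 1.27
duration_us = elapsed_cycles / sm_freq_cycles_per_ns / 1_000
print(round(duration_us, 2))   # 6.54, vs. 6.53 usecond reported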
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:58, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.80
    SM Frequency                       cycle/nsecond     1.53
    Elapsed Cycles                     cycle             5,485
    Memory [%]                         %                 8.22
    DRAM Throughput                    %                 8.22
    Duration                           usecond           3.58
    L1/TEX Cache Throughput            %                 10.29
    L2 Cache Throughput                %                 7.23
    SM Active Cycles                   cycle             1,390.19
    Compute (SM) [%]                   %                 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.59
    Achieved Active Warps Per SM       warp              7.48
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.17
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             7,104
    Memory [%]                         %                 4.94
    DRAM Throughput                    %                 1.57
    Duration                           usecond           5.60
    L1/TEX Cache Throughput            %                 23.87
    L2 Cache Throughput                %                 4.94
    SM Active Cycles                   cycle             945.35
    Compute (SM) [%]                   %                 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.27
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.23
    SM Frequency                       cycle/nsecond     1.28
    Elapsed Cycles                     cycle             7,260
    Memory [%]                         %                 4.84
    DRAM Throughput                    %                 1.54
    Duration                           usecond           5.66
    L1/TEX Cache Throughput            %                 23.53
    L2 Cache Throughput                %                 4.84
    SM Active Cycles                   cycle             959.60
    Compute (SM) [%]                   %                 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.30
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.13
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,410
    Memory [%]                         %                 3.42
    DRAM Throughput                    %                 1.90
    Duration                           usecond           6.62
    L1/TEX Cache Throughput            %                 24.92
    L2 Cache Throughput                %                 3.42
    SM Active Cycles                   cycle             587.07
    Compute (SM) [%]                   %                 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.24
    Achieved Active Warps Per SM       warp              3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:00, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.05
    SM Frequency                       cycle/nsecond     1.41
    Elapsed Cycles                     cycle             5,015
    Memory [%]                         %                 8.99
    DRAM Throughput                    %                 8.99
    Duration                           usecond           3.55
    L1/TEX Cache Throughput            %                 10.33
    L2 Cache Throughput                %                 7.94
    SM Active Cycles                   cycle             1,385.03
    Compute (SM) [%]                   %                 2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.75
    Achieved Active Warps Per SM       warp              7.56
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:00, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.55
    SM Frequency                       cycle/nsecond     1.33
    Elapsed Cycles                     cycle             8,729
    Memory [%]                         %                 3.30
    DRAM Throughput                    %                 1.82
    Duration                           usecond           6.56
    L1/TEX Cache Throughput            %                 25.14
    L2 Cache Throughput                %                 3.30
    SM Active Cycles                   cycle             582.37
    Compute (SM) [%]                   %                 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:00, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.68
    SM Frequency                       cycle/nsecond     1.35
    Elapsed Cycles                     cycle             4,965
    Memory [%]                         %                 9.04
    DRAM Throughput                    %                 9.04
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.43
    L2 Cache Throughput                %                 7.99
    SM Active Cycles                   cycle             1,371.74
    Compute (SM) [%]                   %                 2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.91
    Achieved Active Warps Per SM       warp              7.64
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:01, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.16
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,371
    Memory [%]                         %                 3.44
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.59
    L1/TEX Cache Throughput            %                 24.85
    L2 Cache Throughput                %                 3.44
    SM Active Cycles                   cycle             588.25
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.23
    Achieved Active Warps Per SM       warp              3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
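Because the same two kernels are profiled over and over in this capture, it can help to pull just the headline numbers out of the report text. A small sketch that scans the reflowed text above and collects Duration and Achieved Occupancy per kernel; the file name report.txt is hypothetical, and the regexes are tied to the layout used here (a kernel header line followed by "metric  unit  value" rows):

# Aggregate Duration and Achieved Occupancy per kernel from the reflowed report text (sketch).
import re
from collections import defaultdict

header = re.compile(r"^void (\S+?)[(<].*Stream (\d+)\s*$")
metric = re.compile(r"^\s+(Duration|Achieved Occupancy)\s+\S+\s+([\d.,]+)\s*$")

launches = defaultdict(list)
current = None
with open("report.txt") as f:          # hypothetical file containing the text above
    for line in f:
        m = header.match(line)
        if m:
            current = f"{m.group(1)} (stream {m.group(2)})"
            launches[current].append({})
            continue
        m = metric.match(line)
        if m and current:
            launches[current][-1][m.group(1)] = float(m.group(2).replace(",", ""))

for kernel, runs in launches.items():
    durations = [r["Duration"] for r in runs if "Duration" in r]
    print(kernel, len(runs), "launches,",
          round(sum(durations) / len(durations), 2), "usecond avg duration")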
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:01, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.41
    SM Frequency                       cycle/nsecond     1.47
    Elapsed Cycles                     cycle             5,447
    Memory [%]                         %                 8.27
    DRAM Throughput                    %                 8.27
    Duration                           usecond           3.71
    L1/TEX Cache Throughput            %                 10.55
    L2 Cache Throughput                %                 7.31
    SM Active Cycles                   cycle             1,356.51
    Compute (SM) [%]                   %                 2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.16
    Achieved Active Warps Per SM       warp              7.76
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:01, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.03
    SM Frequency                       cycle/nsecond     1.25
    Elapsed Cycles                     cycle             8,328
    Memory [%]                         %                 3.45
    DRAM Throughput                    %                 1.92
    Duration                           usecond           6.66
    L1/TEX Cache Throughput            %                 25.25
    L2 Cache Throughput                %                 3.45
    SM Active Cycles                   cycle             578.56
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.30
    Achieved Active Warps Per SM       warp              3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:01, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.14
    SM Frequency                       cycle/nsecond     1.41
    Elapsed Cycles                     cycle             5,076
    Memory [%]                         %                 8.81
    DRAM Throughput                    %                 8.81
    Duration                           usecond           3.58
    L1/TEX Cache Throughput            %                 10.43
    L2 Cache Throughput                %                 7.82
    SM Active Cycles                   cycle             1,372.21
    Compute (SM) [%]                   %                 2.90
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.84
    Achieved Active Warps Per SM       warp              7.60
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
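The Achieved Occupancy row is just Achieved Active Warps Per SM divided by the 48-warp SM capacity. Both columns are rounded to two decimals in the report, so recomputing one from the other gives values that are close but not identical:

# Achieved Occupancy = Achieved Active Warps Per SM / 48 (sketch; inputs are rounded).
for warps, reported in [(7.60, 15.84), (7.70, 16.05), (3.99, 8.30)]:
    print(round(100 * warps / 48, 2), "vs reported", reported)
# 15.83 vs 15.84, 16.04 vs 16.05, 8.31 vs 8.3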
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.15
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             7,145
    Memory [%]                         %                 4.92
    DRAM Throughput                    %                 1.57
    Duration                           usecond           5.63
    L1/TEX Cache Throughput            %                 23.93
    L2 Cache Throughput                %                 4.92
    SM Active Cycles                   cycle             943.35
    Compute (SM) [%]                   %                 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.13
    SM Frequency                       cycle/nsecond     1.26
    Elapsed Cycles                     cycle             7,237
    Memory [%]                         %                 4.85
    DRAM Throughput                    %                 1.54
    Duration                           usecond           5.73
    L1/TEX Cache Throughput            %                 23.46
    L2 Cache Throughput                %                 4.85
    SM Active Cycles                   cycle             956.47
    Compute (SM) [%]                   %                 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.27
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.16
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,375
    Memory [%]                         %                 3.43
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.59
    L1/TEX Cache Throughput            %                 25.14
    L2 Cache Throughput                %                 3.43
    SM Active Cycles                   cycle             583.21
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.31
    Achieved Active Warps Per SM       warp              3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:02, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.71
    SM Frequency                       cycle/nsecond     1.35
    Elapsed Cycles                     cycle             4,961
    Memory [%]                         %                 9.01
    DRAM Throughput                    %                 9.01
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.23
    L2 Cache Throughput                %                 8.02
    SM Active Cycles                   cycle             1,398
    Compute (SM) [%]                   %                 2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.00
    Achieved Active Warps Per SM       warp              7.68
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:03, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.28
    SM Frequency                       cycle/nsecond     1.29
    Elapsed Cycles                     cycle             8,377
    Memory [%]                         %                 3.44
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.50
    L1/TEX Cache Throughput            %                 25.18
    L2 Cache Throughput                %                 3.44
    SM Active Cycles                   cycle             581.99
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:03, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.86
    SM Frequency                       cycle/nsecond     1.38
    Elapsed Cycles                     cycle             5,127
    Memory [%]                         %                 8.78
    DRAM Throughput                    %                 8.78
    Duration                           usecond           3.71
    L1/TEX Cache Throughput            %                 10.00
    L2 Cache Throughput                %                 7.75
    SM Active Cycles                   cycle             1,430.40
    Compute (SM) [%]                   %                 2.87
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.00
    Achieved Active Warps Per SM       warp              7.68
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:03, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.24
    SM Frequency                       cycle/nsecond     1.28
    Elapsed Cycles                     cycle             8,423
    Memory [%]                         %                 3.42
    DRAM Throughput                    %                 1.90
    Duration                           usecond           6.56
    L1/TEX Cache Throughput            %                 25.15
    L2 Cache Throughput                %                 3.42
    SM Active Cycles                   cycle             582.43
    Compute (SM) [%]                   %                 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.36
    Achieved Active Warps Per SM       warp              4.01
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:04, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.65
    SM Frequency                       cycle/nsecond     1.34
    Elapsed Cycles                     cycle             4,947
    Memory [%]                         %                 9.06
    DRAM Throughput                    %                 9.06
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.64
    L2 Cache Throughput                %                 8.04
    SM Active Cycles                   cycle             1,344.26
    Compute (SM) [%]                   %                 2.97
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.33
    Achieved Active Warps Per SM       warp              7.84
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:04, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.21 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,353 Memory [%] % 3.45 DRAM Throughput % 1.92 Duration usecond 6.50 L1/TEX Cache Throughput % 25.33 L2 Cache Throughput % 3.45 SM Active Cycles cycle 579.53 Compute (SM) [%] % 1.48 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.32 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
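The Duration, Elapsed Cycles and SM Frequency columns are mutually consistent; a quick check for the cutlass launch above (values copied from the report; the frequencies are rounded, so the result only agrees to about two digits):

    # Duration ~= Elapsed Cycles / SM Frequency (cycle/nsecond is equivalent to GHz)
    elapsed_cycles = 8_353
    sm_freq_ghz    = 1.28
    duration_us    = elapsed_cycles / sm_freq_ghz / 1000.0
    print(round(duration_us, 2))   # ~6.53 us vs. the reported 6.50 us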
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:04, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.94 SM Frequency cycle/nsecond 1.39 Elapsed Cycles cycle 4,945 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.55 L1/TEX Cache Throughput % 10.70 L2 Cache Throughput % 8.04 SM Active Cycles cycle 1,336.56 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.05 Achieved Active Warps Per SM warp 7.70 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
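The hardware figures the warnings keep referring to (68 multiprocessors, plus the per-SM limits used in the sketches above) come from the device itself. Since the profiled process is a Python workload, they can be double-checked from PyTorch; a small sketch, where device index 0 is an assumption:

    import torch

    props = torch.cuda.get_device_properties(0)
    print(props.name)                          # GPU model
    print(props.multi_processor_count)         # expected to print 68 for this device
    print(props.total_memory // (1024 ** 2))   # device memory in MiB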
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:04, Context 1, Stream 25 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.19 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 7,158 Memory [%] % 4.91 DRAM Throughput % 1.57 Duration usecond 5.60 L1/TEX Cache Throughput % 23.86 L2 Cache Throughput % 4.91 SM Active Cycles cycle 947.25 Compute (SM) [%] % 2.73 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.27 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:05, Context 1, Stream 27 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.25 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 7,246 Memory [%] % 4.85 DRAM Throughput % 1.54 Duration usecond 5.63 L1/TEX Cache Throughput % 23.47 L2 Cache Throughput % 4.85 SM Active Cycles cycle 958.79 Compute (SM) [%] % 2.69 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.29 Achieved Active Warps Per SM warp 3.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:05, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.10 SM Frequency cycle/nsecond 1.26 Elapsed Cycles cycle 8,361 Memory [%] % 3.44 DRAM Throughput % 1.91 Duration usecond 6.62 L1/TEX Cache Throughput % 24.95 L2 Cache Throughput % 3.44 SM Active Cycles cycle 586.32 Compute (SM) [%] % 1.48 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. 
If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.26 Achieved Active Warps Per SM warp 3.96 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:05, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.81 SM Frequency cycle/nsecond 1.37 Elapsed Cycles cycle 4,958 Memory [%] % 9.07 DRAM Throughput % 9.07 Duration usecond 3.62 L1/TEX Cache Throughput % 10.66 L2 Cache Throughput % 8.03 SM Active Cycles cycle 1,341.53 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.32 Achieved Active Warps Per SM warp 7.83 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:06, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.38 SM Frequency cycle/nsecond 1.31 Elapsed Cycles cycle 8,599 Memory [%] % 3.35 DRAM Throughput % 1.86 Duration usecond 6.56 L1/TEX Cache Throughput % 25.03 L2 Cache Throughput % 3.35 SM Active Cycles cycle 586.69 Compute (SM) [%] % 1.44 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.35 Achieved Active Warps Per SM warp 4.01 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:06, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 10 SM Frequency cycle/nsecond 1.55 Elapsed Cycles cycle 5,566 Memory [%] % 8.05 DRAM Throughput % 8.05 Duration usecond 3.58 L1/TEX Cache Throughput % 10.78 L2 Cache Throughput % 7.15 SM Active Cycles cycle 1,326.66 Compute (SM) [%] % 2.64 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.33 Achieved Active Warps Per SM warp 7.84 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. 
The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:06, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.24 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,427 Memory [%] % 3.41 DRAM Throughput % 1.90 Duration usecond 6.56 L1/TEX Cache Throughput % 25.02 L2 Cache Throughput % 3.41 SM Active Cycles cycle 583.78 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.29 Achieved Active Warps Per SM warp 3.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
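The "0.1 full waves" warning is a direct consequence of the grid size: one full wave needs enough blocks to give every SM as many resident blocks as it can hold. A sketch for the cutlass launch above, using its shared-memory limit of 2 resident blocks per SM:

    # Blocks needed for one full wave vs. blocks actually launched.
    num_sms          = 68
    blocks_per_sm    = 2                         # cutlass kernel, limited by shared memory
    full_wave_blocks = num_sms * blocks_per_sm   # 136 blocks per wave
    launched_blocks  = 8
    print(launched_blocks / full_wave_blocks)    # ~0.06, the reported "Waves Per SM"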
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:06, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.41 SM Frequency cycle/nsecond 1.46 Elapsed Cycles cycle 5,571 Memory [%] % 8.05 DRAM Throughput % 8.05 Duration usecond 3.81 L1/TEX Cache Throughput % 10.41 L2 Cache Throughput % 7.14 SM Active Cycles cycle 1,374.63 Compute (SM) [%] % 2.64 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 15.82 Achieved Active Warps Per SM warp 7.59 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:07, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.24 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,431 Memory [%] % 3.41 DRAM Throughput % 1.90 Duration usecond 6.56 L1/TEX Cache Throughput % 24.70 L2 Cache Throughput % 3.41 SM Active Cycles cycle 593.01 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.26 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
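The "Block Limit Registers 5" entry for the cutlass kernel can be reproduced the same way, assuming a 65,536-register file per SM and ignoring allocation granularity:

    regs_per_thread   = 96
    threads_per_block = 128
    regs_per_block    = regs_per_thread * threads_per_block   # 12,288 registers
    print(65536 // regs_per_block)                            # -> 5 blocks per SM

It is not the binding limit here, though: the shared-memory limit of 2 blocks per SM bites first.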
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:07, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.78 SM Frequency cycle/nsecond 1.37 Elapsed Cycles cycle 4,948 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.62 L1/TEX Cache Throughput % 10.69 L2 Cache Throughput % 8.11 SM Active Cycles cycle 1,338.62 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 15.87 Achieved Active Warps Per SM warp 7.62 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:07, Context 1, Stream 25 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.08 SM Frequency cycle/nsecond 1.26 Elapsed Cycles cycle 7,126 Memory [%] % 4.93 DRAM Throughput % 1.57 Duration usecond 5.66 L1/TEX Cache Throughput % 24.01 L2 Cache Throughput % 4.93 SM Active Cycles cycle 942.69 Compute (SM) [%] % 2.74 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.28 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:08, Context 1, Stream 27 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.21 SM Frequency cycle/nsecond 1.27 Elapsed Cycles cycle 7,226 Memory [%] % 4.86 DRAM Throughput % 1.55 Duration usecond 5.66 L1/TEX Cache Throughput % 23.29 L2 Cache Throughput % 4.86 SM Active Cycles cycle 952.53 Compute (SM) [%] % 2.70 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.31 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:08, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.04 SM Frequency cycle/nsecond 1.25 Elapsed Cycles cycle 8,391 Memory [%] % 3.43 DRAM Throughput % 1.91 Duration usecond 6.69 L1/TEX Cache Throughput % 25.12 L2 Cache Throughput % 3.43 SM Active Cycles cycle 583.34 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. 
If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.31 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:08, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.08 SM Frequency cycle/nsecond 1.41 Elapsed Cycles cycle 4,980 Memory [%] % 9.04 DRAM Throughput % 9.04 Duration usecond 3.52 L1/TEX Cache Throughput % 10.55 L2 Cache Throughput % 8.03 SM Active Cycles cycle 1,355.75 Compute (SM) [%] % 2.95 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.15 Achieved Active Warps Per SM warp 7.75 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:09, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.29 SM Frequency cycle/nsecond 1.29 Elapsed Cycles cycle 8,418 Memory [%] % 3.42 DRAM Throughput % 1.90 Duration usecond 6.50 L1/TEX Cache Throughput % 25.11 L2 Cache Throughput % 3.42 SM Active Cycles cycle 582.22 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.33 Achieved Active Warps Per SM warp 4.00 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:09, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.70 SM Frequency cycle/nsecond 1.35 Elapsed Cycles cycle 4,937 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.65 L1/TEX Cache Throughput % 10.43 L2 Cache Throughput % 8.08 SM Active Cycles cycle 1,371.62 Compute (SM) [%] % 2.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.16 Achieved Active Warps Per SM warp 7.76 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. 
The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:09, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.42 SM Frequency cycle/nsecond 1.31 Elapsed Cycles cycle 8,558 Memory [%] % 3.36 DRAM Throughput % 1.87 Duration usecond 6.53 L1/TEX Cache Throughput % 23.55 L2 Cache Throughput % 3.36 SM Active Cycles cycle 623.51 Compute (SM) [%] % 1.44 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 7.86 Achieved Active Warps Per SM warp 3.77 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:09, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.12 SM Frequency cycle/nsecond 1.42 Elapsed Cycles cycle 5,175 Memory [%] % 8.67 DRAM Throughput % 8.67 Duration usecond 3.65 L1/TEX Cache Throughput % 10.53 L2 Cache Throughput % 7.67 SM Active Cycles cycle 1,358.85 Compute (SM) [%] % 2.84 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.26 Achieved Active Warps Per SM warp 7.81 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:10, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.19
  SM Frequency                       cycle/nsecond     1.27
  Elapsed Cycles                     cycle             8,330
  Memory [%]                         %                 3.46
  DRAM Throughput                    %                 1.92
  Duration                           usecond           6.53
  L1/TEX Cache Throughput            %                 25.21
  L2 Cache Throughput                %                 3.46
  SM Active Cycles                   cycle             580.34
  Compute (SM) [%]                   %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            8
  Registers Per Thread               register/thread   96
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    Kbyte/block       49.15
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            1,024
  Waves Per SM                                         0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             2
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              8
  Theoretical Occupancy              %                 16.67
  Achieved Occupancy                 %                 8.30
  Achieved Active Warps Per SM       warp              3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
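As a quick consistency check of the Speed-of-Light numbers, the reported Duration is just Elapsed Cycles divided by the SM clock ("cycle/nsecond" is simply GHz); the small residual comes from the rounded frequency value:

    # Duration vs. cycles for the cutlass kernel above.
    elapsed_cycles = 8330
    sm_freq_ghz    = 1.27                                   # "SM Frequency" in cycle/nsecond
    duration_us    = elapsed_cycles / sm_freq_ghz / 1000.0  # ≈ 6.56 us vs. the reported 6.53 us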
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:10, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     9.03
  SM Frequency                       cycle/nsecond     1.40
  Elapsed Cycles                     cycle             5,063
  Memory [%]                         %                 8.84
  DRAM Throughput                    %                 8.84
  Duration                           usecond           3.62
  L1/TEX Cache Throughput            %                 10.29
  L2 Cache Throughput                %                 7.87
  SM Active Cycles                   cycle             1,390.15
  Compute (SM) [%]                   %                 2.90
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           256
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   30
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            8,192
  Waves Per SM                                         0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             8
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             6
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 16.06
  Achieved Active Warps Per SM       warp              7.71
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:10, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.82
  SM Frequency                       cycle/nsecond     1.22
  Elapsed Cycles                     cycle             5,105
  Memory [%]                         %                 3.10
  DRAM Throughput                    %                 2.51
  Duration                           usecond           4.19
  L1/TEX Cache Throughput            %                 11.63
  L2 Cache Throughput                %                 2.63
  SM Active Cycles                   cycle             1,360.07
  Compute (SM) [%]                   %                 10.68
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   18
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.7%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 74.28
  Achieved Active Warps Per SM       warp              35.65
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (74.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
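A rough sketch of the "partial wave" (tail effect) accounting in the launch warning above, assuming 68 SMs and the 3-blocks-per-SM warp limit from the Occupancy table; the tool's own bookkeeping differs by one block, so treat this as an approximation:

    # Tail-effect estimate for the 272-block CatArrayBatchedCopy launch.
    grid_size     = 272
    num_sms       = 68
    blocks_per_sm = 3                                  # 48 warps / 16 warps per 512-thread block

    blocks_per_wave = num_sms * blocks_per_sm          # 204 blocks execute concurrently
    full_waves      = grid_size // blocks_per_wave     # 1
    tail_blocks     = grid_size %  blocks_per_wave     # 68 (ncu reports 67)
    waves_per_sm    = grid_size / blocks_per_wave      # ≈ 1.33, matching "Waves Per SM"
    # During the tail only a fraction of the block slots are occupied, which is why
    # the warning attributes up to half the runtime to a lower-occupancy phase.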
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:10, Context 1, Stream 7 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 7.40 SM Frequency cycle/nsecond 1.15 Elapsed Cycles cycle 5,084 Memory [%] % 2.40 DRAM Throughput % 0.01 Duration usecond 4.42 L1/TEX Cache Throughput % 3.21 L2 Cache Throughput % 2.40 SM Active Cycles cycle 1,483.26 Compute (SM) [%] % 1.22 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 64 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 20 Shared Memory Configuration Size Kbyte 16.38 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.03 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 42 Block Limit Shared Mem block 16 Block Limit Warps block 24 Theoretical Active Warps per SM warp 32 Theoretical Occupancy % 66.67 Achieved Occupancy % 4.19 Achieved Active Warps Per SM warp 2.01 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
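A minimal sketch of why a 64-thread block caps theoretical occupancy at 66.7% here, assuming a 48-warp, 16-block SM; the register and warp limits are taken directly from the table and do not bind:

    # Block-count-bound occupancy for the 64-thread elementwise kernel above.
    max_warps_sm    = 48            # assumption (GA10x-class SM)
    max_blocks_sm   = 16
    warps_per_block = 64 // 32      # 2

    limit_smem = int(16.38 // 1.02)                       # 16: only the 1.02 KB/block driver reservation is used
    blocks     = min(max_blocks_sm, limit_smem, 42, 24)   # 16 -> blocks-per-SM (and driver smem) bind
    occupancy  = blocks * warps_per_block / max_warps_sm  # 32 / 48 ≈ 66.7 %
    # Larger blocks (128+ threads) would let the 16-block cap cover all 48 warp slots.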
ampere_sgemm_32x32_sliced1x4_tn, 2023-Apr-06 16:57:11, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.65
  SM Frequency                       cycle/nsecond     1.35
  Elapsed Cycles                     cycle             13,743
  Memory [%]                         %                 2.55
  DRAM Throughput                    %                 1.86
  Duration                           usecond           10.18
  L1/TEX Cache Throughput            %                 25.54
  L2 Cache Throughput                %                 2.47
  SM Active Cycles                   cycle             1,374.06
  Compute (SM) [%]                   %                 2.25
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            8
  Registers Per Thread               register/thread   86
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     Kbyte/block       32.77
  Threads                            thread            1,024
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             3
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              12
  Theoretical Occupancy              %                 25
  Achieved Occupancy                 %                 8.30
  Achieved Active Warps Per SM       warp              3.98
  WRN  This kernel's theoretical occupancy (25.0%) is limited by the required amount of shared memory. The difference between calculated theoretical (25.0%) and measured achieved occupancy (8.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.53
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             3,622
  Memory [%]                         %                 2.03
  DRAM Throughput                    %                 1.78
  Duration                           usecond           3.07
  L1/TEX Cache Throughput            %                 2.51
  L2 Cache Throughput                %                 2.03
  SM Active Cycles                   cycle             753.94
  Compute (SM) [%]                   %                 0.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   19
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            2,048
  Waves Per SM                                         0.03
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 4.22
  Achieved Active Warps Per SM       warp              2.03
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
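A speculative sketch of why this tanh launch is so tiny: the first template argument of vectorized_elementwise_kernel is the per-thread work size, so in this PyTorch build (assumed to use 64-thread blocks and 4 elements per thread) a 32-block grid implies a tensor of at most roughly 8,192 elements, far too little work for 68 SMs:

    # Approximate grid sizing for PyTorch's vectorized elementwise launch (assumptions noted above).
    numel      = 8192                                    # inferred upper bound from the launch shape
    block_size = 64
    vec_elems  = 4                                       # the <(int)4, ...> template parameter
    grid_size  = -(-numel // (vec_elems * block_size))   # ceil division -> 32 blocks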
void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.08
  SM Frequency                       cycle/nsecond     1.26
  Elapsed Cycles                     cycle             5,074
  Memory [%]                         %                 0.77
  DRAM Throughput                    %                 0.04
  Duration                           usecond           4.03
  L1/TEX Cache Throughput            %                 1.20
  L2 Cache Throughput                %                 0.77
  SM Active Cycles                   cycle             1,254.99
  Compute (SM) [%]                   %                 1.36
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   32
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            4,096
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             16
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 8.43
  Achieved Active Warps Per SM       warp              4.05
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     9.16
  SM Frequency                       cycle/nsecond     1.42
  Elapsed Cycles                     cycle             5,325
  Memory [%]                         %                 1.11
  DRAM Throughput                    %                 0.61
  Duration                           usecond           3.74
  L1/TEX Cache Throughput            %                 2.89
  L2 Cache Throughput                %                 1.11
  SM Active Cycles                   cycle             426.87
  Compute (SM) [%]                   %                 1.06
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           256
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            16
  Registers Per Thread               register/thread   28
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            4,096
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             8
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             6
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 16.70
  Achieved Active Warps Per SM       warp              8.02
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.36
  SM Frequency                       cycle/nsecond     1.12
  Elapsed Cycles                     cycle             7,269
  Memory [%]                         %                 35.30
  DRAM Throughput                    %                 2.49
  Duration                           usecond           6.50
  L1/TEX Cache Throughput            %                 37.80
  L2 Cache Throughput                %                 35.30
  SM Active Cycles                   cycle             4,779.07
  Compute (SM) [%]                   %                 30.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            960
  Registers Per Thread               register/thread   20
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            61,440
  Waves Per SM                                         0.88

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 48.11
  Achieved Active Warps Per SM       warp              23.09
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.56
  SM Frequency                       cycle/nsecond     1.30
  Elapsed Cycles                     cycle             18,581
  Memory [%]                         %                 35.49
  DRAM Throughput                    %                 35.49
  Duration                           usecond           14.27
  L1/TEX Cache Throughput            %                 18.01
  L2 Cache Throughput                %                 25.87
  SM Active Cycles                   cycle             13,854.76
  Compute (SM) [%]                   %                 42.70
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   26
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.9%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             4
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 73.06
  Achieved Active Warps Per SM       warp              35.07
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.73
  SM Frequency                       cycle/nsecond     1.34
  Elapsed Cycles                     cycle             13,977
  Memory [%]                         %                 40.58
  DRAM Throughput                    %                 40.58
  Duration                           usecond           10.43
  L1/TEX Cache Throughput            %                 20.56
  L2 Cache Throughput                %                 20.42
  SM Active Cycles                   cycle             10,105.90
  Compute (SM) [%]                   %                 19.44
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            60
  Registers Per Thread               register/thread   90
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    Kbyte/block       49.15
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.44
  WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             2
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              8
  Theoretical Occupancy              %                 16.67
  Achieved Occupancy                 %                 8.35
  Achieved Active Warps Per SM       warp              4.01
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.32
  SM Frequency                       cycle/nsecond     1.13
  Elapsed Cycles                     cycle             6,479
  Memory [%]                         %                 9.95
  DRAM Throughput                    %                 3.69
  Duration                           usecond           5.70
  L1/TEX Cache Throughput            %                 9.11
  L2 Cache Throughput                %                 9.95
  SM Active Cycles                   cycle             4,206.78
  Compute (SM) [%]                   %                 3.90
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            120
  Registers Per Thread               register/thread   22
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.11
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 7.31
  Achieved Active Warps Per SM       warp              3.51
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.55
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             3,884
  Memory [%]                         %                 6.18
  DRAM Throughput                    %                 6.18
  Duration                           usecond           3.30
  L1/TEX Cache Throughput            %                 3.70
  L2 Cache Throughput                %                 6.05
  SM Active Cycles                   cycle             1,770.01
  Compute (SM) [%]                   %                 1.61
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            120
  Registers Per Thread               register/thread   19
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.11
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 6.90
  Achieved Active Warps Per SM       warp              3.31
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     6.80
  SM Frequency                       cycle/nsecond     1.06
  Elapsed Cycles                     cycle             5,637
  Memory [%]                         %                 6.40
  DRAM Throughput                    %                 4.26
  Duration                           usecond           5.31
  L1/TEX Cache Throughput            %                 7.81
  L2 Cache Throughput                %                 6.40
  SM Active Cycles                   cycle             3,252.03
  Compute (SM) [%]                   %                 10.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            60
  Registers Per Thread               register/thread   28
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        16
  Threads                            thread            30,720
  Waves Per SM                                         0.29
  WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             4
  Block Limit Shared Mem             block             7
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 32.64
  Achieved Active Warps Per SM       warp              15.67
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (32.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
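A small sketch relating the two achieved-occupancy figures in the table above: Achieved Occupancy is simply Achieved Active Warps Per SM over the per-SM warp capacity (48 warps is an assumption about this GPU, consistent with the 100% theoretical value being 48 warps):

    # Achieved occupancy vs. achieved active warps for the reduce_kernel launch.
    achieved_active_warps = 15.67
    max_warps_sm          = 48
    achieved_occupancy    = achieved_active_warps / max_warps_sm   # ≈ 0.326 -> 32.6 %
    # With only 60 blocks on 68 SMs, several SMs never receive a block, so the
    # time-averaged warp count stays well below the 48-warp theoretical value.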
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.17
  SM Frequency                       cycle/nsecond     1.27
  Elapsed Cycles                     cycle             4,182
  Memory [%]                         %                 1.29
  DRAM Throughput                    %                 0.72
  Duration                           usecond           3.30
  L1/TEX Cache Throughput            %                 9.66
  L2 Cache Throughput                %                 1.29
  SM Active Cycles                   cycle             506.68
  Compute (SM) [%]                   %                 1.17
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            16
  Registers Per Thread               register/thread   21
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            2,048
  Waves Per SM                                         0.02
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             21
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 8.37
  Achieved Active Warps Per SM       warp              4.02
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.88
  SM Frequency                       cycle/nsecond     1.36
  Elapsed Cycles                     cycle             11,881
  Memory [%]                         %                 32.04
  DRAM Throughput                    %                 32.04
  Duration                           usecond           8.70
  L1/TEX Cache Throughput            %                 22.51
  L2 Cache Throughput                %                 14.94
  SM Active Cycles                   cycle             9,631.32
  Compute (SM) [%]                   %                 25.22
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            1,024
  Registers Per Thread               register/thread   45
  Shared Memory Configuration Size   Kbyte             65.54
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     Kbyte/block       2.56
  Threads                            thread            131,072
  Waves Per SM                                         1.51
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.8%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             10
  Block Limit Shared Mem             block             18
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              40
  Theoretical Occupancy              %                 83.33
  Achieved Occupancy                 %                 61.86
  Achieved Active Warps Per SM       warp              29.69
  WRN  This kernel's theoretical occupancy (83.3%) is limited by the number of required registers. The difference between calculated theoretical (83.3%) and measured achieved occupancy (61.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
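A sketch of the register-bound 83.3% theoretical occupancy reported above. It assumes 65,536 registers and 48 resident warps per SM, and a 256-register per-warp allocation granularity (typical for this class of GPU, but not stated anywhere in the report):

    # Reproduce the gemv2N_kernel occupancy limits from its Launch Statistics.
    import math

    regs_per_thread = 45
    warps_per_block = 128 // 32                                    # 4
    regs_per_warp   = math.ceil(regs_per_thread * 32 / 256) * 256  # 1440 rounded up to 1536 (assumed granularity)
    limit_regs      = 65536 // (regs_per_warp * warps_per_block)   # 10 blocks per SM -> "Block Limit Registers"
    limit_smem      = int(65.54 // (2.56 + 1.02))                  # 18 (static + driver shared memory per block)
    limit_warps     = 48 // warps_per_block                        # 12

    blocks    = min(16, limit_regs, limit_smem, limit_warps)       # 10: registers bind
    occupancy = blocks * warps_per_block / 48                      # 40 / 48 ≈ 83.3 %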
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.61
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             5,189
  Memory [%]                         %                 3.20
  DRAM Throughput                    %                 3.08
  Duration                           usecond           4.38
  L1/TEX Cache Throughput            %                 9.85
  L2 Cache Throughput                %                 3.19
  SM Active Cycles                   cycle             1,681.49
  Compute (SM) [%]                   %                 11.21
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   24
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 38.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 61.87
  Achieved Active Warps Per SM       warp              29.70
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (61.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:15, Context 1, Stream 29 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.54 SM Frequency cycle/nsecond 1.33 Elapsed Cycles cycle 11,764 Memory [%] % 4.49 DRAM Throughput % 3.39 Duration usecond 8.83 L1/TEX Cache Throughput % 25.76 L2 Cache Throughput % 4.49 SM Active Cycles cycle 890.47 Compute (SM) [%] % 1.92 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 90 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.24 Achieved Active Warps Per SM warp 3.96 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:15, Context 1, Stream 30 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.33 SM Frequency cycle/nsecond 1.30 Elapsed Cycles cycle 8,415 Memory [%] % 3.39 DRAM Throughput % 1.90 Duration usecond 6.46 L1/TEX Cache Throughput % 26.87 L2 Cache Throughput % 3.39 SM Active Cycles cycle 588.82 Compute (SM) [%] % 1.40 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.33 %
    Achieved Active Warps Per SM: 4.00 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:16, Context 1, Stream 30
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.43 cycle/nsecond
    SM Frequency: 1.46 cycle/nsecond
    Elapsed Cycles: 5,255 cycle
    Memory [%]: 8.54
    DRAM Throughput: 8.54 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.85 %
    L2 Cache Throughput: 7.60 %
    SM Active Cycles: 1,319.07 cycle
    Compute (SM) [%]: 2.80
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.94 %
    Achieved Active Warps Per SM: 7.65 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.78 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,102 cycle
    Memory [%]: 4.80
    DRAM Throughput: 4.37 %
    Duration: 4.22 usecond
    L1/TEX Cache Throughput: 12.18 %
    L2 Cache Throughput: 4.39 %
    SM Active Cycles: 2,009.10 cycle
    Compute (SM) [%]: 16.06
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 66.65 %
    Achieved Active Warps Per SM: 31.99 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (66.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.47 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 21,453 cycle
    Memory [%]: 72.91
    DRAM Throughput: 49.62 %
    Duration: 16.70 usecond
    L1/TEX Cache Throughput: 74.73 %
    L2 Cache Throughput: 72.91 %
    SM Active Cycles: 18,766.71 cycle
    Compute (SM) [%]: 45.24
    WRN: Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 5,419
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 346,816 thread
    Waves Per SM: 4.98
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 54.75 %
    Achieved Active Warps Per SM: 26.28 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.92 cycle/nsecond
    SM Frequency: 1.37 cycle/nsecond
    Elapsed Cycles: 96,602 cycle
    Memory [%]: 37.21
    DRAM Throughput: 37.21 %
    Duration: 70.18 usecond
    L1/TEX Cache Throughput: 29.19 %
    L2 Cache Throughput: 26.57 %
    SM Active Cycles: 61,162.43 cycle
    Compute (SM) [%]: 18.59
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 88
    Registers Per Thread: 250 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 98.30 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 11,264 thread
    Waves Per SM: 1.29
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor.
         This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 2 block
    Block Limit Shared Mem: 1 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 4 warp
    Theoretical Occupancy: 8.33 %
    Achieved Occupancy: 8.26 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN: This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.86 cycle/nsecond
    SM Frequency: 1.20 cycle/nsecond
    Elapsed Cycles: 20,023 cycle
    Memory [%]: 53.27
    DRAM Throughput: 53.27 %
    Duration: 16.58 usecond
    L1/TEX Cache Throughput: 15.11 %
    L2 Cache Throughput: 24.38 %
    SM Active Cycles: 16,883.25 cycle
    Compute (SM) [%]: 36.43
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.6 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 128
    Registers Per Thread: 40 register/thread
    Shared Memory Configuration Size: 32.77 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 8.19 Kbyte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 65,536 thread
    Waves Per SM: 0.63
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 3 block
    Block Limit Shared Mem: 3 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 60.89 %
    Achieved Active Warps Per SM: 29.23 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (60.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.52 cycle/nsecond
    SM Frequency: 1.18 cycle/nsecond
    Elapsed Cycles: 4,759 cycle
    Memory [%]: 0.93
    DRAM Throughput: 0.15 %
    Duration: 4.03 usecond
    L1/TEX Cache Throughput: 1.17 %
    L2 Cache Throughput: 0.93 %
    SM Active Cycles: 1,285.32 cycle
    Compute (SM) [%]: 1.45
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 32 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 16 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.18 %
    Achieved Active Warps Per SM: 3.93 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
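The 0.04 "Waves Per SM" reported for this indexSelectLargeIndex launch can be reproduced from the numbers in its own report, assuming the Nsight Compute hardware-model definition of a wave (grid size divided by SM count times the resident-block limit). A back-of-envelope check, with all inputs copied from the report:

    // Back-of-envelope check of "Waves Per SM" for the indexSelectLargeIndex launch above
    // (assumes waves = grid size / (SM count * resident blocks per SM)).
    int numSMs           = 68;   // multiprocessors reported for this GPU
    int gridSize         = 32;   // blocks launched
    int blockLimitWarps  = 12;   // tightest resident-block limit in the Occupancy section
    double blocksPerWave = (double)numSMs * blockLimitWarps;   // 816 blocks fill the GPU once
    double waves         = gridSize / blocksPerWave;           // 32 / 816 ~= 0.04, as reported

The same arithmetic explains the 1.33 waves of the CatArrayBatchedCopy launches (272 blocks / (68 SMs x 3 resident blocks)).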
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.03 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 5,299 cycle
    Memory [%]: 1.11
    DRAM Throughput: 0.61 %
    Duration: 3.78 usecond
    L1/TEX Cache Throughput: 3.00 %
    L2 Cache Throughput: 1.11 %
    SM Active Cycles: 411.87 cycle
    Compute (SM) [%]: 1.06
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN: The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.59 %
    Achieved Active Warps Per SM: 7.96 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
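Several launches in this trace (16 or 32 blocks on a 68-SM GPU) trigger the "grid is too small" warning. For kernels one controls, the runtime can both propose a block size and report the smallest grid that reaches full occupancy. A minimal sketch, assuming a hypothetical grid-stride kernel; scaleKernel and launchScale are illustrative names, not the dropout kernel above:

    // Hypothetical sketch: let the runtime suggest a block size, then never launch a grid
    // smaller than the size needed to put work on every SM.
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float* data, float alpha, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= alpha;
    }

    void launchScale(float* data, float alpha, int n) {
        int minGridSize = 0, blockSize = 0;
        // minGridSize = smallest grid that can reach full occupancy on this device
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
        int neededBlocks = (n + blockSize - 1) / blockSize;
        int gridSize = neededBlocks > minGridSize ? neededBlocks : minGridSize;
        scaleKernel<<<gridSize, blockSize>>>(data, alpha, n);
    }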
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.37 cycle/nsecond
    SM Frequency: 1.12 cycle/nsecond
    Elapsed Cycles: 7,242 cycle
    Memory [%]: 35.45
    DRAM Throughput: 2.24 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 37.96 %
    L2 Cache Throughput: 35.45 %
    SM Active Cycles: 4,840.43 cycle
    Compute (SM) [%]: 30.56
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 960
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 61,440 thread
    Waves Per SM: 0.88
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 48.14 %
    Achieved Active Warps Per SM: 23.11 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
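The memory-bound unrolled_elementwise_kernel launch earlier in the trace (L2 throughput around 73%, DRAM around 50%) draws the "do more work per memory access (kernel fusion)" suggestion. A minimal sketch of what that suggestion means, using illustrative operations rather than the actual PyTorch kernels: two elementwise passes each read and write every element, while the fused version touches memory once and needs no intermediate buffer.

    // Unfused: two kernels, two full passes over global memory, plus an intermediate array y.
    __global__ void scalePass(const float* x, float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i];
    }
    __global__ void addPass(const float* y, float* z, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = y[i] + b;
    }

    // Fused: one kernel, one read of x and one write of z per element.
    __global__ void scaleAddFused(const float* x, float* z, float a, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = a * x[i] + b;
    }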
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.51 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 18,660 cycle
    Memory [%]: 35.38
    DRAM Throughput: 35.38 %
    Duration: 14.43 usecond
    L1/TEX Cache Throughput: 17.91 %
    L2 Cache Throughput: 25.73 %
    SM Active Cycles: 13,814.72 cycle
    Compute (SM) [%]: 42.49
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 26 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 73.45 %
    Achieved Active Warps Per SM: 35.25 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.65 cycle/nsecond
    SM Frequency: 1.33 cycle/nsecond
    Elapsed Cycles: 14,013 cycle
    Memory [%]: 40.58
    DRAM Throughput: 40.58 %
    Duration: 10.53 usecond
    L1/TEX Cache Throughput: 20.21 %
    L2 Cache Throughput: 20.43 %
    SM Active Cycles: 10,277.59 cycle
    Compute (SM) [%]: 19.43
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.44
    WRN: The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.27 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
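The recurring 16.67% theoretical occupancy of these cutlass launches comes from their large dynamic shared-memory request: roughly 49 KB per block on a ~102 KB shared-memory configuration leaves room for only 2 resident blocks per SM. The same reasoning can be reproduced for one's own kernels with the occupancy API; the sketch below is hypothetical (smemHeavyKernel and the 48 KB figure are illustrative, not the cutlass kernel itself).

    // Hypothetical sketch: ask the runtime how many blocks fit per SM once a large
    // dynamic shared-memory allocation is taken into account.
    #include <cstdio>
    #include <cuda_runtime.h>

    extern __shared__ float tile[];                       // dynamic shared memory

    __global__ void smemHeavyKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // stage through shared memory
        if (i < n) out[i] = tile[threadIdx.x];
    }

    int main() {
        const int blockSize = 128;
        const size_t dynamicSmem = 48 * 1024;             // ~48 KB per block, illustrative
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smemHeavyKernel,
                                                      blockSize, dynamicSmem);
        printf("resident blocks per SM with %zu bytes of dynamic smem: %d\n",
               dynamicSmem, blocksPerSM);
        // Fewer resident blocks means fewer warps to hide latency; shrinking the per-block
        // tile (or enlarging the shared-memory carveout) raises this limit.
        return 0;
    }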
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.33 cycle/nsecond
    SM Frequency: 1.14 cycle/nsecond
    Elapsed Cycles: 6,413 cycle
    Memory [%]: 10.05
    DRAM Throughput: 3.72 %
    Duration: 5.63 usecond
    L1/TEX Cache Throughput: 9.20 %
    L2 Cache Throughput: 10.05 %
    SM Active Cycles: 4,185.32 cycle
    Compute (SM) [%]: 3.94
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 22 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 7.24 %
    Achieved Active Warps Per SM: 3.47 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.64 cycle/nsecond
    SM Frequency: 1.19 cycle/nsecond
    Elapsed Cycles: 3,918 cycle
    Memory [%]: 6.10
    DRAM Throughput: 6.10 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 3.61 %
    L2 Cache Throughput: 5.99 %
    SM Active Cycles: 1,816.96 cycle
    Compute (SM) [%]: 1.60
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 19 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 6.68 %
    Achieved Active Warps Per SM: 3.20 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
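The 66.67% ceiling reported for the 64-thread elementwise launches above follows directly from the block limits in their Occupancy sections: a 64-thread block is only 2 warps, and at most 16 such blocks can be resident, so 32 of the 48 warp slots per SM can ever be filled. A back-of-envelope check, with all inputs taken from the report (the 48-warp maximum is implied by the 100%-occupancy kernels elsewhere in this trace):

    // Reproduce the reported theoretical occupancy for a 64-thread block.
    int warpsPerBlock   = 64 / 32;            // 2 warps per block
    int blockLimitSM    = 16;                 // "Block Limit SM"
    int blockLimitRegs  = 42;                 // "Block Limit Registers"
    int blockLimitSmem  = 16;                 // "Block Limit Shared Mem"
    int blockLimitWarps = 24;                 // "Block Limit Warps" (48 / 2)
    int residentBlocks  = blockLimitSM;       // take the minimum of the four limits
    if (blockLimitRegs  < residentBlocks) residentBlocks = blockLimitRegs;
    if (blockLimitSmem  < residentBlocks) residentBlocks = blockLimitSmem;
    if (blockLimitWarps < residentBlocks) residentBlocks = blockLimitWarps;   // stays 16
    int activeWarps     = residentBlocks * warpsPerBlock;                     // 32 warps
    double theoreticalOcc = 100.0 * activeWarps / 48;                         // 66.67%, as reported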
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 6.74 cycle/nsecond
    SM Frequency: 1.05 cycle/nsecond
    Elapsed Cycles: 5,603 cycle
    Memory [%]: 6.43
    DRAM Throughput: 4.27 %
    Duration: 5.34 usecond
    L1/TEX Cache Throughput: 8.07 %
    L2 Cache Throughput: 6.43 %
    SM Active Cycles: 3,150.84 cycle
    Compute (SM) [%]: 10.51
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 30,720 thread
    Waves Per SM: 0.29
    WRN: The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 7 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 33.02 %
    Achieved Active Warps Per SM: 15.85 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (33.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.98 cycle/nsecond
    SM Frequency: 1.24 cycle/nsecond
    Elapsed Cycles: 4,092 cycle
    Memory [%]: 1.32
    DRAM Throughput: 0.74 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 9.77 %
    L2 Cache Throughput: 1.32 %
    SM Active Cycles: 501.09 cycle
    Compute (SM) [%]: 1.20
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 21 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.02
    WRN: The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 21 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.38 %
    Achieved Active Warps Per SM: 4.02 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.22 cycle/nsecond
    SM Frequency: 1.41 cycle/nsecond
    Elapsed Cycles: 11,975 cycle
    Memory [%]: 31.79
    DRAM Throughput: 31.79 %
    Duration: 8.45 usecond
    L1/TEX Cache Throughput: 22.70 %
    L2 Cache Throughput: 14.82 %
    SM Active Cycles: 9,554.74 cycle
    Compute (SM) [%]: 25.02
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 1,024
    Registers Per Thread: 45 register/thread
    Shared Memory Configuration Size: 65.54 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 2.56 Kbyte/block
    Threads: 131,072 thread
    Waves Per SM: 1.51
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.0%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 10 block
    Block Limit Shared Mem: 18 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 40 warp
    Theoretical Occupancy: 83.33 %
    Achieved Occupancy: 62.46 %
    Achieved Active Warps Per SM: 29.98 warp
    WRN: This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
         The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
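This is the only report in the section whose occupancy is capped by register pressure (45 registers per thread limits residency to 10 blocks per SM). The gemv kernel belongs to cuBLAS and cannot be recompiled, but for one's own kernels the usual knob is __launch_bounds__; the sketch below is hypothetical and only illustrates the mechanism.

    // Hypothetical sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    // tells the compiler the intended launch shape so it keeps register usage within the
    // budget implied by that occupancy target.
    __global__ void __launch_bounds__(128, 12)
    boundedAxpyKernel(const float* x, float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

The trade-off is that forcing fewer registers can introduce spills to local memory, so any occupancy gain has to be weighed against the extra memory traffic; nvcc's -maxrregcount flag applies a similar cap per compilation unit.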
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.80 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,247 cycle
    Memory [%]: 3.16
    DRAM Throughput: 3.05 %
    Duration: 4.32 usecond
    L1/TEX Cache Throughput: 9.67 %
    L2 Cache Throughput: 3.14 %
    SM Active Cycles: 1,713.79 cycle
    Compute (SM) [%]: 11.12
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 24 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 37.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 62.92 %
    Achieved Active Warps Per SM: 30.20 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (62.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
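A pattern across this whole section is that most kernels run for only 3-10 microseconds at very low utilization, so per-launch CPU overhead and idle SMs dominate the timeline. Where the launch sequence is static, CUDA graph capture can replay many short launches with a single call; the sketch below is a generic, hypothetical example (tinyKernel and the 8-launch loop are illustrative), not something taken from this application.

    // Hypothetical sketch: capture a fixed sequence of short launches once, then replay it.
    #include <cuda_runtime.h>

    __global__ void tinyKernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    void runWithGraph(float* data, int n, int iterations) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int k = 0; k < 8; ++k)                   // the short, fixed launch sequence
            tinyKernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        for (int it = 0; it < iterations; ++it)       // one launch call per replay
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
    }

PyTorch exposes a comparable capture mechanism (torch.cuda.CUDAGraph); whether it applies here depends on how static the model's launch sequence is.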
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:20, Context 1, Stream 31
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.63 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 11,851 cycle
    Memory [%]: 4.46
    DRAM Throughput: 3.36 %
    Duration: 8.83 usecond
    L1/TEX Cache Throughput: 26.00 %
    L2 Cache Throughput: 4.46 %
    SM Active Cycles: 882.29 cycle
    Compute (SM) [%]: 1.91
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:21, Context 1, Stream 32
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.25 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 8,382 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.90 %
    Duration: 6.53 usecond
    L1/TEX Cache Throughput: 25.84 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 611.31 cycle
    Compute (SM) [%]: 1.40
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 7.92 %
    Achieved Active Warps Per SM: 3.80 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:21, Context 1, Stream 32
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.62 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 4,893 cycle
    Memory [%]: 9.18
    DRAM Throughput: 9.18 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.64 %
    L2 Cache Throughput: 8.16 %
    SM Active Cycles: 1,344.53 cycle
    Compute (SM) [%]: 3.00
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.54 %
    Achieved Active Warps Per SM: 7.46 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.43 cycle/nsecond
    SM Frequency: 1.15 cycle/nsecond
    Elapsed Cycles: 4,885 cycle
    Memory [%]: 5.01
    DRAM Throughput: 4.54 %
    Duration: 4.26 usecond
    L1/TEX Cache Throughput: 12.67 %
    L2 Cache Throughput: 4.59 %
    SM Active Cycles: 1,931.32 cycle
    Compute (SM) [%]: 16.80
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 67.24 %
    Achieved Active Warps Per SM: 32.28 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (67.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:21, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.53 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 21,655 cycle
    Memory [%]: 72.27
    DRAM Throughput: 48.57 %
    Duration: 16.74 usecond
    L1/TEX Cache Throughput: 74.03 %
    L2 Cache Throughput: 72.27 %
    SM Active Cycles: 18,808.60 cycle
    Compute (SM) [%]: 44.83
    WRN  Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 5,419
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 346,816 thread
    Waves Per SM: 4.98

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 54.75 %
    Achieved Active Warps Per SM: 26.28 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:21, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.07 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 96,474 cycle
    Memory [%]: 37.23
    DRAM Throughput: 37.23 %
    Duration: 68.99 usecond
    L1/TEX Cache Throughput: 29.07 %
    L2 Cache Throughput: 26.60 %
    SM Active Cycles: 61,365.76 cycle
    Compute (SM) [%]: 18.61
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 88
    Registers Per Thread: 250 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 98.30 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 11,264 thread
    Waves Per SM: 1.29
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 2 block
    Block Limit Shared Mem: 1 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 4 warp
    Theoretical Occupancy: 8.33 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.79 cycle/nsecond
    SM Frequency: 1.19 cycle/nsecond
    Elapsed Cycles: 19,703 cycle
    Memory [%]: 54.12
    DRAM Throughput: 54.12 %
    Duration: 16.45 usecond
    L1/TEX Cache Throughput: 15.03 %
    L2 Cache Throughput: 24.81 %
    SM Active Cycles: 16,976.29 cycle
    Compute (SM) [%]: 37.04
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.6 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 128
    Registers Per Thread: 40 register/thread
    Shared Memory Configuration Size: 32.77 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 8.19 Kbyte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 65,536 thread
    Waves Per SM: 0.63
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 3 block
    Block Limit Shared Mem: 3 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 60.75 %
    Achieved Active Warps Per SM: 29.16 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (60.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.04 cycle/nsecond
    SM Frequency: 1.25 cycle/nsecond
    Elapsed Cycles: 5,154 cycle
    Memory [%]: 0.96
    DRAM Throughput: 0.31 %
    Duration: 4.13 usecond
    L1/TEX Cache Throughput: 1.18 %
    L2 Cache Throughput: 0.96 %
    SM Active Cycles: 1,273.43 cycle
    Compute (SM) [%]: 1.34
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 32 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 16 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.10 %
    Achieved Active Warps Per SM: 3.89 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
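As a rough cross-check of the "Theoretical Occupancy" figures in these Occupancy sections: the value follows from the binding block limit and the warps per block. The helper below is a hypothetical sketch, not Nsight Compute output; the 48-warp-per-SM ceiling and warp size of 32 are taken from the 100% rows of this profile.

    # Hypothetical helper reproducing ncu's Theoretical Occupancy arithmetic (a sketch).
    def theoretical_occupancy(block_size, block_limits, max_warps_per_sm=48, warp_size=32):
        warps_per_block = block_size // warp_size
        active_blocks = min(block_limits)          # the binding Block Limit row
        active_warps = active_blocks * warps_per_block
        return active_warps, 100.0 * active_warps / max_warps_per_sm

    # indexSelectLargeIndex above: block size 128, limits SM=16, Registers=16, Shared Mem=16, Warps=12
    warps, occ = theoretical_occupancy(128, [16, 16, 16, 12])
    print(warps, round(occ, 2))                    # 48 warps -> 100.0 %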
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 4,812 cycle
    Memory [%]: 1.22
    DRAM Throughput: 0.67 %
    Duration: 3.74 usecond
    L1/TEX Cache Throughput: 2.86 %
    L2 Cache Throughput: 1.22 %
    SM Active Cycles: 431.44 cycle
    Compute (SM) [%]: 1.17
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.46 %
    Achieved Active Warps Per SM: 7.90 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
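The "Waves Per SM" figure in these Launch Statistics appears to follow grid size / (SM count x active blocks per SM). The sketch below is an assumption-labelled illustration (not part of the profiled application) using the fused_dropout_kernel_vec launch above; it queries the SM count through PyTorch, which is a real API, while the arithmetic itself is my reconstruction.

    import torch

    # Rough reconstruction of Waves Per SM for the launch above (a sketch; requires a CUDA device).
    props = torch.cuda.get_device_properties(0)
    num_sms = props.multi_processor_count      # 68 on the GPU profiled here
    grid_size = 16                             # Grid Size from the report
    blocks_per_sm = 6                          # binding block limit (Block Limit Warps)
    print(round(grid_size / (num_sms * blocks_per_sm), 2))   # ~0.04, matching the report when num_sms == 68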
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.19 cycle/nsecond
    SM Frequency: 1.09 cycle/nsecond
    Elapsed Cycles: 7,224 cycle
    Memory [%]: 35.55
    DRAM Throughput: 2.37 %
    Duration: 6.59 usecond
    L1/TEX Cache Throughput: 38.06 %
    L2 Cache Throughput: 35.55 %
    SM Active Cycles: 4,768.66 cycle
    Compute (SM) [%]: 30.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 960
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 61,440 thread
    Waves Per SM: 0.88

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 48.86 %
    Achieved Active Warps Per SM: 23.45 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.59 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 18,607 cycle
    Memory [%]: 34.82
    DRAM Throughput: 34.82 %
    Duration: 14.27 usecond
    L1/TEX Cache Throughput: 17.97 %
    L2 Cache Throughput: 25.77 %
    SM Active Cycles: 13,854.60 cycle
    Compute (SM) [%]: 42.62
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 26 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 73.37 %
    Achieved Active Warps Per SM: 35.22 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
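The full/partial wave split in the tail-effect warning above can be approximated from the grid size and the binding block limit. This is a back-of-the-envelope sketch under those assumptions; Nsight Compute's own accounting can differ by a block (it reports 67 here).

    # Rough estimate of the wave split for the CatArrayBatchedCopy launch above (a sketch).
    num_sms = 68
    blocks_per_sm = 3                          # binding block limit (Block Limit Warps)
    grid_size = 272                            # Grid Size from the report

    blocks_per_wave = num_sms * blocks_per_sm  # 204 blocks in a full wave
    full_waves, partial = divmod(grid_size, blocks_per_wave)
    print(full_waves, partial)                 # 1 full wave plus a ~67-68 block partial wave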
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.58 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 14,003 cycle
    Memory [%]: 40.63
    DRAM Throughput: 40.63 %
    Duration: 10.59 usecond
    L1/TEX Cache Throughput: 20.20 %
    L2 Cache Throughput: 20.42 %
    SM Active Cycles: 10,282.13 cycle
    Compute (SM) [%]: 19.43
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.44
    WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.22 %
    Achieved Active Warps Per SM: 3.95 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
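The "Block Limit Shared Mem 2" row for this cutlass kernel follows directly from the shared-memory numbers in its Launch Statistics; the short sketch below just redoes that division (values copied from the report, the arithmetic is my reconstruction).

    # Shared-memory block limit for the cutlass launch above (a sketch).
    shared_config_kb = 102.40                  # Shared Memory Configuration Size
    per_block_kb = 1.02 + 49.15                # Driver + Dynamic Shared Memory Per Block
    print(int(shared_config_kb // per_block_kb))   # 2 blocks per SM -> 2 * 4 warps = 8 warps = 16.67 %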
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.13 cycle/nsecond
    SM Frequency: 1.11 cycle/nsecond
    Elapsed Cycles: 6,401 cycle
    Memory [%]: 10.01
    DRAM Throughput: 3.74 %
    Duration: 5.76 usecond
    L1/TEX Cache Throughput: 9.21 %
    L2 Cache Throughput: 10.01 %
    SM Active Cycles: 4,229.12 cycle
    Compute (SM) [%]: 3.95
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 22 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 7.18 %
    Achieved Active Warps Per SM: 3.45 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
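Several of the memory-bound warnings earlier suggest doing more work per memory access (kernel fusion). One hedged way to act on that for elementwise chains like the ones profiled here is to script them so PyTorch's fuser may combine them into a single kernel. This is an illustrative sketch, not code from the profiled application, and whether fusion actually happens depends on the PyTorch version and fuser in use.

    import torch

    # Sketch: a scripted chain of elementwise ops that the TorchScript fuser may emit as one kernel,
    # so each element loaded from DRAM is reused for the add, tanh, and scale.
    @torch.jit.script
    def fused_bias_tanh(x: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x + b) * 0.5

    x = torch.randn(4096, 1024, device="cuda")
    b = torch.randn(1024, device="cuda")
    y = fused_bias_tanh(x, b)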
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.37 cycle/nsecond
    SM Frequency: 1.15 cycle/nsecond
    Elapsed Cycles: 3,896 cycle
    Memory [%]: 6.16
    DRAM Throughput: 6.16 %
    Duration: 3.39 usecond
    L1/TEX Cache Throughput: 3.71 %
    L2 Cache Throughput: 5.80 %
    SM Active Cycles: 1,765.78 cycle
    Compute (SM) [%]: 1.61
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 19 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 6.93 %
    Achieved Active Warps Per SM: 3.33 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 6.97 cycle/nsecond
    SM Frequency: 1.08 cycle/nsecond
    Elapsed Cycles: 5,815 cycle
    Memory [%]: 6.16
    DRAM Throughput: 4.11 %
    Duration: 5.38 usecond
    L1/TEX Cache Throughput: 7.85 %
    L2 Cache Throughput: 6.16 %
    SM Active Cycles: 3,235.85 cycle
    Compute (SM) [%]: 10.12
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 30,720 thread
    Waves Per SM: 0.29
    WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 7 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 33.16 %
    Achieved Active Warps Per SM: 15.92 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (33.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.08 cycle/nsecond
    SM Frequency: 1.26 cycle/nsecond
    Elapsed Cycles: 4,150 cycle
    Memory [%]: 1.30
    DRAM Throughput: 0.73 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 9.49 %
    L2 Cache Throughput: 1.30 %
    SM Active Cycles: 515.75 cycle
    Compute (SM) [%]: 1.18
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 21 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.02
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 21 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.18 %
    Achieved Active Warps Per SM: 3.93 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
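For tiny grids like this softmax_warp_forward launch, the gap between 100% theoretical and ~8% achieved occupancy is mostly just grid size: the whole grid only contains 64 warps. The arithmetic below is a rough sketch assuming at most one block ends up resident per active SM.

    # Why achieved occupancy sits near 8% for the 16-block softmax launch above (a sketch).
    grid_size, block_size, warp_size, max_warps_per_sm = 16, 128, 32, 48
    warps_per_block = block_size // warp_size            # 4 warps per block
    total_warps = grid_size * warps_per_block            # 64 warps in the whole grid
    occ_on_active_sm = 100.0 * warps_per_block / max_warps_per_sm
    print(total_warps, round(occ_on_active_sm, 2))       # 64 warps, ~8.33 % (report: 8.18 %, 3.93 warps)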
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.93 cycle/nsecond
    SM Frequency: 1.37 cycle/nsecond
    Elapsed Cycles: 11,735 cycle
    Memory [%]: 32.47
    DRAM Throughput: 32.47 %
    Duration: 8.54 usecond
    L1/TEX Cache Throughput: 22.77 %
    L2 Cache Throughput: 15.16 %
    SM Active Cycles: 9,522.75 cycle
    Compute (SM) [%]: 25.51
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 1,024
    Registers Per Thread: 45 register/thread
    Shared Memory Configuration Size: 65.54 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 2.56 Kbyte/block
    Threads: 131,072 thread
    Waves Per SM: 1.51
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 24.5%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 10 block
    Block Limit Shared Mem: 18 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 40 warp
    Theoretical Occupancy: 83.33 %
    Achieved Occupancy: 62.94 %
    Achieved Active Warps Per SM: 30.21 warp
    WRN  This kernel's theoretical occupancy (83.3%) is limited by the number of required registers.
    WRN  The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
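The register limit on this gemv2N launch can be reconstructed approximately if one assumes a 64K-register SM and a 256-register warp allocation granularity (both are assumptions about this GPU, not values printed in the report). A sketch:

    import math

    # Approximate "Block Limit Registers 10" for the gemv2N launch above (a sketch with assumed
    # register file size and allocation granularity).
    regs_per_thread, block_size, warp_size = 45, 128, 32
    regfile_per_sm, alloc_granularity = 64 * 1024, 256

    regs_per_warp = math.ceil(regs_per_thread * warp_size / alloc_granularity) * alloc_granularity
    regs_per_block = regs_per_warp * (block_size // warp_size)   # 1536 * 4 = 6144
    block_limit_regs = regfile_per_sm // regs_per_block          # 10 blocks
    warps = block_limit_regs * (block_size // warp_size)         # 40 warps
    print(block_limit_regs, warps, round(100.0 * warps / 48, 2)) # 10, 40, 83.33 %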
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.04 cycle/nsecond
    SM Frequency: 1.24 cycle/nsecond
    Elapsed Cycles: 5,377 cycle
    Memory [%]: 3.09
    DRAM Throughput: 2.96 %
    Duration: 4.32 usecond
    L1/TEX Cache Throughput: 9.85 %
    L2 Cache Throughput: 3.07 %
    SM Active Cycles: 1,681.04 cycle
    Compute (SM) [%]: 10.81
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 24 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 38.3%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 61.66 %
    Achieved Active Warps Per SM: 29.60 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (61.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:25, Context 1, Stream 33

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.89 cycle/nsecond
    SM Frequency: 1.39 cycle/nsecond
    Elapsed Cycles: 12,251 cycle
    Memory [%]: 4.31
    DRAM Throughput: 3.26 %
    Duration: 8.83 usecond
    L1/TEX Cache Throughput: 25.72 %
    L2 Cache Throughput: 4.31 %
    SM Active Cycles: 892.06 cycle
    Compute (SM) [%]: 1.84
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.34 %
    Achieved Active Warps Per SM: 4.00 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:26, Context 1, Stream 34

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.40 cycle/nsecond
    SM Frequency: 1.31 cycle/nsecond
    Elapsed Cycles: 8,448 cycle
    Memory [%]: 3.38
    DRAM Throughput: 1.89 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 26.81 %
    L2 Cache Throughput: 3.38 %
    SM Active Cycles: 588.91 cycle
    Compute (SM) [%]: 1.39
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.24 %
    Achieved Active Warps Per SM: 3.95 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:26, Context 1, Stream 34

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.45 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 4,819 cycle
    Memory [%]: 9.37
    DRAM Throughput: 9.37 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 11.16 %
    L2 Cache Throughput: 8.29 %
    SM Active Cycles: 1,282.44 cycle
    Compute (SM) [%]: 3.05
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:26, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.81 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,014 cycle
    Memory [%]: 4.88
    DRAM Throughput: 4.45 %
    Duration: 4.13 usecond
    L1/TEX Cache Throughput: 12.32 %
    L2 Cache Throughput: 4.50 %
    SM Active Cycles: 1,985.81 cycle
    Compute (SM) [%]: 16.32
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 68.05 %
    Achieved Active Warps Per SM: 32.66 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (68.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:26, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.62 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 21,565 cycle
    Memory [%]: 72.38
    DRAM Throughput: 48.70 %
    Duration: 16.51 usecond
    L1/TEX Cache Throughput: 74.29 %
    L2 Cache Throughput: 72.38 %
    SM Active Cycles: 18,796.43 cycle
    Compute (SM) [%]: 44.98
    WRN  Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 64 Function Cache Configuration cudaFuncCachePreferNone Grid Size 5,419 Registers Per Thread register/thread 20 Shared Memory Configuration Size Kbyte 16.38 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 346,816 Waves Per SM 4.98 ---------------------------------------------------------------------- --------------- ------------------------------ Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 42 Block Limit Shared Mem block 16 Block Limit Warps block 24 Theoretical Active Warps per SM warp 32 Theoretical Occupancy % 66.67 Achieved Occupancy % 54.69 Achieved Active Warps Per SM warp 26.25 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:26, Context 1, Stream 7 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.98 SM Frequency cycle/nsecond 1.38 Elapsed Cycles cycle 96,009 Memory [%] % 37.59 DRAM Throughput % 37.59 Duration usecond 69.34 L1/TEX Cache Throughput % 29.21 L2 Cache Throughput % 26.75 SM Active Cycles cycle 61,011.24 Compute (SM) [%] % 18.71 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 88 Registers Per Thread register/thread 250 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 98.30 Static Shared Memory Per Block byte/block 0 Threads thread 11,264 Waves Per SM 1.29 ---------------------------------------------------------------------- --------------- ------------------------------ WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. 
        This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          2
  Block Limit Shared Mem                   block                          1
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           4
  Theoretical Occupancy                    %                           8.33
  Achieved Occupancy                       %                           8.34
  Achieved Active Warps Per SM             warp                        4.00

  WRN   This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.72
  SM Frequency                             cycle/nsecond               1.18
  Elapsed Cycles                           cycle                     19,602
  Memory [%]                               %                          54.45
  DRAM Throughput                          %                          54.45
  Duration                                 usecond                    16.51
  L1/TEX Cache Throughput                  %                          15.07
  L2 Cache Throughput                      %                          24.94
  SM Active Cycles                         cycle                  16,925.59
  Compute (SM) [%]                         %                          37.24

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.6
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            128
  Registers Per Thread                     register/thread               40
  Shared Memory Configuration Size         Kbyte                      32.77
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                 8.19
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    65,536
  Waves Per SM                                                         0.63

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          3
  Block Limit Shared Mem                   block                          3
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          60.95
  Achieved Active Warps Per SM             warp                       29.25

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (60.9%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel.
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.69
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                      4,780
  Memory [%]                               %                           1.05
  DRAM Throughput                          %                           0.40
  Duration                                 usecond                     3.97
  L1/TEX Cache Throughput                  %                           1.16
  L2 Cache Throughput                      %                           1.05
  SM Active Cycles                         cycle                   1,293.65
  Compute (SM) [%]                         %                           1.45

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             32
  Registers Per Thread                     register/thread               32
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     4,096
  Waves Per SM                                                         0.04

  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         16
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                           8.14
  Achieved Active Warps Per SM             warp                        3.91

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (8.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
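  [Editor's note] The launch-configuration warning above (32 blocks on a 68-SM GPU) is about grids smaller
  than the SM count. A hedged sketch, not taken from the profiled code: a grid-stride loop lets the grid be
  sized from the device's multiprocessor count rather than from the element count, so every SM can get at
  least one block even when n is small. The kernel, block size, and 4x cap below are illustrative choices.

  #include <algorithm>
  #include <cuda_runtime.h>

  __global__ void copy_kernel(const float* __restrict__ in, float* __restrict__ out, int n)
  {
      // Grid-stride loop: correctness does not depend on the grid size,
      // so the launch can be sized for the hardware instead of for n.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
          out[i] = in[i];
  }

  void launch_copy(const float* in, float* out, int n)
  {
      int device = 0, num_sms = 0;
      cudaGetDevice(&device);
      cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

      const int block = 128;
      int blocks_for_n = (n + block - 1) / block;
      // At least one block per SM (e.g. 68 on this GPU); cap at a few waves so
      // very large n does not create an enormous grid (design choice, not required).
      int grid = std::min(std::max(blocks_for_n, num_sms), num_sms * 4);
      copy_kernel<<<grid, block>>>(in, out, n);
  }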
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.33
  SM Frequency                             cycle/nsecond               1.45
  Elapsed Cycles                           cycle                      5,480
  Memory [%]                               %                           1.08
  DRAM Throughput                          %                           0.59
  Duration                                 usecond                     3.78
  L1/TEX Cache Throughput                  %                           3.00
  L2 Cache Throughput                      %                           1.08
  SM Active Cycles                         cycle                     412.06
  Compute (SM) [%]                         %                           1.03

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           256
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             16
  Registers Per Thread                     register/thread               28
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     4,096
  Waves Per SM                                                         0.04

  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          8
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          6
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          16.65
  Achieved Active Warps Per SM             warp                        7.99

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.50
  SM Frequency                             cycle/nsecond               1.14
  Elapsed Cycles                           cycle                      7,346
  Memory [%]                               %                          34.80
  DRAM Throughput                          %                           2.29
  Duration                                 usecond                     6.43
  L1/TEX Cache Throughput                  %                          37.36
  L2 Cache Throughput                      %                          34.80
  SM Active Cycles                         cycle                   4,835.15
  Compute (SM) [%]                         %                          30.09

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.9
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            960
  Registers Per Thread                     register/thread               20
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                    61,440
  Waves Per SM                                                         0.88

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                          47.86
  Achieved Active Warps Per SM             warp                       22.97

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (47.9%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.68
  SM Frequency                             cycle/nsecond               1.31
  Elapsed Cycles                           cycle                     18,747
  Memory [%]                               %                          34.72
  DRAM Throughput                          %                          34.72
  Duration                                 usecond                    14.21
  L1/TEX Cache Throughput                  %                          17.82
  L2 Cache Throughput                      %                          25.63
  SM Active Cycles                         cycle                  13,817.29
  Compute (SM) [%]                         %                          42.30

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            272
  Registers Per Thread                     register/thread               26
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   139,264
  Waves Per SM                                                         1.33

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 27.1%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          4
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          72.91
  Achieved Active Warps Per SM             warp                       35.00

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (72.9%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
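  [Editor's note] The tail-effect warning above defines a wave as the number of blocks that can run
  concurrently. The host-side arithmetic below illustrates that definition using the figures reported for
  this launch (272 blocks of 512 threads, a 3-block-per-SM occupancy limit, 68 SMs). It is a simplified
  model; the exact partial-wave count ncu prints can differ slightly from this calculation.

  #include <cstdio>

  int main()
  {
      // Figures taken from the CatArrayBatchedCopy launch above.
      const int grid_size     = 272;  // blocks in the launch
      const int blocks_per_sm = 3;    // 3 blocks x 16 warps/block = 48 resident warps (the warp limit)
      const int num_sms       = 68;   // multiprocessors on this GPU

      const int blocks_per_wave  = blocks_per_sm * num_sms;             // 204
      const int full_waves       = grid_size / blocks_per_wave;         // 1
      const int tail_blocks      = grid_size % blocks_per_wave;         // 68 in this simplified model
      const double waves_per_sm  = (double)grid_size / blocks_per_wave; // ~1.33, matching "Waves Per SM"

      printf("blocks/wave=%d full=%d tail=%d waves/SM=%.2f\n",
             blocks_per_wave, full_waves, tail_blocks, waves_per_sm);
      return 0;
  }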
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.71
  SM Frequency                             cycle/nsecond               1.33
  Elapsed Cycles                           cycle                     13,942
  Memory [%]                               %                          40.67
  DRAM Throughput                          %                          40.67
  Duration                                 usecond                    10.43
  L1/TEX Cache Throughput                  %                          20.53
  L2 Cache Throughput                      %                          20.47
  SM Active Cycles                         cycle                  10,126.29
  Compute (SM) [%]                         %                          19.51

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.4
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             60
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.44

  WRN   The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.26
  Achieved Active Warps Per SM             warp                        3.97

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
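  [Editor's note] The cutlass entry above is capped at 2 resident blocks per SM by its ~49 KB of dynamic
  shared memory per block. For kernels you control, the occupancy API can show that limit before profiling.
  The sketch below is hedged: my_kernel and the 48 KB request are placeholders of a similar order, not the
  cutlass kernel itself, and the "2 blocks" expectation assumes a ~100 KB shared-memory carveout per SM as
  reported above.

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void my_kernel(float* data)
  {
      extern __shared__ float tile[];          // dynamic shared memory
      tile[threadIdx.x] = data[threadIdx.x];
      __syncthreads();
      data[threadIdx.x] = tile[threadIdx.x];
  }

  int main()
  {
      const int block_size   = 128;
      const size_t dyn_smem  = 48 * 1024;      // ~48 KB/block, similar order to the launch above

      int max_blocks = 0;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, block_size, dyn_smem);

      // With ~100 KB of shared memory per SM and ~49 KB per block (plus ~1 KB driver overhead),
      // only 2 blocks fit, matching the "Block Limit Shared Mem  block  2" row above.
      printf("max resident blocks per SM: %d\n", max_blocks);
      return 0;
  }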
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.19
  SM Frequency                             cycle/nsecond               1.12
  Elapsed Cycles                           cycle                      6,405
  Memory [%]                               %                          10.07
  DRAM Throughput                          %                           3.73
  Duration                                 usecond                     5.73
  L1/TEX Cache Throughput                  %                           9.20
  L2 Cache Throughput                      %                          10.07
  SM Active Cycles                         cycle                   4,169.43
  Compute (SM) [%]                         %                           3.94

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            120
  Registers Per Thread                     register/thread               22
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.11

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                           7.23
  Achieved Active Warps Per SM             warp                        3.47

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.90
  SM Frequency                             cycle/nsecond               1.23
  Elapsed Cycles                           cycle                      3,934
  Memory [%]                               %                           6.09
  DRAM Throughput                          %                           6.09
  Duration                                 usecond                     3.20
  L1/TEX Cache Throughput                  %                           3.74
  L2 Cache Throughput                      %                           5.77
  SM Active Cycles                         cycle                   1,751.49
  Compute (SM) [%]                         %                           1.60

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            120
  Registers Per Thread                     register/thread               19
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.11

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                           6.87
  Achieved Active Warps Per SM             warp                        3.30

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
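  [Editor's note] Several of the entries above launch 64-thread blocks; with the 16-blocks-per-SM hardware
  limit that allows at most 32 resident warps (66.7% theoretical occupancy). For kernels you author (not the
  PyTorch kernels profiled here), the occupancy API can suggest a block size that lifts that ceiling. A
  hedged sketch with a placeholder kernel:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(float* x, float s, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= s;
  }

  int main()
  {
      int min_grid = 0, block = 0;
      // Suggests a block size that maximizes theoretical occupancy for this kernel
      // (no dynamic shared memory, no block-size limit).
      cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale_kernel, 0, 0);

      // With 64-thread blocks the 16-block-per-SM cap leaves only 32 resident warps;
      // the suggested size here is typically 128 or larger.
      printf("suggested block size: %d, minimum grid size: %d\n", block, min_grid);
      return 0;
  }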
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               6.80
  SM Frequency                             cycle/nsecond               1.05
  Elapsed Cycles                           cycle                      5,627
  Memory [%]                               %                           6.37
  DRAM Throughput                          %                           4.23
  Duration                                 usecond                     5.34
  L1/TEX Cache Throughput                  %                           8.10
  L2 Cache Throughput                      %                           6.37
  SM Active Cycles                         cycle                   3,136.24
  Compute (SM) [%]                         %                          10.47

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.3
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             60
  Registers Per Thread                     register/thread               28
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    30,720
  Waves Per SM                                                         0.29

  WRN   The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          4
  Block Limit Shared Mem                   block                          7
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          33.06
  Achieved Active Warps Per SM             warp                       15.87

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (33.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.30
  SM Frequency                             cycle/nsecond               1.29
  Elapsed Cycles                           cycle                      4,266
  Memory [%]                               %                           1.27
  DRAM Throughput                          %                           0.71
  Duration                                 usecond                     3.30
  L1/TEX Cache Throughput                  %                           9.67
  L2 Cache Throughput                      %                           1.27
  SM Active Cycles                         cycle                     506.19
  Compute (SM) [%]                         %                           1.15

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             16
  Registers Per Thread                     register/thread               21
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     2,048
  Waves Per SM                                                         0.02

  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         21
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                           8.22
  Achieved Active Warps Per SM             warp                        3.94

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:30, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.14
  SM Frequency                             cycle/nsecond               1.40
  Elapsed Cycles                           cycle                     11,988
  Memory [%]                               %                          31.83
  DRAM Throughput                          %                          31.83
  Duration                                 usecond                     8.51
  L1/TEX Cache Throughput                  %                          22.47
  L2 Cache Throughput                      %                          14.81
  SM Active Cycles                         cycle                      9,649
  Compute (SM) [%]                         %                          24.98

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                          1,024
  Registers Per Thread                     register/thread               45
  Shared Memory Configuration Size         Kbyte                      65.54
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           Kbyte/block                 2.56
  Threads                                  thread                   131,072
  Waves Per SM                                                         1.51

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.5%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         10
  Block Limit Shared Mem                   block                         18
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          40
  Theoretical Occupancy                    %                          83.33
  Achieved Occupancy                       %                          62.06
  Achieved Active Warps Per SM             warp                       29.79

  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
        The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.1%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
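  [Editor's note] The gemv entry above is register-limited: 45 registers per thread cap theoretical occupancy
  at 83.3%. That kernel lives inside cuBLAS and cannot be changed, but for kernels you author the same limit
  can be nudged with __launch_bounds__, which asks the compiler to keep register use low enough for a target
  number of resident blocks. The sketch below is purely illustrative (my_gemv_like_kernel is not the cuBLAS
  kernel) and assumes a 64K-register SM.

  // Asking for 128-thread blocks with at least 12 resident blocks per SM tells the
  // compiler to aim for <= 65536 / (128 * 12) = 42 registers per thread. It may spill
  // to local memory to get there, so this is a trade-off to measure, not a guaranteed win.
  __global__ void __launch_bounds__(128, 12)
  my_gemv_like_kernel(const float* __restrict__ A,
                      const float* __restrict__ x,
                      float* __restrict__ y, int rows, int cols)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= rows) return;

      float acc = 0.f;
      for (int c = 0; c < cols; ++c)
          acc += A[row * cols + c] * x[c];   // row-major A, one row per thread (illustrative only)
      y[row] = acc;
  }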
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:30, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.76
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                      5,227
  Memory [%]                               %                           3.17
  DRAM Throughput                          %                           3.04
  Duration                                 usecond                     4.35
  L1/TEX Cache Throughput                  %                           9.62
  L2 Cache Throughput                      %                           3.13
  SM Active Cycles                         cycle                   1,722.03
  Compute (SM) [%]                         %                          11.15

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            272
  Registers Per Thread                     register/thread               24
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   139,264
  Waves Per SM                                                         1.33

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 37.9%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          62.14
  Achieved Active Warps Per SM             warp                       29.83

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (62.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:30, Context 1, Stream 35

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.66
  SM Frequency                             cycle/nsecond               1.35
  Elapsed Cycles                           cycle                     11,965
  Memory [%]                               %                           4.42
  DRAM Throughput                          %                           3.34
  Duration                                 usecond                     8.86
  L1/TEX Cache Throughput                  %                          25.55
  L2 Cache Throughput                      %                           4.42
  SM Active Cycles                         cycle                     897.94
  Compute (SM) [%]                         %                           1.89

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                              8
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     1,024
  Waves Per SM                                                         0.06

  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.30
  Achieved Active Warps Per SM             warp                        3.98

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:30, Context 1, Stream 36

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.27
  SM Frequency                             cycle/nsecond               1.28
  Elapsed Cycles                           cycle                      8,307
  Memory [%]                               %                           3.43
  DRAM Throughput                          %                           1.92
  Duration                                 usecond                     6.46
  L1/TEX Cache Throughput                  %                          27.38
  L2 Cache Throughput                      %                           3.43
  SM Active Cycles                         cycle                     577.85
  Compute (SM) [%]                         %                           1.42

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                              8
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     1,024
  Waves Per SM                                                         0.06

  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.34
  Achieved Active Warps Per SM             warp                        4.00

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:31, Context 1, Stream 36

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.19
  SM Frequency                             cycle/nsecond               1.43
  Elapsed Cycles                           cycle                      4,979
  Memory [%]                               %                           9.01
  DRAM Throughput                          %                           9.01
  Duration                                 usecond                     3.49
  L1/TEX Cache Throughput                  %                          10.79
  L2 Cache Throughput                      %                           8.04
  SM Active Cycles                         cycle                   1,325.50
  Compute (SM) [%]                         %                           2.95

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           256
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             32
  Registers Per Thread                     register/thread               30
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     8,192
  Waves Per SM                                                         0.08

  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          8
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          6
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          16.21
  Achieved Active Warps Per SM             warp                        7.78

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.42
  SM Frequency                             cycle/nsecond               1.30
  Elapsed Cycles                           cycle                      5,537
  Memory [%]                               %                           4.43
  DRAM Throughput                          %                           4.01
  Duration                                 usecond                     4.26
  L1/TEX Cache Throughput                  %                          12.45
  L2 Cache Throughput                      %                           4.04
  SM Active Cycles                         cycle                   1,965.16
  Compute (SM) [%]                         %                          14.82

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            408
  Registers Per Thread                     register/thread               18
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   208,896
  Waves Per SM                                                            2

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          66.38
  Achieved Active Warps Per SM             warp                       31.86

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (66.4%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.48
  SM Frequency                             cycle/nsecond               1.29
  Elapsed Cycles                           cycle                     21,556
  Memory [%]                               %                          73.22
  DRAM Throughput                          %                          49.62
  Duration                                 usecond                    16.61
  L1/TEX Cache Throughput                  %                          74.44
  L2 Cache Throughput                      %                          73.22
  SM Active Cycles                         cycle                  18,925.29
  Compute (SM) [%]                         %                          45.04

  WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify
        the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing
        the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel
        fusion) or whether there are values you can (re)compute.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                          5,419
  Registers Per Thread                     register/thread               20
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   346,816
  Waves Per SM                                                         4.98

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                          54.33
  Achieved Active Warps Per SM             warp                       26.08

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.3%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.10
  SM Frequency                             cycle/nsecond               1.40
  Elapsed Cycles                           cycle                     96,961
  Memory [%]                               %                          37.15
  DRAM Throughput                          %                          37.15
  Duration                                 usecond                    69.12
  L1/TEX Cache Throughput                  %                          28.94
  L2 Cache Throughput                      %                          26.52
  SM Active Cycles                         cycle                  61,675.54
  Compute (SM) [%]                         %                          18.52

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             88
  Registers Per Thread                     register/thread              250
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                98.30
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                    11,264
  Waves Per SM                                                         1.29

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor.
        This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          2
  Block Limit Shared Mem                   block                          1
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           4
  Theoretical Occupancy                    %                           8.33
  Achieved Occupancy                       %                           8.24
  Achieved Active Warps Per SM             warp                        3.96

  WRN   This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:32, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.88
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                     19,839
  Memory [%]                               %                          53.74
  DRAM Throughput                          %                          53.74
  Duration                                 usecond                    16.38
  L1/TEX Cache Throughput                  %                          15.01
  L2 Cache Throughput                      %                          24.61
  SM Active Cycles                         cycle                  17,002.96
  Compute (SM) [%]                         %                          36.82

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.6
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            128
  Registers Per Thread                     register/thread               40
  Shared Memory Configuration Size         Kbyte                      32.77
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                 8.19
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    65,536
  Waves Per SM                                                         0.63

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          3
  Block Limit Shared Mem                   block                          3
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          60.83
  Achieved Active Warps Per SM             warp                       29.20

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (60.8%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
          for more details on optimizing occupancy.

  void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:32, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    8.29
    SM Frequency                                            cycle/nsecond                    1.29
    Elapsed Cycles                                                  cycle                   5,240
    Memory [%]                                                          %                    0.98
    DRAM Throughput                                                     %                    0.62
    Duration                                                      usecond                    4.06
    L1/TEX Cache Throughput                                             %                    1.19
    L2 Cache Throughput                                                 %                    0.98
    SM Active Cycles                                                cycle                1,268.40
    Compute (SM) [%]                                                    %                    1.32
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                128
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                  32
    Registers Per Thread                                  register/thread                      32
    Shared Memory Configuration Size                                Kbyte                   16.38
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                   4,096
    Waves Per SM                                                                             0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                      16
    Block Limit Shared Mem                                          block                      16
    Block Limit Warps                                               block                      12
    Theoretical Active Warps per SM                                  warp                      48
    Theoretical Occupancy                                               %                     100
    Achieved Occupancy                                                  %                    8.26
    Achieved Active Warps Per SM                                     warp                    3.96
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.3%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
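The warning above flags a 32-block grid on a 68-SM GPU. As a rough illustration (not taken from the profiled PyTorch code), the host-side CUDA sketch below queries the multiprocessor count and pads the grid so at least one block can land on every SM; the kernel name, data, and problem size are placeholders, and a grid-stride loop keeps any extra blocks harmless.

#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder element-wise kernel; a grid-stride loop lets any grid size
// cover all n elements, so oversizing the grid is safe.
__global__ void scale_kernel(float* data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= 2.0f;
    }
}

int main() {
    int device = 0, smCount = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);  // 68 on the GPU above

    const int n = 4096;                                   // mirrors the 4,096 threads in the report
    const int blockSize = 128;
    int gridFromWork = (n + blockSize - 1) / blockSize;   // 32 blocks -> fewer blocks than SMs
    int grid = std::max(gridFromWork, smCount);           // at least one block per SM

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    scale_kernel<<<grid, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("launched %d blocks across %d SMs\n", grid, smCount);
    return 0;
}

For a tiny workload like this one the launch-overhead savings are negligible; the pattern mainly matters when the kernel is launched many times or when the grid can be grown to cover real work.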
  void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:32, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    8.38
    SM Frequency                                            cycle/nsecond                    1.30
    Elapsed Cycles                                                  cycle                   4,791
    Memory [%]                                                          %                    1.23
    DRAM Throughput                                                     %                    0.84
    Duration                                                      usecond                    3.68
    L1/TEX Cache Throughput                                             %                    2.88
    L2 Cache Throughput                                                 %                    1.23
    SM Active Cycles                                                cycle                  428.28
    Compute (SM) [%]                                                    %                    1.18
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                256
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                  16
    Registers Per Thread                                  register/thread                      28
    Shared Memory Configuration Size                                Kbyte                    8.19
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                   4,096
    Waves Per SM                                                                             0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                       8
    Block Limit Shared Mem                                          block                       8
    Block Limit Warps                                               block                       6
    Theoretical Active Warps per SM                                  warp                      48
    Theoretical Occupancy                                               %                     100
    Achieved Occupancy                                                  %                   16.54
    Achieved Active Warps Per SM                                     warp                    7.94
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
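Several reports above show per-SM block limits set by registers, shared memory, and warps. Those "Block Limit" rows can be approximated outside the profiler with the CUDA occupancy API; the sketch below is a generic, hedged example against a placeholder kernel, not one of the PyTorch kernels listed here.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose register/shared-memory footprint the occupancy
// calculator will inspect.
__global__ void copy_half_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

int main() {
    // How many 256-thread blocks of this kernel can be resident per SM,
    // analogous to the smallest of the "Block Limit ..." rows above.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copy_half_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    // Block size the occupancy calculator itself would suggest for this kernel.
    int minGridSize = 0, suggestedBlock = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlock, copy_half_kernel, 0, 0);

    printf("resident blocks/SM at 256 threads: %d, suggested block size: %d\n",
           blocksPerSM, suggestedBlock);
    return 0;
}

Note that the API reports limits for the kernel as compiled in this program; the exact figures in the reports above depend on the PyTorch kernels' own register and shared-memory usage.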
  void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:33, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    7.36
    SM Frequency                                            cycle/nsecond                    1.12
    Elapsed Cycles                                                  cycle                   7,276
    Memory [%]                                                          %                   35.26
    DRAM Throughput                                                     %                    2.57
    Duration                                                      usecond                    6.50
    L1/TEX Cache Throughput                                             %                   37.75
    L2 Cache Throughput                                                 %                   35.26
    SM Active Cycles                                                cycle                4,770.31
    Compute (SM) [%]                                                    %                   30.43
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                 64
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                 960
    Registers Per Thread                                  register/thread                      20
    Shared Memory Configuration Size                                Kbyte                   16.38
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                  61,440
    Waves Per SM                                                                             0.88
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                      42
    Block Limit Shared Mem                                          block                      16
    Block Limit Warps                                               block                      24
    Theoretical Active Warps per SM                                  warp                      32
    Theoretical Occupancy                                               %                   66.67
    Achieved Occupancy                                                  %                   47.21
    Achieved Active Warps Per SM                                     warp                   22.66
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (47.2%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
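The "Waves Per SM" value in this last report appears consistent with grid size divided by (resident blocks per SM x SM count): 960 / (16 x 68) is roughly 0.88. The small host-side check below redoes that arithmetic with the numbers taken from the report; the formula itself is an inference from the figures, not something the profiler output states, and the 68-SM count comes from the launch warnings earlier in the log.

#include <cstdio>

int main() {
    // Values copied from the report for the 64-thread-block kernel above.
    const double gridSize   = 960.0;  // "Grid Size"
    const double blockLimit = 16.0;   // smallest of the "Block Limit ..." rows
    const double smCount    = 68.0;   // multiprocessor count reported for this GPU

    // Assumed relation: one "wave" is one full set of resident blocks across all SMs.
    const double wavesPerSM = gridSize / (blockLimit * smCount);
    printf("waves per SM ~= %.2f\n", wavesPerSM);  // ~0.88, matching the report
    return 0;
}

The same relation also matches the earlier reports (for example, 5,419 / (16 x 68) gives about 4.98), which is why a sub-1.0 value here lines up with the "grid is too small" warning.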