==PROF== Connected to process 767673 (/usr/bin/python3.8)
[==PROF== per-launch progress: 300 kernel launches profiled, 9 passes each ("0%....50%....100% - 9 passes" per launch). Kernel names seen, in launch order: vectorized_elementwise_kernel, distribution_elementwise_grid..., indexSelectLargeIndex, fused_dropout_kernel_vec, transpose_readWrite_alignment..., Kernel, GRU_elementWise_fp, CatArrayBatchedCopy, unrolled_elementwise_kernel, ampere_sgemm_32x32_sliced1x4_tn, reduce_kernel, softmax_warp_forward, gemv2N_kernel]
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

[767673] python3.8@127.0.0.1
  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:13, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.71
    SM Frequency                        cycle/nsecond            1.20
    Elapsed Cycles                      cycle                   2,990
    Memory [%]                          %                        4.82
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.50
    L1/TEX Cache Throughput             %                        8.10
    L2 Cache Throughput                 %                        4.82
    SM Active Cycles                    cycle                  878.72
    Compute (SM) [%]                    %                        0.82
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     147
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  9,408
    Waves Per SM                                                 0.14
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        8.37
    Achieved Active Warps Per SM        warp                     4.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:13, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.56
    SM Frequency                        cycle/nsecond            1.18
    Elapsed Cycles                      cycle                   3,018
    Memory [%]                          %                        5.43
    DRAM Throughput                     %                        0.02
    Duration                            usecond                  2.56
    L1/TEX Cache Throughput             %                        8.97
    L2 Cache Throughput                 %                        5.43
    SM Active Cycles                    cycle                  912.01
    Compute (SM) [%]                    %                        0.93
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.2 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     170
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 10,880
    Waves Per SM                                                 0.16
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        9.31
    Achieved Active Warps Per SM        warp                     4.47
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (9.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
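As a cross-check on the two FillFunctor reports above, the occupancy and wave figures follow directly from the launch statistics. The short sketch below is not ncu output; it is plain Python, and the 68-SM count and the 48-warps-per-SM baseline are assumptions read off the report's own warnings and percentages.

    # Re-derive the reported numbers from the launch statistics above. Assumes the device
    # limits implied by the report: 68 SMs, 48 resident warps per SM at 100% occupancy.
    NUM_SMS = 68
    MAX_WARPS_PER_SM = 48

    def occupancy_and_waves(grid_size, block_size, block_limits):
        warps_per_block = block_size // 32
        blocks_per_sm = min(block_limits)             # tightest of the reported "Block Limit" rows
        theoretical_warps = blocks_per_sm * warps_per_block
        theoretical_occupancy = 100.0 * theoretical_warps / MAX_WARPS_PER_SM
        waves_per_sm = grid_size / (blocks_per_sm * NUM_SMS)
        return theoretical_occupancy, waves_per_sm

    # First FillFunctor launch: grid 147, block 64, block limits 16/64/16/24
    print(occupancy_and_waves(147, 64, [16, 64, 16, 24]))   # (66.67, 0.135) -> "0.14 Waves Per SM"
    # Second FillFunctor launch: grid 170, block 64
    print(occupancy_and_waves(170, 64, [16, 64, 16, 24]))   # (66.67, 0.156) -> "0.16 Waves Per SM"

At roughly a seventh of a wave, these launches are dominated by launch latency rather than by anything in the kernel body, which is what the "grid is too small" warning is pointing at.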
  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            9.07
    SM Frequency                        cycle/nsecond            1.35
    Elapsed Cycles                      cycle                  11,550
    Memory [%]                          %                       17.27
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  8.51
    L1/TEX Cache Throughput             %                       19.71
    L2 Cache Throughput                 %                       17.27
    SM Active Cycles                    cycle                9,466.43
    Compute (SM) [%]                    %                       55.69
    ---------------------------------- ---------------- ------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                     408
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                104,448
    Waves Per SM                                                    1
    ---------------------------------- ---------------- ------------

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       71.10
    Achieved Active Warps Per SM        warp                    34.13
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
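For orientation: the demangled name of this kernel shows it is instantiated through at::native::distribution_nullary_kernel and at::native::templates::cuda::normal_kernel, i.e. a normal-distribution fill, and the FillFunctor launches around it are plain constant fills. A hypothetical snippet of the kind of PyTorch code that emits exactly these two kernel families (not the actual profiled script; sizes are arbitrary):

    # Hypothetical reproduction, not the profiled script: a normal_ fill and a constant fill
    # on CUDA tensors launch the two kernel families reported above.
    import torch

    w = torch.empty(256, 408, device="cuda")
    w.normal_()                    # -> distribution_elementwise_grid_stride_kernel (normal_kernel path)
    b = torch.empty(4096, device="cuda")
    b.fill_(0.0)                   # -> vectorized_elementwise_kernel<FillFunctor>
    torch.cuda.synchronize()

If this initialization traffic is not of interest, ncu's launch skip/count filters can restrict profiling to the later, model-related launches.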
  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.77
    SM Frequency                        cycle/nsecond            1.36
    Elapsed Cycles                      cycle                   4,057
    Memory [%]                          %                        1.20
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.98
    L1/TEX Cache Throughput             %                        3.75
    L2 Cache Throughput                 %                        1.20
    SM Active Cycles                    cycle                  753.69
    Compute (SM) [%]                    %                        5.77
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      24
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.46
    Achieved Active Warps Per SM        warp                     7.90
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.93
    SM Frequency                        cycle/nsecond            1.38
    Elapsed Cycles                      cycle                   4,188
    Memory [%]                          %                        1.35
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  3.04
    L1/TEX Cache Throughput             %                        3.80
    L2 Cache Throughput                 %                        1.32
    SM Active Cycles                    cycle                1,486.41
    Compute (SM) [%]                    %                       11.17
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      48
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 12,288
    Waves Per SM                                                 0.12
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.63
    Achieved Active Warps Per SM        warp                     7.98
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.42
    SM Frequency                        cycle/nsecond            1.15
    Elapsed Cycles                      cycle                   2,762
    Memory [%]                          %                        0.68
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.40
    L1/TEX Cache Throughput             %                       66.16
    L2 Cache Throughput                 %                        0.68
    SM Active Cycles                    cycle                   13.60
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                       1
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                     64
    Waves Per SM                                                 0.00
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        4.21
    Achieved Active Warps Per SM        warp                     2.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:14, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.24
    SM Frequency                        cycle/nsecond            1.14
    Elapsed Cycles                      cycle                   2,765
    Memory [%]                          %                        0.69
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.43
    L1/TEX Cache Throughput             %                       65.38
    L2 Cache Throughput                 %                        0.69
    SM Active Cycles                    cycle                   13.76
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                     64
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                       1
    Registers Per Thread                register/thread            16
    Shared Memory Configuration Size    Kbyte                   16.38
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                     64
    Waves Per SM                                                 0.00
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                      64
    Block Limit Shared Mem              block                      16
    Block Limit Warps                   block                      24
    Theoretical Active Warps per SM     warp                       32
    Theoretical Occupancy               %                       66.67
    Achieved Occupancy                  %                        4.12
    Achieved Active Warps Per SM        warp                     1.98
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.74
    SM Frequency                        cycle/nsecond            1.36
    Elapsed Cycles                      cycle                   4,039
    Memory [%]                          %                        1.20
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  2.98
    L1/TEX Cache Throughput             %                        3.80
    L2 Cache Throughput                 %                        1.20
    SM Active Cycles                    cycle                  743.13
    Compute (SM) [%]                    %                        5.79
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      24
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors.
          This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.72
    Achieved Active Warps Per SM        warp                     8.02
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::distribution_elementwise_grid_stride_kernel<... at::native::distribution_nullary_kernel ... at::native::templates::cuda::normal_kernel ...>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            8.63
    SM Frequency                        cycle/nsecond            1.33
    Elapsed Cycles                      cycle                   4,087
    Memory [%]                          %                        1.38
    DRAM Throughput                     %                        0.01
    Duration                            usecond                  3.07
    L1/TEX Cache Throughput             %                        3.78
    L2 Cache Throughput                 %                        1.35
    SM Active Cycles                    cycle                1,495.85
    Compute (SM) [%]                    %                       11.44
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
    Section: Launch Statistics
    ---------------------------------- ---------------- ------------
    Block Size                                                    256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                                      48
    Registers Per Thread                register/thread            27
    Shared Memory Configuration Size    Kbyte                    8.19
    Driver Shared Memory Per Block      Kbyte/block              1.02
    Dynamic Shared Memory Per Block     byte/block                  0
    Static Shared Memory Per Block      byte/block                  0
    Threads                             thread                 12,288
    Waves Per SM                                                 0.12
    ---------------------------------- ---------------- ------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
          See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------- ---------------- ------------
    Block Limit SM                      block                      16
    Block Limit Registers               block                       8
    Block Limit Shared Mem              block                       8
    Block Limit Warps                   block                       6
    Theoretical Active Warps per SM     warp                       48
    Theoretical Occupancy               %                         100
    Achieved Occupancy                  %                       16.50
    Achieved Active Warps Per SM        warp                     7.92
    ---------------------------------- ---------------- ------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:15, Context 1, Stream 7

    Section: GPU Speed Of Light Throughput
    ---------------------------------- ---------------- ------------
    DRAM Frequency                      cycle/nsecond            7.48
    SM Frequency                        cycle/nsecond            1.18
    Elapsed Cycles                      cycle                   2,787
    Memory [%]                          %                        0.68
    DRAM Throughput                     %                        0.02
    Duration                            usecond                  2.37
    L1/TEX Cache Throughput             %                       66.09
    L2 Cache Throughput                 %                        0.68
    SM Active Cycles                    cycle                   13.62
    Compute (SM) [%]                    %                        0.01
    ---------------------------------- ---------------- ------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.52
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,775
    Memory [%]                  %               0.68
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               66.16
    L2 Cache Throughput         %               0.68
    SM Active Cycles            cycle           13.60
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
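The occupancy numbers in the 64-thread fill reports above can be reproduced by hand. A small CUDA C++ host-code sketch, assuming the device allows at most 48 resident warps per SM (implied by 32 theoretical active warps corresponding to 66.67%); every other input is taken directly from the report:

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Values from the Occupancy section above (64-thread FillFunctor launch).
        const int blockLimitSM        = 16;
        const int blockLimitRegisters = 64;
        const int blockLimitSharedMem = 16;
        const int blockLimitWarps     = 24;
        const int warpsPerBlock       = 64 / 32;   // Block Size 64 -> 2 warps
        const int maxWarpsPerSM       = 48;        // assumed device limit (32 / 0.6667)

        int residentBlocks = std::min({blockLimitSM, blockLimitRegisters,
                                       blockLimitSharedMem, blockLimitWarps});   // 16
        int theoreticalWarps  = residentBlocks * warpsPerBlock;                  // 32
        double theoreticalOcc = 100.0 * theoreticalWarps / maxWarpsPerSM;        // 66.67 %
        double achievedOcc    = 100.0 * 2.00 / maxWarpsPerSM;                    // ~4.17 % (report: 4.16 %)

        printf("theoretical %.2f %%, achieved %.2f %%\n", theoreticalOcc, achievedOcc);
        return 0;
    }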
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.77
    SM Frequency                cycle/nsecond   1.36
    Elapsed Cycles              cycle           4,057
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               3.77
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           997.71
    Compute (SM) [%]            %               7.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           32
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            8,192
    Waves Per SM                                        0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.60
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.30
    SM Frequency                cycle/nsecond   1.14
    Elapsed Cycles              cycle           2,738
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.40
    L1/TEX Cache Throughput     %               67.92
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.25
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.12
    Achieved Active Warps Per SM      warp    1.98
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   9.04
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           11,581
    Memory [%]                  %               17.25
    DRAM Throughput             %               0.01
    Duration                    usecond         8.54
    L1/TEX Cache Throughput     %               19.64
    L2 Cache Throughput         %               17.25
    SM Active Cycles            cycle           9,456.69
    Compute (SM) [%]            %               55.49
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           408
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            104,448
    Waves Per SM                                        1

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       70.97
    Achieved Active Warps Per SM      warp    34.07
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.74
    SM Frequency                cycle/nsecond   1.36
    Elapsed Cycles              cycle           4,046
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               3.79
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           744.32
    Compute (SM) [%]            %               5.79
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           24
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            6,144
    Waves Per SM                                        0.06
    WRN  The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
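For the 256-thread launches, the per-resource block limits shown in the Occupancy sections (SM 16, Registers 8, Shared Mem 8, Warps 6) can be approximated from the launch statistics. A rough CUDA C++ sketch; the register file size, register allocation granularity, maximum resident blocks, and maximum warps per SM used below are typical values assumed for this class of GPU and are not stated anywhere in this report:

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Launch statistics from the 256-thread reports.
        const int blockSize       = 256;
        const int regsPerThread   = 27;
        const int smemConfigBytes = 8192;   // "Shared Memory Configuration Size 8.19 Kbyte"
        const int smemPerBlock    = 1024;   // "Driver Shared Memory Per Block 1.02 Kbyte/block"

        // Assumed hardware limits (not taken from the report).
        const int maxBlocksPerSM  = 16;
        const int maxWarpsPerSM   = 48;
        const int regsPerSM       = 65536;
        const int regAllocGranule = 256;    // registers assumed allocated per warp in chunks of 256

        int warpsPerBlock  = blockSize / 32;                                   // 8
        int regsPerWarp    = ((regsPerThread * 32 + regAllocGranule - 1)
                              / regAllocGranule) * regAllocGranule;            // 864 -> 1024
        int limitRegisters = regsPerSM / (regsPerWarp * warpsPerBlock);        // 8
        int limitSharedMem = smemConfigBytes / smemPerBlock;                   // 8
        int limitWarps     = maxWarpsPerSM / warpsPerBlock;                    // 6
        int limitSM        = maxBlocksPerSM;                                   // 16

        int residentBlocks = std::min({limitSM, limitRegisters, limitSharedMem, limitWarps});
        // 6 resident blocks * 8 warps = 48 warps, i.e. 100 % theoretical occupancy, as reported.
        printf("block limits: SM %d, registers %d, shared mem %d, warps %d -> %d resident blocks\n",
               limitSM, limitRegisters, limitSharedMem, limitWarps, residentBlocks);
        return 0;
    }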
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.58
    Achieved Active Warps Per SM      warp    7.96
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.72
    SM Frequency                cycle/nsecond   1.34
    Elapsed Cycles              cycle           4,090
    Memory [%]                  %               1.38
    DRAM Throughput             %               0.01
    Duration                    usecond         3.04
    L1/TEX Cache Throughput     %               3.80
    L2 Cache Throughput         %               1.35
    SM Active Cycles            cycle           1,487.81
    Compute (SM) [%]            %               11.45
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           48
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            12,288
    Waves Per SM                                        0.12
    WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.58
    Achieved Active Warps Per SM      warp    7.96
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:16, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.50
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,746
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               66.09
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.62
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
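The warning above repeatedly suggests giving every multiprocessor at least one block. A minimal CUDA C++ sketch of sizing a grid to one full wave with the occupancy API; dummy_kernel is a hypothetical stand-in, not one of the PyTorch kernels profiled here:

    #include <cstdio>

    __global__ void dummy_kernel(float *out, int n) {
        // Placeholder body; stands in for whatever work the real kernel does.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.f;
    }

    int main() {
        int device = 0, numSMs = 0, blocksPerSM = 0;
        const int blockSize = 256;

        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);   // 68 on this GPU
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel,
                                                      blockSize, /*dynamicSmem=*/0);

        // One full wave: every SM gets as many blocks as can be resident at once.
        int grid = numSMs * blocksPerSM;
        printf("launching %d blocks (%d SMs x %d blocks/SM)\n", grid, numSMs, blocksPerSM);

        float *out = nullptr;
        int n = grid * blockSize;
        cudaMalloc(&out, n * sizeof(float));
        dummy_kernel<<<grid, blockSize>>>(out, n);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }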
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.15
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.44
    SM Frequency                cycle/nsecond   1.16
    Elapsed Cycles              cycle           2,748
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               65.45
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.75
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.51
    Achieved Active Warps Per SM      warp    2.17
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.68
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           4,048
    Memory [%]                  %               1.20
    DRAM Throughput             %               0
    Duration                    usecond         3.01
    L1/TEX Cache Throughput     %               3.79
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           744.22
    Compute (SM) [%]            %               5.77
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           24
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            6,144
    Waves Per SM                                        0.06
    WRN  The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors.
         This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.68
    Achieved Active Warps Per SM      warp    8.01
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.60
    SM Frequency                cycle/nsecond   1.33
    Elapsed Cycles              cycle           4,083
    Memory [%]                  %               1.39
    DRAM Throughput             %               0.01
    Duration                    usecond         3.07
    L1/TEX Cache Throughput     %               3.77
    L2 Cache Throughput         %               1.35
    SM Active Cycles            cycle           1,495.96
    Compute (SM) [%]            %               11.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
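The name distribution_elementwise_grid_stride_kernel indicates a grid-stride loop, which is why a launch of only 24 or 48 blocks still covers every element of the tensor while leaving most of the 68 SMs idle. A generic sketch of the pattern, not PyTorch's actual implementation:

    #include <cstdio>

    // Generic grid-stride pattern: each thread handles indices i, i + stride, i + 2*stride, ...
    // so correctness does not depend on the grid being large enough for one thread per element.
    __global__ void scale_grid_stride(float *data, float factor, int n) {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] *= factor;
        }
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;
        cudaMalloc(&data, n * sizeof(float));
        cudaMemset(data, 0, n * sizeof(float));

        // A deliberately small grid (48 blocks of 256 threads, as in the reports above)
        // still processes all n elements; it just underutilizes the 68 SMs.
        scale_grid_stride<<<48, 256>>>(data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }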
  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           48
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            12,288
    Waves Per SM                                        0.12
    WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.61
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.44
    SM Frequency                cycle/nsecond   1.16
    Elapsed Cycles              cycle           2,755
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.37
    L1/TEX Cache Throughput     %               65.38
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.76
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:17, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.63
    SM Frequency                cycle/nsecond   1.18
    Elapsed Cycles              cycle           2,754
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.02
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               66.23
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.59
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    2.00
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
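For the single-block, 64-thread fill launches above, the block size itself caps theoretical occupancy at 66.7%. A CUDA C++ sketch that lets the runtime suggest a block size via cudaOccupancyMaxPotentialBlockSize; fill_value is a hypothetical stand-in for at::native::FillFunctor:

    #include <cstdio>

    __global__ void fill_value(float *out, float value, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = value;
    }

    int main() {
        int minGridSize = 0, blockSize = 0;

        // Ask the runtime which block size maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, fill_value, 0, 0);
        printf("suggested block size %d, minimum grid for full occupancy %d\n",
               blockSize, minGridSize);

        const int n = 1 << 20;
        float *out = nullptr;
        cudaMalloc(&out, n * sizeof(float));

        int grid = (n + blockSize - 1) / blockSize;
        fill_value<<<grid, blockSize>>>(out, 1.0f, n);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }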
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.59
    SM Frequency                cycle/nsecond   1.34
    Elapsed Cycles              cycle           4,076
    Memory [%]                  %               1.20
    DRAM Throughput             %               0.01
    Duration                    usecond         3.04
    L1/TEX Cache Throughput     %               3.80
    L2 Cache Throughput         %               1.20
    SM Active Cycles            cycle           992
    Compute (SM) [%]            %               7.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           32
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            8,192
    Waves Per SM                                        0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.81
    Achieved Active Warps Per SM      warp    8.07
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.45
    SM Frequency                cycle/nsecond   1.15
    Elapsed Cycles              cycle           2,699
    Memory [%]                  %               0.70
    DRAM Throughput             %               0.01
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               67.18
    L2 Cache Throughput         %               0.70
    SM Active Cycles            cycle           13.40
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   64
    Block Limit Shared Mem            block   16
    Block Limit Warps                 block   24
    Theoretical Active Warps per SM   warp    32
    Theoretical Occupancy             %       66.67
    Achieved Occupancy                %       4.16
    Achieved Active Warps Per SM      warp    1.99
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   8.74
    SM Frequency                cycle/nsecond   1.35
    Elapsed Cycles              cycle           4,034
    Memory [%]                  %               1.19
    DRAM Throughput             %               0.01
    Duration                    usecond         2.98
    L1/TEX Cache Throughput     %               4.89
    L2 Cache Throughput         %               1.19
    SM Active Cycles            cycle           186
    Compute (SM) [%]            %               1.45
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          256
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           6
    Registers Per Thread              register/thread   27
    Shared Memory Configuration Size  Kbyte             8.19
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            1,536
    Waves Per SM                                        0.01
    WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                    block   16
    Block Limit Registers             block   8
    Block Limit Shared Mem            block   8
    Block Limit Warps                 block   6
    Theoretical Active Warps per SM   warp    48
    Theoretical Occupancy             %       100
    Achieved Occupancy                %       16.61
    Achieved Active Warps Per SM      warp    7.97
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:18, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
    DRAM Frequency              cycle/nsecond   7.45
    SM Frequency                cycle/nsecond   1.17
    Elapsed Cycles              cycle           2,737
    Memory [%]                  %               0.69
    DRAM Throughput             %               0.01
    Duration                    usecond         2.34
    L1/TEX Cache Throughput     %               67.25
    L2 Cache Throughput         %               0.69
    SM Active Cycles            cycle           13.38
    Compute (SM) [%]            %               0.01
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                          64
    Function Cache Configuration                        cudaFuncCachePreferNone
    Grid Size                                           1
    Registers Per Thread              register/thread   16
    Shared Memory Configuration Size  Kbyte             16.38
    Driver Shared Memory Per Block    Kbyte/block       1.02
    Dynamic Shared Memory Per Block   byte/block        0
    Static Shared Memory Per Block    byte/block        0
    Threads                           thread            64
    Waves Per SM                                        0.00
    WRN  The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.04
    Achieved Active Warps Per SM        warp                 1.94
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.74
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,029
    Memory [%]                          %                    1.19
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.98
    L1/TEX Cache Throughput             %                    4.86
    L2 Cache Throughput                 %                    1.19
    SM Active Cycles                    cycle                187.18
    Compute (SM) [%]                    %                    1.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                6
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,536
    Waves Per SM                                             0.01
  WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.91
    Achieved Active Warps Per SM        warp                 8.12
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.36
    SM Frequency                        cycle/nsecond        1.16
    Elapsed Cycles                      cycle                2,703
    Memory [%]                          %                    0.82
    DRAM Throughput                     %                    0.02
    Duration                            usecond              2.34
    L1/TEX Cache Throughput             %                    67.48
    L2 Cache Throughput                 %                    0.82
    SM Active Cycles                    cycle                13.34
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.06
    Achieved Active Warps Per SM        warp                 1.95
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.04
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                8,385
    Memory [%]                          %                    13.66
    DRAM Throughput                     %                    0.00
    Duration                            usecond              6.18
    L1/TEX Cache Throughput             %                    15.34
    L2 Cache Throughput                 %                    13.66
    SM Active Cycles                    cycle                6,309.54
    Compute (SM) [%]                    %                    48.26
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    71.84
    Achieved Active Warps Per SM        warp                 34.48
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.80
    SM Frequency                        cycle/nsecond        1.50
    Elapsed Cycles                      cycle                5,344
    Memory [%]                          %                    2.65
    DRAM Throughput                     %                    0.00
    Duration                            usecond              3.55
    L1/TEX Cache Throughput             %                    5.35
    L2 Cache Throughput                 %                    2.20
    SM Active Cycles                    cycle                2,638.31
    Compute (SM) [%]                    %                    21.78
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                120
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               30,720
    Waves Per SM                                             0.29
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 block per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    28.34
    Achieved Active Warps Per SM        warp                 13.60
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.63
    SM Frequency                        cycle/nsecond        1.33
    Elapsed Cycles                      cycle                4,091
    Memory [%]                          %                    1.38
    DRAM Throughput                     %                    0.01
    Duration                            usecond              3.07
    L1/TEX Cache Throughput             %                    3.77
    L2 Cache Throughput                 %                    1.34
    SM Active Cycles                    cycle                1,499.13
    Compute (SM) [%]                    %                    11.44
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
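Note: the __syncthreads() advisory above applies to barrier-heavy kernels. The sketch below is a generic shared-memory block reduction, not the profiled PyTorch kernel, launched with at least two resident blocks per multiprocessor so that blocks not waiting at a barrier can keep each SM busy; block_sum and launch_block_sum are hypothetical names.

    // Sketch only: barrier-synchronized block reduction with >= 2 blocks per SM.
    #include <cuda_runtime.h>
    #include <algorithm>

    __global__ void block_sum(const float *in, float *block_out, int n) {
        extern __shared__ float smem[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        smem[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // every thread in the block waits here

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) smem[tid] += smem[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_out[blockIdx.x] = smem[0];   // one partial sum per block
    }

    void launch_block_sum(const float *in, float *block_out, int n) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        const int block = 256;
        int grid = (n + block - 1) / block;
        // Aim for more than one block per multiprocessor, as the warning suggests;
        // block_out must have room for `grid` partial sums.
        grid = std::max(grid, 2 * prop.multiProcessorCount);
        block_sum<<<grid, block, block * sizeof(float)>>>(in, block_out, n);
    }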
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                48
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               12,288
    Waves Per SM                                             0.12
  WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.47
    Achieved Active Warps Per SM        warp                 7.91
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.39
    SM Frequency                        cycle/nsecond        1.16
    Elapsed Cycles                      cycle                2,751
    Memory [%]                          %                    0.81
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    64.90
    L2 Cache Throughput                 %                    0.81
    SM Active Cycles                    cycle                13.87
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.13
    Achieved Active Warps Per SM        warp                 1.98
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.52
    SM Frequency                        cycle/nsecond        1.17
    Elapsed Cycles                      cycle                2,773
    Memory [%]                          %                    0.68
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    66.16
    L2 Cache Throughput                 %                    0.68
    SM Active Cycles                    cycle                13.60
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.21
    Achieved Active Warps Per SM        warp                 2.02
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
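Note: the 66.7% theoretical occupancy reported for these 64-thread fill launches follows from the block limits in the Occupancy sections: at most 16 resident blocks per SM x 2 warps per block = 32 of the 48 schedulable warps. The CUDA occupancy API can reproduce this arithmetic; the sketch below uses a hypothetical stand-in kernel (fill64), not PyTorch's FillFunctor kernel.

    // Sketch only: query resident blocks per SM and derive theoretical occupancy
    // for a few block sizes, using the CUDA occupancy API.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void fill64(float *out, float value, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = value;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 48 on this GPU

        const int sizes[] = {64, 128, 256};
        for (int block : sizes) {
            int blocksPerSM = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fill64, block, 0);
            int warps = blocksPerSM * block / prop.warpSize;
            printf("block=%3d -> %2d resident blocks/SM, theoretical occupancy %.1f%%\n",
                   block, blocksPerSM, 100.0f * warps / maxWarpsPerSM);
        }

        // The runtime can also suggest a block size that maximizes occupancy.
        int minGrid = 0, bestBlock = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &bestBlock, fill64, 0, 0);
        printf("suggested block size %d (minimum grid for a full device: %d)\n",
               bestBlock, minGrid);
        return 0;
    }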

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.18
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                25,032
    Memory [%]                          %                    52.35
    DRAM Throughput                     %                    52.35
    Duration                            usecond              18.34
    L1/TEX Cache Throughput             %                    35.82
    L2 Cache Throughput                 %                    30.73
    SM Active Cycles                    cycle                22,853.10
    Compute (SM) [%]                    %                    65.97
  WRN  Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    69.66
    Achieved Active Warps Per SM        warp                 33.43
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (69.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
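Note: a rough way to read the "Waves Per SM" figures in these launches (assumption: waves is approximately grid size / (resident blocks per SM x number of SMs)). With 6 resident 256-thread blocks per SM and 68 SMs, a 408-block grid is exactly one full wave, which matches the reported values: 408 blocks -> 1 wave, 120 -> 0.29, 48 -> 0.12, 6 -> 0.01.

    // Sketch only: back-of-the-envelope waves-per-SM check for the grid sizes
    // seen in this profile (68 SMs, 6 resident 256-thread blocks per SM).
    #include <cstdio>

    int main() {
        const int numSMs = 68;        // multiprocessor count reported for this GPU
        const int blocksPerSM = 6;    // block limit from the Occupancy sections (48 warps / 8 warps per block)
        const int grids[] = {408, 120, 48, 6};
        for (int g : grids)
            printf("grid %3d -> %.2f waves per SM\n", g, (float)g / (blocksPerSM * numSMs));
        return 0;
    }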

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.49
    SM Frequency                        cycle/nsecond        1.17
    Elapsed Cycles                      cycle                2,847
    Memory [%]                          %                    1.78
    DRAM Throughput                     %                    0.02
    Duration                            usecond              2.43
    L1/TEX Cache Throughput             %                    3.92
    L2 Cache Throughput                 %                    1.78
    SM Active Cycles                    cycle                559.21
    Compute (SM) [%]                    %                    0.26
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                43
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,752
    Waves Per SM                                             0.04
  WRN  The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.13
    Achieved Active Warps Per SM        warp                 1.98
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.67
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,011
    Memory [%]                          %                    1.20
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.98
    L1/TEX Cache Throughput             %                    4.82
    L2 Cache Throughput                 %                    1.20
    SM Active Cycles                    cycle                188.38
    Compute (SM) [%]                    %                    1.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                6
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,536
    Waves Per SM                                             0.01
  WRN  The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.40
    Achieved Active Warps Per SM        warp                 7.87
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.45
    SM Frequency                        cycle/nsecond        1.15
    Elapsed Cycles                      cycle                2,694
    Memory [%]                          %                    0.70
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.34
    L1/TEX Cache Throughput             %                    67.25
    L2 Cache Throughput                 %                    0.70
    SM Active Cycles                    cycle                13.38
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.07
    Achieved Active Warps Per SM        warp                 1.95
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.99
    SM Frequency                        cycle/nsecond        1.34
    Elapsed Cycles                      cycle                8,375
    Memory [%]                          %                    13.68
    DRAM Throughput                     %                    0.00
    Duration                            usecond              6.21
    L1/TEX Cache Throughput             %                    15.35
    L2 Cache Throughput                 %                    13.68
    SM Active Cycles                    cycle                6,309.60
    Compute (SM) [%]                    %                    48.28
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                408
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               104,448
    Waves Per SM                                             1
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    71.81
    Achieved Active Warps Per SM        warp                 34.47
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.86
    SM Frequency                        cycle/nsecond        1.51
    Elapsed Cycles                      cycle                5,405
    Memory [%]                          %                    2.62
    DRAM Throughput                     %                    0.00
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    5.34
    L2 Cache Throughput                 %                    2.19
    SM Active Cycles                    cycle                2,641.38
    Compute (SM) [%]                    %                    21.51
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                120
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               30,720
    Waves Per SM                                             0.29
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 block per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    28.46
    Achieved Active Warps Per SM        warp                 13.66
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.63
    SM Frequency                        cycle/nsecond        1.33
    Elapsed Cycles                      cycle                4,089
    Memory [%]                          %                    1.38
    DRAM Throughput                     %                    0.01
    Duration                            usecond              3.07
    L1/TEX Cache Throughput             %                    3.77
    L2 Cache Throughput                 %                    1.35
    SM Active Cycles                    cycle                1,496.04
    Compute (SM) [%]                    %                    11.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                48
    Registers Per Thread                register/thread      27
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               12,288
    Waves Per SM                                             0.12
  WRN  The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.60
    Achieved Active Warps Per SM        warp                 7.97
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.39
    SM Frequency                        cycle/nsecond        1.15
    Elapsed Cycles                      cycle                2,726
    Memory [%]                          %                    0.69
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.37
    L1/TEX Cache Throughput             %                    63.75
    L2 Cache Throughput                 %                    0.69
    SM Active Cycles                    cycle                14.12
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.01
    Achieved Active Warps Per SM        warp                 1.92
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        7.30
    SM Frequency                        cycle/nsecond        1.14
    Elapsed Cycles                      cycle                2,743
    Memory [%]                          %                    0.69
    DRAM Throughput                     %                    0.01
    Duration                            usecond              2.40
    L1/TEX Cache Throughput             %                    66.09
    L2 Cache Throughput                 %                    0.69
    SM Active Cycles                    cycle                13.62
    Compute (SM) [%]                    %                    0.01
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               64
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                1
    Registers Per Thread                register/thread      16
    Shared Memory Configuration Size    Kbyte                16.38
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               64
    Waves Per SM                                             0.00
  WRN  The grid for this launch is configured to execute only 1 block, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                64
    Block Limit Shared Mem              block                16
    Block Limit Warps                   block                24
    Theoretical Active Warps per SM     warp                 32
    Theoretical Occupancy               %                    66.67
    Achieved Occupancy                  %                    4.16
    Achieved Active Warps Per SM        warp                 1.99
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
  WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.17
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                25,082
    Memory [%]                          %                    52.30
    DRAM Throughput                     %                    52.30
    Duration                            usecond              18.37
    L1/TEX Cache Throughput             %                    35.78
    L2 Cache Throughput                 %                    30.72
    SM Active Cycles                    cycle                22,831.90
    Compute (SM) [%]                    %                    65.90
  WRN  Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.
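Note: the "moved to look-up tables" suggestion above refers to caching values of a repeated, bounded computation instead of recomputing them in every thread. The sketch below is a generic illustration (a constant-memory table indexed by a byte input), not a change to the profiled PyTorch normal_kernel; apply_exp_lut and init_exp_table are hypothetical names.

    // Sketch only: precompute a small function of a bounded integer input once
    // on the host and read it from constant memory instead of re-evaluating it
    // per thread.
    #include <cuda_runtime.h>
    #include <cmath>

    #define TABLE_SIZE 256
    __constant__ float d_expTable[TABLE_SIZE];

    __global__ void apply_exp_lut(const unsigned char *in, float *out, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            out[i] = d_expTable[in[i]];   // table read instead of expf(in[i] / 255.0f)
        }
    }

    void init_exp_table() {
        float h_table[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; ++i)
            h_table[i] = std::exp(i / 255.0f);
        cudaMemcpyToSymbol(d_expTable, h_table, sizeof(h_table));
    }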
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       70.64
    Achieved Active Warps Per SM         warp                    33.91
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (70.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.54
    SM Frequency                         cycle/nsecond           1.17
    Elapsed Cycles                       cycle                   2,855
    Memory [%]                           %                       1.77
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.43
    L1/TEX Cache Throughput              %                       4.04
    L2 Cache Throughput                  %                       1.77
    SM Active Cycles                     cycle                   542.82
    Compute (SM) [%]                     %                       0.26
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    43
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  2,752
    Waves Per SM                                                 0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.18
    Achieved Active Warps Per SM         warp                    2.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           9.00
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   11,534
    Memory [%]                           %                       17.31
    DRAM Throughput                      %                       0.18
    Duration                             usecond                 8.54
    L1/TEX Cache Throughput              %                       19.72
    L2 Cache Throughput                  %                       17.31
    SM Active Cycles                     cycle                   9,463.37
    Compute (SM) [%]                     %                       55.72
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       71.31
    Achieved Active Warps Per SM         warp                    34.23
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (71.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.77
    SM Frequency                         cycle/nsecond           1.36
    Elapsed Cycles                       cycle                   4,049
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.78
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   747.21
    Compute (SM) [%]                     %                       5.78
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    24
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.50
    Achieved Active Warps Per SM         warp                    7.92
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
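The "Waves Per SM 0.06" figure for this 24-block launch can be approximated from numbers already in the report. A minimal sketch; the formula grid_size / (SM_count * max_blocks_per_SM) is an approximation of what Nsight Compute reports rather than its documented definition, and the SM count of 68 is taken from the launch warning above:

    NUM_SMS = 68                 # "the GPU's 68 multiprocessors" from the warning above
    MAX_BLOCKS_PER_SM = 6        # "Block Limit Warps" for this 256-thread kernel
    grid_size = 24               # "Grid Size" in the Launch Statistics above

    waves_per_sm = grid_size / (NUM_SMS * MAX_BLOCKS_PER_SM)
    print(round(waves_per_sm, 2))   # ~0.06, in line with "Waves Per SM" above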
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.63
    SM Frequency                         cycle/nsecond           1.32
    Elapsed Cycles                       cycle                   4,073
    Memory [%]                           %                       1.39
    DRAM Throughput                      %                       0
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.78
    L2 Cache Throughput                  %                       1.35
    SM Active Cycles                     cycle                   1,495.24
    Compute (SM) [%]                     %                       11.50
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    48
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  12,288
    Waves Per SM                                                 0.12
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.59
    Achieved Active Warps Per SM         warp                    7.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.54
    SM Frequency                         cycle/nsecond           1.18
    Elapsed Cycles                       cycle                   2,753
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.34
    L1/TEX Cache Throughput              %                       63.22
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   14.24
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       3.97
    Achieved Active Warps Per SM         warp                    1.90
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.39
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,748
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       65.52
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.74
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.12
    Achieved Active Warps Per SM         warp                    1.98
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
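The achieved-occupancy percentages in these sections track the "Achieved Active Warps Per SM" row directly. A tiny sketch for the last fill launch above; dividing by a 48-warp-per-SM limit is an assumption consistent with the 100%-theoretical-occupancy sections in this log:

    MAX_WARPS_PER_SM = 48            # assumed device limit (48 warps = 100% elsewhere in this log)
    achieved_active_warps = 1.98     # "Achieved Active Warps Per SM" from the section above
    achieved_occupancy = 100.0 * achieved_active_warps / MAX_WARPS_PER_SM
    print(f"{achieved_occupancy:.1f}%")   # ~4.1%, in line with the 4.12% reported above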
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.67
    SM Frequency                         cycle/nsecond           1.35
    Elapsed Cycles                       cycle                   4,030
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.77
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   748.04
    Compute (SM) [%]                     %                       5.81
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    24
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  6,144
    Waves Per SM                                                 0.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 24 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.60
    Achieved Active Warps Per SM         warp                    7.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.67
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   4,117
    Memory [%]                           %                       1.37
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.57
    L2 Cache Throughput                  %                       1.34
    SM Active Cycles                     cycle                   1,583.29
    Compute (SM) [%]                     %                       11.36
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    48
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  12,288
    Waves Per SM                                                 0.12
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       15.69
    Achieved Active Warps Per SM         warp                    7.53
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.35
    SM Frequency                         cycle/nsecond           1.15
    Elapsed Cycles                       cycle                   2,729
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       66.02
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.63
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.15
    Achieved Active Warps Per SM         warp                    1.99
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.44
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,748
    Memory [%]                           %                       0.69
    DRAM Throughput                      %                       0.02
    Duration                             usecond                 2.37
    L1/TEX Cache Throughput              %                       66.31
    L2 Cache Throughput                  %                       0.69
    SM Active Cycles                     cycle                   13.57
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.18
    Achieved Active Warps Per SM         warp                    2.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.77
    SM Frequency                         cycle/nsecond           1.37
    Elapsed Cycles                       cycle                   4,072
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.98
    L1/TEX Cache Throughput              %                       3.77
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   998.87
    Compute (SM) [%]                     %                       7.66
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    32
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  8,192
    Waves Per SM                                                 0.08
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.67
    Achieved Active Warps Per SM         warp                    8.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.16
    SM Frequency                         cycle/nsecond           1.11
    Elapsed Cycles                       cycle                   2,701
    Memory [%]                           %                       0.70
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.43
    L1/TEX Cache Throughput              %                       67.77
    L2 Cache Throughput                  %                       0.70
    SM Active Cycles                     cycle                   13.28
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
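For the recurring "1 blocks vs. 68 multiprocessors" warnings, the SM count can also be read from the profiled Python process itself. A small sketch assuming PyTorch with a visible CUDA device (the grid size of 1 is taken from the single-block fill launches above):

    import torch

    props = torch.cuda.get_device_properties(0)
    num_sms = props.multi_processor_count      # 68 on the GPU profiled in this log
    grid_size = 1                              # the single-block fill launch above
    idle_sms = max(num_sms - grid_size, 0)
    print(f"{grid_size} block(s) across {num_sms} SMs -> at least {idle_sms} SMs with no resident block")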
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.10
    Achieved Active Warps Per SM         warp                    1.97
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.70
    SM Frequency                         cycle/nsecond           1.35
    Elapsed Cycles                       cycle                   3,990
    Memory [%]                           %                       1.20
    DRAM Throughput                      %                       0
    Duration                             usecond                 2.94
    L1/TEX Cache Throughput              %                       4.88
    L2 Cache Throughput                  %                       1.20
    SM Active Cycles                     cycle                   186.38
    Compute (SM) [%]                     %                       1.46
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    6
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  1,536
    Waves Per SM                                                 0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 6 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       16.65
    Achieved Active Warps Per SM         warp                    7.99
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           7.41
    SM Frequency                         cycle/nsecond           1.16
    Elapsed Cycles                       cycle                   2,715
    Memory [%]                           %                       0.70
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 2.34
    L1/TEX Cache Throughput              %                       67.40
    L2 Cache Throughput                  %                       0.70
    SM Active Cycles                     cycle                   13.35
    Compute (SM) [%]                     %                       0.01
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   64
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    1
    Registers Per Thread                 register/thread         16
    Shared Memory Configuration Size     Kbyte                   16.38
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  64
    Waves Per SM                                                 0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   64
    Block Limit Shared Mem               block                   16
    Block Limit Warps                    block                   24
    Theoretical Active Warps per SM      warp                    32
    Theoretical Occupancy                %                       66.67
    Achieved Occupancy                   %                       4.16
    Achieved Active Warps Per SM         warp                    2.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
          This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.96
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   8,343
    Memory [%]                           %                       13.69
    DRAM Throughput                      %                       0.00
    Duration                             usecond                 6.21
    L1/TEX Cache Throughput              %                       15.41
    L2 Cache Throughput                  %                       13.69
    SM Active Cycles                     cycle                   6,298.04
    Compute (SM) [%]                     %                       48.47
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    408
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  104,448
    Waves Per SM                                                 1
    ---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       72.14
    Achieved Active Warps Per SM         warp                    34.63
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (72.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.80
    SM Frequency                         cycle/nsecond           1.34
    Elapsed Cycles                       cycle                   4,827
    Memory [%]                           %                       2.93
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.58
    L1/TEX Cache Throughput              %                       5.30
    L2 Cache Throughput                  %                       2.44
    SM Active Cycles                     cycle                   2,665.75
    Compute (SM) [%]                     %                       24.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                   256
    Function Cache Configuration                                 cudaFuncCachePreferNone
    Grid Size                                                    120
    Registers Per Thread                 register/thread         27
    Shared Memory Configuration Size     Kbyte                   8.19
    Driver Shared Memory Per Block       Kbyte/block             1.02
    Dynamic Shared Memory Per Block      byte/block              0
    Static Shared Memory Per Block       byte/block              0
    Threads                              thread                  30,720
    Waves Per SM                                                 0.29
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                       block                   16
    Block Limit Registers                block                   8
    Block Limit Shared Mem               block                   8
    Block Limit Warps                    block                   6
    Theoretical Active Warps per SM      warp                    48
    Theoretical Occupancy                %                       100
    Achieved Occupancy                   %                       28.27
    Achieved Active Warps Per SM         warp                    13.57
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (28.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                       cycle/nsecond           8.60
    SM Frequency                         cycle/nsecond           1.33
    Elapsed Cycles                       cycle                   4,084
    Memory [%]                           %                       1.38
    DRAM Throughput                      %                       0.01
    Duration                             usecond                 3.07
    L1/TEX Cache Throughput              %                       3.75
    L2 Cache Throughput                  %                       1.34
    SM Active Cycles                     cycle                   1,504.85
    Compute (SM) [%]                     %                       11.45
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         48
  Registers Per Thread                      register/thread                         27
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                              12,288
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 48 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    16.41
  Achieved Active Warps Per SM              warp                                  7.88
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:25, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.54
  SM Frequency                              cycle/nsecond                         1.18
  Elapsed Cycles                            cycle                                2,746
  Memory [%]                                %                                     0.69
  DRAM Throughput                           %                                     0.02
  Duration                                  usecond                               2.34
  L1/TEX Cache Throughput                   %                                    66.02
  L2 Cache Throughput                       %                                     0.69
  SM Active Cycles                          cycle                                13.63
  Compute (SM) [%]                          %                                     0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
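
The warning's "reduce the block size" suggestion is just arithmetic on the Threads row above; halving the block size is only legal if the kernel does not assume 256 threads per block, so treat this as illustrative:

\[
\frac{12{,}288\ \text{threads}}{256\ \text{threads/block}} = 48\ \text{blocks} < 68\ \text{SMs},
\qquad
\frac{12{,}288\ \text{threads}}{128\ \text{threads/block}} = 96\ \text{blocks} \ge 68\ \text{SMs}.
\]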
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          1
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                                  64
  Waves Per SM                                                                    0.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     4.16
  Achieved Active Warps Per SM              warp                                  2.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:25, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.48
  SM Frequency                              cycle/nsecond                         1.16
  Elapsed Cycles                            cycle                                2,756
  Memory [%]                                %                                     0.69
  DRAM Throughput                           %                                     0.01
  Duration                                  usecond                               2.37
  L1/TEX Cache Throughput                   %                                    65.11
  L2 Cache Throughput                       %                                     0.69
  SM Active Cycles                          cycle                                13.82
  Compute (SM) [%]                          %                                     0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          1
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                                  64
  Waves Per SM                                                                    0.00
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     4.14
  Achieved Active Warps Per SM              warp                                  1.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
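
The 66.67% ceiling reported for these single-block fill launches follows from the block-limit rows above: the binding limit is 16 resident blocks per SM, each block of 64 threads contributes 2 warps, and the report measures occupancy against the 48-warp per-SM capacity it uses elsewhere:

\[
\min(16,\,64,\,16,\,24) \times \frac{64}{32} = 16 \times 2 = 32\ \text{warps},
\qquad
\frac{32}{48} \approx 66.7\%.
\]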
void at::native::::distribution_elementwise_grid_stride_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::::distribution_nullary_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel(at::Tensor &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIterator &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         9.21
  SM Frequency                              cycle/nsecond                         1.36
  Elapsed Cycles                            cycle                               25,079
  Memory [%]                                %                                    52.12
  DRAM Throughput                           %                                    52.12
  Duration                                  usecond                              18.34
  L1/TEX Cache Throughput                   %                                    35.76
  L2 Cache Throughput                       %                                    30.61
  SM Active Cycles                          cycle                            22,854.72
  Compute (SM) [%]                          %                                    65.86
  ---------------------------------------------------- --------------- ------------------------------
  WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        408
  Registers Per Thread                      register/thread                         27
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                             104,448
  Waves Per SM                                                                       1
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    69.72
  Achieved Active Warps Per SM              warp                                 33.46
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (69.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.58
  SM Frequency                              cycle/nsecond                         1.19
  Elapsed Cycles                            cycle                                2,887
  Memory [%]                                %                                     1.77
  DRAM Throughput                           %                                     0.01
  Duration                                  usecond                               2.43
  L1/TEX Cache Throughput                   %                                     3.85
  L2 Cache Throughput                       %                                     1.77
  SM Active Cycles                          cycle                               570.18
  Compute (SM) [%]                          %                                     0.26
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         43
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,752
  Waves Per SM                                                                    0.04
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 43 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     3.97
  Achieved Active Warps Per SM              warp                                  1.90
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.95
  SM Frequency                              cycle/nsecond                         1.22
  Elapsed Cycles                            cycle                                6,768
  Memory [%]                                %                                     9.18
  DRAM Throughput                           %                                     2.83
  Duration                                  usecond                               5.54
  L1/TEX Cache Throughput                   %                                     8.93
  L2 Cache Throughput                       %                                     9.18
  SM Active Cycles                          cycle                             4,508.90
  Compute (SM) [%]                          %                                    25.46
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.7 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        544
  Registers Per Thread                      register/thread                         32
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                              69,632
  Waves Per SM                                                                    0.67
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   16
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    55.75
  Achieved Active Warps Per SM              warp                                 26.76
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (55.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.20
  SM Frequency                              cycle/nsecond                         1.25
  Elapsed Cycles                            cycle                                6,046
  Memory [%]                                %                                    15.63
  DRAM Throughput                           %                                    15.51
  Duration                                  usecond                               4.83
  L1/TEX Cache Throughput                   %                                    10.38
  L2 Cache Throughput                       %                                    15.63
  SM Active Cycles                          cycle                             3,739.81
  Compute (SM) [%]                          %                                    24.79
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                        408
  Registers Per Thread                      register/thread                         28
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                             104,448
  Waves Per SM                                                                       1
  ---------------------------------------------------- --------------- ------------------------------

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    77.63
  Achieved Active Warps Per SM              warp                                 37.26
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (77.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array>(int, T2, T3), 2023-Apr-06 16:56:26, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.62
  SM Frequency                              cycle/nsecond                         1.19
  Elapsed Cycles                            cycle                                2,894
  Memory [%]                                %                                     2.53
  DRAM Throughput                           %                                     0.02
  Duration                                  usecond                               2.43
  L1/TEX Cache Throughput                   %                                     3.80
  L2 Cache Throughput                       %                                     2.53
  SM Active Cycles                          cycle                               845.90
  Compute (SM) [%]                          %                                     0.37
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                        64
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         64
  Registers Per Thread                      register/thread                         16
  Shared Memory Configuration Size          Kbyte                                16.38
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               4,096
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                   64
  Block Limit Shared Mem                    block                                   16
  Block Limit Warps                         block                                   24
  Theoretical Active Warps per SM           warp                                    32
  Theoretical Occupancy                     %                                    66.67
  Achieved Occupancy                        %                                     3.94
  Achieved Active Warps Per SM              warp                                  1.89
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
  WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
  WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (3.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.32
  SM Frequency                              cycle/nsecond                         1.29
  Elapsed Cycles                            cycle                                5,709
  Memory [%]                                %                                     1.30
  DRAM Throughput                           %                                     0.84
  Duration                                  usecond                               4.42
  L1/TEX Cache Throughput                   %                                    31.64
  L2 Cache Throughput                       %                                     1.30
  SM Active Cycles                          cycle                               147.24
  Compute (SM) [%]                          %                                     0.82
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.11
  Achieved Active Warps Per SM              warp                                  7.73
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.16
  SM Frequency                              cycle/nsecond                         1.27
  Elapsed Cycles                            cycle                                6,062
  Memory [%]                                %                                     1.71
  DRAM Throughput                           %                                     1.71
  Duration                                  usecond                               4.77
  L1/TEX Cache Throughput                   %                                    36.39
  L2 Cache Throughput                       %                                     1.69
  SM Active Cycles                          cycle                               186.21
  Compute (SM) [%]                          %                                     1.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
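
The transpose launches in this part of the report stop at 83.3% theoretical occupancy because of registers: 48 registers/thread x 256 threads is 12,288 registers per block, which (assuming the usual 64K-register file per SM) fits only 5 blocks, i.e. 40 of 48 warps, exactly the Block Limit Registers and Theoretical Active Warps rows above. The cuBLAS kernel here is prebuilt and cannot be retuned, but for kernels compiled from source __launch_bounds__ is the standard way to ask the compiler for a register budget; a conventional tiled-transpose sketch, illustrative only:

    // Ask nvcc to fit 6 blocks of 256 threads per SM (roughly 42 registers/thread
    // on a 64K-register SM); launch with blockDim = (32, 8), gridDim covering the matrix.
    __global__ void __launch_bounds__(256, 6)
    tile_transpose(float *dst, const float *src, int rows, int cols) {
        __shared__ float tile[32][33];                 // +1 column avoids bank conflicts
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        for (int j = 0; j < 32; j += 8)                // each thread loads 4 tile rows
            if (x < cols && y + j < rows)
                tile[threadIdx.y + j][threadIdx.x] = src[(y + j) * cols + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;             // transposed block offsets
        y = blockIdx.x * 32 + threadIdx.y;
        for (int j = 0; j < 32; j += 8)
            if (x < rows && y + j < cols)
                dst[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }

Whether the extra residency helps depends on whether the kernel is actually occupancy-bound; with a 3-block grid, as here, the grid size dominates long before registers do.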
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.05
  Achieved Active Warps Per SM              warp                                  7.70
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.88
  SM Frequency                              cycle/nsecond                         1.23
  Elapsed Cycles                            cycle                                5,190
  Memory [%]                                %                                     1.43
  DRAM Throughput                           %                                     0.93
  Duration                                  usecond                               4.22
  L1/TEX Cache Throughput                   %                                    31.99
  L2 Cache Throughput                       %                                     1.43
  SM Active Cycles                          cycle                               145.62
  Compute (SM) [%]                          %                                     0.90
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.55
  Achieved Active Warps Per SM              warp                                  7.94
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void transpose_readWrite_alignment_kernel(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.08
  SM Frequency                              cycle/nsecond                         1.26
  Elapsed Cycles                            cycle                                5,942
  Memory [%]                                %                                     1.65
  DRAM Throughput                           %                                     1.62
  Duration                                  usecond                               4.70
  L1/TEX Cache Throughput                   %                                    38.02
  L2 Cache Throughput                       %                                     1.65
  SM Active Cycles                          cycle                               178.24
  Compute (SM) [%]                          %                                     1.14
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          3
  Registers Per Thread                      register/thread                         48
  Shared Memory Configuration Size          Kbyte                                65.54
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            Kbyte/block                           8.32
  Threads                                   thread                                 768
  Waves Per SM                                                                    0.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 3 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    7
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    40
  Theoretical Occupancy                     %                                    83.33
  Achieved Occupancy                        %                                    16.63
  Achieved Active Warps Per SM              warp                                  7.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
  WRN   The difference between calculated theoretical (83.3%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.11
  SM Frequency                              cycle/nsecond                         1.26
  Elapsed Cycles                            cycle                                7,122
  Memory [%]                                %                                     4.92
  DRAM Throughput                           %                                     1.57
  Duration                                  usecond                               5.63
  L1/TEX Cache Throughput                   %                                    24.12
  L2 Cache Throughput                       %                                     4.92
  SM Active Cycles                          cycle                               935.53
  Compute (SM) [%]                          %                                     2.74
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         16
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,048
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.35
  Achieved Active Warps Per SM              warp                                  4.01
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.27
  SM Frequency                              cycle/nsecond                         1.29
  Elapsed Cycles                            cycle                                7,256
  Memory [%]                                %                                     4.83
  DRAM Throughput                           %                                     1.54
  Duration                                  usecond                               5.63
  L1/TEX Cache Throughput                   %                                    23.78
  L2 Cache Throughput                       %                                     4.83
  SM Active Cycles                          cycle                               950.69
  Compute (SM) [%]                          %                                     2.69
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         16
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               2,048
  Waves Per SM                                                                    0.12
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.32
  Achieved Active Warps Per SM              warp                                  3.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
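
The 16.67% ceiling for these cutlass kernels follows from shared memory alone: each block requests 49.15 Kbyte of dynamic shared memory plus the 1.02 Kbyte driver allocation, against the 102.40 Kbyte configuration shown above, and a 128-thread block contributes 4 warps out of the 48-warp capacity the report uses:

\[
\left\lfloor \frac{102.40}{49.15 + 1.02} \right\rfloor = 2\ \text{blocks/SM},
\qquad
2 \times \frac{128}{32} = 8\ \text{warps},
\qquad
\frac{8}{48} \approx 16.7\%.
\]

For a GEMM that deliberately trades occupancy for a large shared-memory tile, this is usually intentional; the more relevant problem in these launches is the 8-to-16-block grids.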
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:28, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.51
  SM Frequency                              cycle/nsecond                         1.33
  Elapsed Cycles                            cycle                                8,773
  Memory [%]                                %                                     3.28
  DRAM Throughput                           %                                     1.83
  Duration                                  usecond                               6.59
  L1/TEX Cache Throughput                   %                                    24.97
  L2 Cache Throughput                       %                                     3.28
  SM Active Cycles                          cycle                               586.49
  Compute (SM) [%]                          %                                     1.41
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.29
  Achieved Active Warps Per SM              warp                                  3.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:29, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.87
  SM Frequency                              cycle/nsecond                         1.38
  Elapsed Cycles                            cycle                                5,056
  Memory [%]                                %                                     8.93
  DRAM Throughput                           %                                     8.93
  Duration                                  usecond                               3.65
  L1/TEX Cache Throughput                   %                                    10.45
  L2 Cache Throughput                       %                                     7.89
  SM Active Cycles                          cycle                             1,369.06
  Compute (SM) [%]                          %                                     2.91
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    15.93
  Achieved Active Warps Per SM              warp                                  7.65
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
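
For this GRU kernel the Achieved Occupancy row is simply the ratio of its neighbours: 7.65 achieved active warps against the 48-warp capacity is about 15.9%. With only 32 blocks launched on 68 multiprocessors, no SM ever holds more than one block of this kernel and more than half hold none, so the gap to the 100% theoretical figure is a launch-size effect rather than a resource limit.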
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:29, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         7.92
  SM Frequency                              cycle/nsecond                         1.23
  Elapsed Cycles                            cycle                                8,302
  Memory [%]                                %                                     3.47
  DRAM Throughput                           %                                     1.92
  Duration                                  usecond                               6.72
  L1/TEX Cache Throughput                   %                                    25.39
  L2 Cache Throughput                       %                                     3.47
  SM Active Cycles                          cycle                               577.35
  Compute (SM) [%]                          %                                     1.49
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.32
  Achieved Active Warps Per SM              warp                                  3.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:29, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.65
  SM Frequency                              cycle/nsecond                         1.35
  Elapsed Cycles                            cycle                                4,926
  Memory [%]                                %                                     9.16
  DRAM Throughput                           %                                     9.16
  Duration                                  usecond                               3.65
  L1/TEX Cache Throughput                   %                                    10.94
  L2 Cache Throughput                       %                                     8.10
  SM Active Cycles                          cycle                             1,307.31
  Compute (SM) [%]                          %                                     2.99
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    16.85
  Achieved Active Warps Per SM              warp                                  8.09
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:30, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.40
  SM Frequency                              cycle/nsecond                         1.31
  Elapsed Cycles                            cycle                                8,716
  Memory [%]                                %                                     3.30
  DRAM Throughput                           %                                     1.83
  Duration                                  usecond                               6.66
  L1/TEX Cache Throughput                   %                                    24.98
  L2 Cache Throughput                       %                                     3.30
  SM Active Cycles                          cycle                               585.04
  Compute (SM) [%]                          %                                     1.42
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       128
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                          8
  Registers Per Thread                      register/thread                         96
  Shared Memory Configuration Size          Kbyte                               102.40
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           Kbyte/block                          49.15
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               1,024
  Waves Per SM                                                                    0.06
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    5
  Block Limit Shared Mem                    block                                    2
  Block Limit Warps                         block                                   12
  Theoretical Active Warps per SM           warp                                     8
  Theoretical Occupancy                     %                                    16.67
  Achieved Occupancy                        %                                     8.29
  Achieved Active Warps Per SM              warp                                  3.98
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
  WRN   See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:30, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------- --------------- ------------------------------
  DRAM Frequency                            cycle/nsecond                         8.63
  SM Frequency                              cycle/nsecond                         1.34
  Elapsed Cycles                            cycle                                4,993
  Memory [%]                                %                                     9.01
  DRAM Throughput                           %                                     9.01
  Duration                                  usecond                               3.71
  L1/TEX Cache Throughput                   %                                    10.37
  L2 Cache Throughput                       %                                     7.97
  SM Active Cycles                          cycle                             1,380.16
  Compute (SM) [%]                          %                                     2.95
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ---------------------------------------------------- --------------- ------------------------------
  Block Size                                                                       256
  Function Cache Configuration                                   cudaFuncCachePreferNone
  Grid Size                                                                         32
  Registers Per Thread                      register/thread                         30
  Shared Memory Configuration Size          Kbyte                                 8.19
  Driver Shared Memory Per Block            Kbyte/block                           1.02
  Dynamic Shared Memory Per Block           byte/block                               0
  Static Shared Memory Per Block            byte/block                               0
  Threads                                   thread                               8,192
  Waves Per SM                                                                    0.08
  ---------------------------------------------------- --------------- ------------------------------
  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  ---------------------------------------------------- --------------- ------------------------------
  Block Limit SM                            block                                   16
  Block Limit Registers                     block                                    8
  Block Limit Shared Mem                    block                                    8
  Block Limit Warps                         block                                    6
  Theoretical Active Warps per SM           warp                                    48
  Theoretical Occupancy                     %                                      100
  Achieved Occupancy                        %                                    15.99
  Achieved Active Warps Per SM              warp                                  7.68
  ---------------------------------------------------- --------------- ------------------------------
  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:30, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.20 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,414 cycle | Memory [%]: 3.42 | DRAM Throughput: 1.90 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 25.06 % | L2 Cache Throughput: 3.42 % | SM Active Cycles: 584.99 cycle | Compute (SM) [%]: 1.47
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.31 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:30, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.91 cycle/nsecond | SM Frequency: 1.38 cycle/nsecond | Elapsed Cycles: 4,907 cycle | Memory [%]: 9.12 | DRAM Throughput: 9.12 % | Duration: 3.55 usecond | L1/TEX Cache Throughput: 10.48 % | L2 Cache Throughput: 8.10 % | SM Active Cycles: 1,365.68 cycle | Compute (SM) [%]: 3.00
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.93 % | Achieved Active Warps Per SM: 7.64
    WRN: achieved occupancy (15.9%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.18 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 7,110 cycle | Memory [%]: 4.94 | DRAM Throughput: 1.58 % | Duration: 5.57 usecond | L1/TEX Cache Throughput: 23.59 % | L2 Cache Throughput: 4.94 % | SM Active Cycles: 941.12 cycle | Compute (SM) [%]: 2.74
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.30 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.23 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 7,243 cycle | Memory [%]: 4.85 | DRAM Throughput: 1.54 % | Duration: 5.66 usecond | L1/TEX Cache Throughput: 23.24 % | L2 Cache Throughput: 4.85 % | SM Active Cycles: 964.50 cycle | Compute (SM) [%]: 2.69
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.27 % | Achieved Active Warps Per SM: 3.97
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:31, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.36 cycle/nsecond | SM Frequency: 1.30 cycle/nsecond | Elapsed Cycles: 8,584 cycle | Memory [%]: 3.35 | DRAM Throughput: 1.86 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 24.75 % | L2 Cache Throughput: 3.35 % | SM Active Cycles: 593.90 cycle | Compute (SM) [%]: 1.44
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.37 % | Achieved Active Warps Per SM: 4.02
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:31, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.20 cycle/nsecond | SM Frequency: 1.44 cycle/nsecond | Elapsed Cycles: 5,199 cycle | Memory [%]: 8.68 | DRAM Throughput: 8.68 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.50 % | L2 Cache Throughput: 7.66 % | SM Active Cycles: 1,362.65 cycle | Compute (SM) [%]: 2.83
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.31 % | Achieved Active Warps Per SM: 7.83
    WRN: achieved occupancy (16.3%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:32, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.19 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,333 cycle | Memory [%]: 3.45 | DRAM Throughput: 1.92 % | Duration: 6.53 usecond | L1/TEX Cache Throughput: 25.31 % | L2 Cache Throughput: 3.45 % | SM Active Cycles: 578.85 cycle | Compute (SM) [%]: 1.48
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:32, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.77 cycle/nsecond | SM Frequency: 1.36 cycle/nsecond | Elapsed Cycles: 5,054 cycle | Memory [%]: 8.87 | DRAM Throughput: 8.87 % | Duration: 3.71 usecond | L1/TEX Cache Throughput: 10.40 % | L2 Cache Throughput: 7.88 % | SM Active Cycles: 1,376.18 cycle | Compute (SM) [%]: 2.91
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.80 % | Achieved Active Warps Per SM: 7.58
    WRN: achieved occupancy (15.8%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:32, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.39 cycle/nsecond | SM Frequency: 1.31 cycle/nsecond | Elapsed Cycles: 8,637 cycle | Memory [%]: 3.33 | DRAM Throughput: 1.85 % | Duration: 6.59 usecond | L1/TEX Cache Throughput: 24.74 % | L2 Cache Throughput: 3.33 % | SM Active Cycles: 591.04 cycle | Compute (SM) [%]: 1.43
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:33, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.48 cycle/nsecond | SM Frequency: 1.32 cycle/nsecond | Elapsed Cycles: 5,065 cycle | Memory [%]: 8.87 | DRAM Throughput: 8.87 % | Duration: 3.84 usecond | L1/TEX Cache Throughput: 10.35 % | L2 Cache Throughput: 7.87 % | SM Active Cycles: 1,381.91 cycle | Compute (SM) [%]: 2.90
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.70 % | Achieved Active Warps Per SM: 7.54
    WRN: achieved occupancy (15.7%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:33, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.05 cycle/nsecond | SM Frequency: 1.25 cycle/nsecond | Elapsed Cycles: 8,319 cycle | Memory [%]: 3.46 | DRAM Throughput: 1.92 % | Duration: 6.62 usecond | L1/TEX Cache Throughput: 25.29 % | L2 Cache Throughput: 3.46 % | SM Active Cycles: 578.35 cycle | Compute (SM) [%]: 1.49
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.32 % | Achieved Active Warps Per SM: 4.00
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:33, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.11 cycle/nsecond | SM Frequency: 1.42 cycle/nsecond | Elapsed Cycles: 5,062 cycle | Memory [%]: 8.93 | DRAM Throughput: 8.93 % | Duration: 3.55 usecond | L1/TEX Cache Throughput: 10.48 % | L2 Cache Throughput: 7.84 % | SM Active Cycles: 1,365.50 cycle | Compute (SM) [%]: 2.90
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.78 % | Achieved Active Warps Per SM: 7.57
    WRN: achieved occupancy (15.8%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:33, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.09 cycle/nsecond | SM Frequency: 1.26 cycle/nsecond | Elapsed Cycles: 7,080 cycle | Memory [%]: 4.96 | DRAM Throughput: 1.58 % | Duration: 5.63 usecond | L1/TEX Cache Throughput: 24.20 % | L2 Cache Throughput: 4.96 % | SM Active Cycles: 935.44 cycle | Compute (SM) [%]: 2.76
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.31 % | Achieved Active Warps Per SM: 3.99
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:34, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.14 cycle/nsecond | SM Frequency: 1.27 cycle/nsecond | Elapsed Cycles: 7,190 cycle | Memory [%]: 4.88 | DRAM Throughput: 1.56 % | Duration: 5.66 usecond | L1/TEX Cache Throughput: 23.61 % | L2 Cache Throughput: 4.88 % | SM Active Cycles: 955.99 cycle | Compute (SM) [%]: 2.71
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 16 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 2,048 | Waves Per SM: 0.12
    WRN: only 16 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.29 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:34, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.22 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,487 cycle | Memory [%]: 3.39 | DRAM Throughput: 1.88 % | Duration: 6.62 usecond | L1/TEX Cache Throughput: 25.02 % | L2 Cache Throughput: 3.39 % | SM Active Cycles: 585.12 cycle | Compute (SM) [%]: 1.46
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.29 % | Achieved Active Warps Per SM: 3.98
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:34, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.78 cycle/nsecond | SM Frequency: 1.36 cycle/nsecond | Elapsed Cycles: 4,941 cycle | Memory [%]: 9.09 | DRAM Throughput: 9.09 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.46 % | L2 Cache Throughput: 8.05 % | SM Active Cycles: 1,367.53 cycle | Compute (SM) [%]: 2.98
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.13 % | Achieved Active Warps Per SM: 7.74
    WRN: achieved occupancy (16.1%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:35, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond | SM Frequency: 1.29 cycle/nsecond | Elapsed Cycles: 8,367 cycle | Memory [%]: 3.44 | DRAM Throughput: 1.91 % | Duration: 6.50 usecond | L1/TEX Cache Throughput: 25.27 % | L2 Cache Throughput: 3.44 % | SM Active Cycles: 580.94 cycle | Compute (SM) [%]: 1.48
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.79 % | Achieved Active Warps Per SM: 4.22
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:35, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 10.03 cycle/nsecond | SM Frequency: 1.56 cycle/nsecond | Elapsed Cycles: 5,899 cycle | Memory [%]: 7.62 | DRAM Throughput: 7.62 % | Duration: 3.78 usecond | L1/TEX Cache Throughput: 10.09 % | L2 Cache Throughput: 6.72 % | SM Active Cycles: 1,417.19 cycle | Compute (SM) [%]: 2.49
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 15.97 % | Achieved Active Warps Per SM: 7.67
    WRN: achieved occupancy (16.0%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:35, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond | SM Frequency: 1.28 cycle/nsecond | Elapsed Cycles: 8,413 cycle | Memory [%]: 3.42 | DRAM Throughput: 1.89 % | Duration: 6.56 usecond | L1/TEX Cache Throughput: 25.00 % | L2 Cache Throughput: 3.42 % | SM Active Cycles: 586.60 cycle | Compute (SM) [%]: 1.47
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 128 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 8 | Registers Per Thread: 96 | Shared Memory Configuration Size: 102.40 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 49.15 Kbyte | Static Shared Memory Per Block: 0 byte | Threads: 1,024 | Waves Per SM: 0.06
    WRN: only 8 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 5 | Block Limit Shared Mem: 2 | Block Limit Warps: 12 | Theoretical Active Warps per SM: 8 | Theoretical Occupancy: 16.67 % | Achieved Occupancy: 8.25 % | Achieved Active Warps Per SM: 3.96
    WRN: theoretical occupancy (16.7%) is limited by the required amount of shared memory.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:35, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.86 cycle/nsecond | SM Frequency: 1.38 cycle/nsecond | Elapsed Cycles: 4,995 cycle | Memory [%]: 9.02 | DRAM Throughput: 9.02 % | Duration: 3.62 usecond | L1/TEX Cache Throughput: 10.58 % | L2 Cache Throughput: 7.96 % | SM Active Cycles: 1,352.21 cycle | Compute (SM) [%]: 2.94
    WRN: grid too small to fill the device; only 0.1 full waves across all SMs.
  Section: Launch Statistics
    Block Size: 256 | Function Cache Configuration: cudaFuncCachePreferNone | Grid Size: 32 | Registers Per Thread: 30 | Shared Memory Configuration Size: 8.19 Kbyte | Driver Shared Memory Per Block: 1.02 Kbyte | Dynamic Shared Memory Per Block: 0 byte | Static Shared Memory Per Block: 0 byte | Threads: 8,192 | Waves Per SM: 0.08
    WRN: only 32 blocks launched vs. the GPU's 68 SMs; see the launch-configuration guidance above.
  Section: Occupancy
    Block Limit SM: 16 | Block Limit Registers: 8 | Block Limit Shared Mem: 8 | Block Limit Warps: 6 | Theoretical Active Warps per SM: 48 | Theoretical Occupancy: 100 % | Achieved Occupancy: 16.05 % | Achieved Active Warps Per SM: 7.70
    WRN: achieved occupancy (16.0%) is far below the theoretical 100%; likely warp-scheduling overhead or workload imbalance (no block limit applies).
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:36, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.39
    SM Frequency               cycle/nsecond     1.31
    Elapsed Cycles             cycle             8,549
    Memory [%]                 %                 3.37
    DRAM Throughput            %                 1.87
    Duration                   usecond           6.53
    L1/TEX Cache Throughput    %                 25.18
    L2 Cache Throughput        %                 3.37
    SM Active Cycles           cycle             579.01
    Compute (SM) [%]           %                 1.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.31
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:36, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.83
    SM Frequency               cycle/nsecond     1.37
    Elapsed Cycles             cycle             4,920
    Memory [%]                 %                 9.12
    DRAM Throughput            %                 9.12
    Duration                   usecond           3.58
    L1/TEX Cache Throughput    %                 10.61
    L2 Cache Throughput        %                 8.10
    SM Active Cycles           cycle             1,348.37
    Compute (SM) [%]           %                 2.99
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.06
    Achieved Active Warps Per SM       warp      7.71
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
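For the GRU kernel the theoretical occupancy is 100% yet the achieved occupancy sits near 16%. The two measured rows are mutually consistent: achieved occupancy is just the averaged active-warp count divided by the per-SM maximum (assumed to be 48 warps here). A sketch using the values from the two reports above:

```python
MAX_WARPS_PER_SM = 48  # assumed per-SM warp limit used throughout these reports

def achieved_occupancy(avg_active_warps_per_sm: float) -> float:
    """Achieved occupancy (%) from 'Achieved Active Warps Per SM'."""
    return 100.0 * avg_active_warps_per_sm / MAX_WARPS_PER_SM

print(round(achieved_occupancy(7.71), 2))  # GRU report above     -> 16.06
print(round(achieved_occupancy(3.99), 2))  # cutlass report above -> 8.31
```

The gap to 100% for the GRU kernel therefore comes from the tiny grid (0.08 waves), not from any per-block resource limit, which matches the "not impacted by any block limit" note.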
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:36, Context 1, Stream 25

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.28
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             7,159
    Memory [%]                 %                 4.90
    DRAM Throughput            %                 1.56
    Duration                   usecond           5.57
    L1/TEX Cache Throughput    %                 23.87
    L2 Cache Throughput        %                 4.90
    SM Active Cycles           cycle             945.26
    Compute (SM) [%]           %                 2.72
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.29
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 27

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.17
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             7,221
    Memory [%]                 %                 4.87
    DRAM Throughput            %                 1.55
    Duration                   usecond           5.66
    L1/TEX Cache Throughput    %                 23.58
    L2 Cache Throughput        %                 4.87
    SM Active Cycles           cycle             955.88
    Compute (SM) [%]           %                 2.70
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.31
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.18
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             8,361
    Memory [%]                 %                 3.49
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.56
    L1/TEX Cache Throughput    %                 25.07
    L2 Cache Throughput        %                 3.49
    SM Active Cycles           cycle             584.38
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
       If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:37, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     9.40
    SM Frequency               cycle/nsecond     1.46
    Elapsed Cycles             cycle             5,188
    Memory [%]                 %                 8.65
    DRAM Throughput            %                 8.65
    Duration                   usecond           3.55
    L1/TEX Cache Throughput    %                 10.69
    L2 Cache Throughput        %                 7.65
    SM Active Cycles           cycle             1,338.46
    Compute (SM) [%]           %                 2.83
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.30
    Achieved Active Warps Per SM       warp      7.82
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:37, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.21
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             8,300
    Memory [%]                 %                 3.47
    DRAM Throughput            %                 1.92
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.27
    L2 Cache Throughput        %                 3.47
    SM Active Cycles           cycle             579.13
    Compute (SM) [%]           %                 1.49
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.30
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:38, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.65
    SM Frequency               cycle/nsecond     1.34
    Elapsed Cycles             cycle             4,902
    Memory [%]                 %                 9.15
    DRAM Throughput            %                 9.15
    Duration                   usecond           3.65
    L1/TEX Cache Throughput    %                 9.51
    L2 Cache Throughput        %                 8.12
    SM Active Cycles           cycle             1,504.65
    Compute (SM) [%]           %                 3.00
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         14.42
    Achieved Active Warps Per SM       warp      6.92
  WRN  This kernel's theoretical occupancy is not impacted by any block limit.
       The difference between calculated theoretical (100.0%) and measured achieved occupancy (14.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:38, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.05
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             8,362
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.92
    Duration                   usecond           6.66
    L1/TEX Cache Throughput    %                 24.84
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             588.41
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.26
    Achieved Active Warps Per SM       warp      3.96
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:38, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.63
    SM Frequency               cycle/nsecond     1.34
    Elapsed Cycles             cycle             4,982
    Memory [%]                 %                 9.01
    DRAM Throughput            %                 9.01
    Duration                   usecond           3.71
    L1/TEX Cache Throughput    %                 10.42
    L2 Cache Throughput        %                 8.01
    SM Active Cycles           cycle             1,372.88
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         15.96
    Achieved Active Warps Per SM       warp      7.66
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:38, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.26
    SM Frequency               cycle/nsecond     1.29
    Elapsed Cycles             cycle             8,367
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.25
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             580.40
    Compute (SM) [%]           %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:39, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.88
    SM Frequency               cycle/nsecond     1.38
    Elapsed Cycles             cycle             4,921
    Memory [%]                 %                 9.15
    DRAM Throughput            %                 9.15
    Duration                   usecond           3.55
    L1/TEX Cache Throughput    %                 10.82
    L2 Cache Throughput        %                 8.10
    SM Active Cycles           cycle             1,322.51
    Compute (SM) [%]           %                 2.99
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.36
    Achieved Active Warps Per SM       warp      7.85
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
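The "Block Limit Registers" and "Block Limit Warps" rows for the two kernels that alternate through this trace can likewise be recomputed from their Launch Statistics. A sketch, again assuming a 64 K register file and 48 resident warps per SM:

```python
REGS_PER_SM      = 64 * 1024  # assumed register file per SM
MAX_WARPS_PER_SM = 48         # assumed resident-warp limit per SM

def block_limits(threads_per_block: int, regs_per_thread: int):
    """Register- and warp-based limits on resident blocks per SM."""
    limit_regs  = REGS_PER_SM // (regs_per_thread * threads_per_block)
    limit_warps = MAX_WARPS_PER_SM // (threads_per_block // 32)
    return limit_regs, limit_warps

print(block_limits(128, 96))  # cutlass::Kernel    -> (5, 12)
print(block_limits(256, 30))  # GRU_elementWise_fp -> (8, 6)
```

For the GRU launch the binding limit is warps (6 blocks × 8 warps = 48, i.e. 100% theoretical occupancy); for the cutlass launch the binding limit is the shared-memory figure of 2 blocks computed earlier.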
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:39, Context 1, Stream 25

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.16
    SM Frequency               cycle/nsecond     1.27
    Elapsed Cycles             cycle             7,122
    Memory [%]                 %                 4.93
    DRAM Throughput            %                 1.57
    Duration                   usecond           5.60
    L1/TEX Cache Throughput    %                 24.23
    L2 Cache Throughput        %                 4.93
    SM Active Cycles           cycle             938.99
    Compute (SM) [%]           %                 2.74
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.32
    Achieved Active Warps Per SM       warp      3.99
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:39, Context 1, Stream 27

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.13
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             7,208
    Memory [%]                 %                 4.87
    DRAM Throughput            %                 1.55
    Duration                   usecond           5.70
    L1/TEX Cache Throughput    %                 23.75
    L2 Cache Throughput        %                 4.87
    SM Active Cycles           cycle             950.68
    Compute (SM) [%]           %                 2.71
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    16
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            2,048
    Waves Per SM                                 0.12
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.28
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:40, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.10
    SM Frequency               cycle/nsecond     1.26
    Elapsed Cycles             cycle             8,427
    Memory [%]                 %                 3.41
    DRAM Throughput            %                 1.89
    Duration                   usecond           6.69
    L1/TEX Cache Throughput    %                 24.96
    L2 Cache Throughput        %                 3.41
    SM Active Cycles           cycle             585.49
    Compute (SM) [%]           %                 1.47
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
       If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.30
    Achieved Active Warps Per SM       warp      3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:40, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.94
    SM Frequency               cycle/nsecond     1.39
    Elapsed Cycles             cycle             4,983
    Memory [%]                 %                 9.01
    DRAM Throughput            %                 9.01
    Duration                   usecond           3.58
    L1/TEX Cache Throughput    %                 10.35
    L2 Cache Throughput        %                 7.99
    SM Active Cycles           cycle             1,381.81
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         16.01
    Achieved Active Warps Per SM       warp      7.69
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:40, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.28
    SM Frequency               cycle/nsecond     1.29
    Elapsed Cycles             cycle             8,378
    Memory [%]                 %                 3.44
    DRAM Throughput            %                 1.91
    Duration                   usecond           6.50
    L1/TEX Cache Throughput    %                 25.22
    L2 Cache Throughput        %                 3.44
    SM Active Cycles           cycle             580.43
    Compute (SM) [%]           %                 1.47
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.36
    Achieved Active Warps Per SM       warp      4.01
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:40, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.70
    SM Frequency               cycle/nsecond     1.35
    Elapsed Cycles             cycle             4,937
    Memory [%]                 %                 9.09
    DRAM Throughput            %                 9.09
    Duration                   usecond           3.65
    L1/TEX Cache Throughput    %                 10.62
    L2 Cache Throughput        %                 8.06
    SM Active Cycles           cycle             1,347.47
    Compute (SM) [%]           %                 2.98
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         15.83
    Achieved Active Warps Per SM       warp      7.60
  WRN  This kernel's theoretical occupancy is not impacted by any block limit.
       The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:41, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.23
    SM Frequency               cycle/nsecond     1.28
    Elapsed Cycles             cycle             8,458
    Memory [%]                 %                 3.40
    DRAM Throughput            %                 1.89
    Duration                   usecond           6.59
    L1/TEX Cache Throughput    %                 24.90
    L2 Cache Throughput        %                 3.40
    SM Active Cycles           cycle             584.41
    Compute (SM) [%]           %                 1.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   128
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    8
    Registers Per Thread       register/thread   96
    Shared Memory Configuration Size   Kbyte     102.40
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    Kbyte/block   49.15
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            1,024
    Waves Per SM                                 0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     5
    Block Limit Shared Mem             block     2
    Block Limit Warps                  block     12
    Theoretical Active Warps per SM    warp      8
    Theoretical Occupancy              %         16.67
    Achieved Occupancy                 %         8.38
    Achieved Active Warps Per SM       warp      4.02
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:41, Context 1, Stream 26

  Section: GPU Speed Of Light Throughput
    DRAM Frequency             cycle/nsecond     8.81
    SM Frequency               cycle/nsecond     1.38
    Elapsed Cycles             cycle             4,983
    Memory [%]                 %                 9.23
    DRAM Throughput            %                 9.23
    Duration                   usecond           3.62
    L1/TEX Cache Throughput    %                 9.33
    L2 Cache Throughput        %                 8.06
    SM Active Cycles           cycle             1,533.21
    Compute (SM) [%]           %                 2.95
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
    Block Size                                   256
    Function Cache Configuration                 cudaFuncCachePreferNone
    Grid Size                                    32
    Registers Per Thread       register/thread   30
    Shared Memory Configuration Size   Kbyte     8.19
    Driver Shared Memory Per Block     Kbyte/block   1.02
    Dynamic Shared Memory Per Block    byte/block    0
    Static Shared Memory Per Block     byte/block    0
    Threads                    thread            8,192
    Waves Per SM                                 0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
    Block Limit SM                     block     16
    Block Limit Registers              block     8
    Block Limit Shared Mem             block     8
    Block Limit Warps                  block     6
    Theoretical Active Warps per SM    warp      48
    Theoretical Occupancy              %         100
    Achieved Occupancy                 %         14.30
    Achieved Active Warps Per SM       warp      6.86
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (14.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:41, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.16
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,327
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.53
    L1/TEX Cache Throughput             %                    25.23
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                578.72
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.33
    Achieved Active Warps Per SM        warp                 4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:42, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.88
    SM Frequency                        cycle/nsecond        1.38
    Elapsed Cycles                      cycle                4,924
    Memory [%]                          %                    9.16
    DRAM Throughput                     %                    9.16
    Duration                            usecond              3.55
    L1/TEX Cache Throughput             %                    10.24
    L2 Cache Throughput                 %                    8.11
    SM Active Cycles                    cycle                1,397.03
    Compute (SM) [%]                    %                    2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.36
    Achieved Active Warps Per SM        warp                 7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
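  NOTE  The profiled script itself is not shown in this log. As a hypothetical stand-in, a cuDNN-backed GRU forward pass in PyTorch like the sketch below may dispatch to the kind of kernels reported here (GEMMs such as cutlass::Kernel plus the pointwise GRU_elementWise_fp kernel per time step); the sizes are illustrative only.

    # Hypothetical workload sketch; not the actual profiled script.
    import torch

    gru = torch.nn.GRU(input_size=256, hidden_size=256, batch_first=True).cuda().half()
    x = torch.randn(32, 16, 256, device="cuda", dtype=torch.float16)   # (batch, seq, features), illustrative
    with torch.no_grad():
        y, h = gru(x)          # cuDNN GRU forward: GEMMs + per-step elementwise kernels
    torch.cuda.synchronize()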
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:42, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.18
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                7,114
    Memory [%]                          %                    4.93
    DRAM Throughput                     %                    1.58
    Duration                            usecond              5.57
    L1/TEX Cache Throughput             %                    23.82
    L2 Cache Throughput                 %                    4.93
    SM Active Cycles                    cycle                944.28
    Compute (SM) [%]                    %                    2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.32
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:42, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.11
    SM Frequency                        cycle/nsecond        1.26
    Elapsed Cycles                      cycle                7,207
    Memory [%]                          %                    4.87
    DRAM Throughput                     %                    1.55
    Duration                            usecond              5.70
    L1/TEX Cache Throughput             %                    23.61
    L2 Cache Throughput                 %                    4.87
    SM Active Cycles                    cycle                958.75
    Compute (SM) [%]                    %                    2.71
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.43
    Achieved Active Warps Per SM        warp                 4.05
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:43, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.05
    SM Frequency                        cycle/nsecond        1.25
    Elapsed Cycles                      cycle                8,306
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.62
    L1/TEX Cache Throughput             %                    25.03
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                584.66
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.28
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:43, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.84
    SM Frequency                        cycle/nsecond        1.37
    Elapsed Cycles                      cycle                5,015
    Memory [%]                          %                    8.95
    DRAM Throughput                     %                    8.95
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.58
    L2 Cache Throughput                 %                    7.94
    SM Active Cycles                    cycle                1,352.43
    Compute (SM) [%]                    %                    2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.62
    Achieved Active Warps Per SM        warp                 7.98
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:43, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.13
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,323
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.92
    Duration                            usecond              6.56
    L1/TEX Cache Throughput             %                    25.18
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                579.53
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.34
    Achieved Active Warps Per SM        warp                 4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:43, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.71
    SM Frequency                        cycle/nsecond        1.36
    Elapsed Cycles                      cycle                4,877
    Memory [%]                          %                    9.24
    DRAM Throughput                     %                    9.24
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    10.90
    L2 Cache Throughput                 %                    8.16
    SM Active Cycles                    cycle                1,312.97
    Compute (SM) [%]                    %                    3.02
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.36
    Achieved Active Warps Per SM        warp                 7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:44, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.14
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,375
    Memory [%]                          %                    3.44
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.59
    L1/TEX Cache Throughput             %                    25.01
    L2 Cache Throughput                 %                    3.44
    SM Active Cycles                    cycle                585.60
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.29
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:44, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.57
    SM Frequency                        cycle/nsecond        1.49
    Elapsed Cycles                      cycle                5,389
    Memory [%]                          %                    8.35
    DRAM Throughput                     %                    8.35
    Duration                            usecond              3.62
    L1/TEX Cache Throughput             %                    10.67
    L2 Cache Throughput                 %                    7.38
    SM Active Cycles                    cycle                1,341.12
    Compute (SM) [%]                    %                    2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.23
    Achieved Active Warps Per SM        warp                 7.79
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:44, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.02
    SM Frequency                        cycle/nsecond        1.25
    Elapsed Cycles                      cycle                8,346
    Memory [%]                          %                    3.45
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.69
    L1/TEX Cache Throughput             %                    25.27
    L2 Cache Throughput                 %                    3.45
    SM Active Cycles                    cycle                579.63
    Compute (SM) [%]                    %                    1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.30
    Achieved Active Warps Per SM        warp                 3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:44, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        9.05
    SM Frequency                        cycle/nsecond        1.41
    Elapsed Cycles                      cycle                4,964
    Memory [%]                          %                    9.07
    DRAM Throughput                     %                    9.07
    Duration                            usecond              3.52
    L1/TEX Cache Throughput             %                    10.42
    L2 Cache Throughput                 %                    8.03
    SM Active Cycles                    cycle                1,373.07
    Compute (SM) [%]                    %                    2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    15.75
    Achieved Active Warps Per SM        warp                 7.56
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
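  NOTE  A quick cross-check (illustrative only): the Duration these reports show is consistent with Elapsed Cycles divided by the reported SM Frequency, e.g. for the GRU launch directly above.

    # Duration ~= Elapsed Cycles / SM Frequency, using the values from the report above.
    elapsed_cycles = 4_964
    sm_freq_cycles_per_ns = 1.41
    duration_ns = elapsed_cycles / sm_freq_cycles_per_ns   # ~3520 ns
    print(round(duration_ns / 1_000, 2), "usecond")        # ~3.52, matching the reported Duration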
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.09
    SM Frequency                        cycle/nsecond        1.26
    Elapsed Cycles                      cycle                7,127
    Memory [%]                          %                    4.98
    DRAM Throughput                     %                    1.57
    Duration                            usecond              5.63
    L1/TEX Cache Throughput             %                    24.00
    L2 Cache Throughput                 %                    4.98
    SM Active Cycles                    cycle                944.68
    Compute (SM) [%]                    %                    2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.24
    Achieved Active Warps Per SM        warp                 3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.25
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                7,224
    Memory [%]                          %                    4.85
    DRAM Throughput                     %                    1.55
    Duration                            usecond              5.63
    L1/TEX Cache Throughput             %                    23.91
    L2 Cache Throughput                 %                    4.85
    SM Active Cycles                    cycle                946.29
    Compute (SM) [%]                    %                    2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                16
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               2,048
    Waves Per SM                                             0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.35
    Achieved Active Warps Per SM        warp                 4.01
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:45, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.19
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,434
    Memory [%]                          %                    3.45
    DRAM Throughput                     %                    1.89
    Duration                            usecond              6.62
    L1/TEX Cache Throughput             %                    24.43
    L2 Cache Throughput                 %                    3.45
    SM Active Cycles                    cycle                598.07
    Compute (SM) [%]                    %                    1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.10
    Achieved Active Warps Per SM        warp                 3.89
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:46, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.94
    SM Frequency                        cycle/nsecond        1.39
    Elapsed Cycles                      cycle                4,971
    Memory [%]                          %                    9.01
    DRAM Throughput                     %                    9.01
    Duration                            usecond              3.58
    L1/TEX Cache Throughput             %                    10.40
    L2 Cache Throughput                 %                    8.01
    SM Active Cycles                    cycle                1,375.13
    Compute (SM) [%]                    %                    2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.11
    Achieved Active Warps Per SM        warp                 7.73
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:46, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.18
    SM Frequency                        cycle/nsecond        1.27
    Elapsed Cycles                      cycle                8,317
    Memory [%]                          %                    3.46
    DRAM Throughput                     %                    1.91
    Duration                            usecond              6.56
    L1/TEX Cache Throughput             %                    25.34
    L2 Cache Throughput                 %                    3.46
    SM Active Cycles                    cycle                577.41
    Compute (SM) [%]                    %                    1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.31
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:46, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.81
    SM Frequency                        cycle/nsecond        1.37
    Elapsed Cycles                      cycle                4,989
    Memory [%]                          %                    8.98
    DRAM Throughput                     %                    8.98
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.39
    L2 Cache Throughput                 %                    7.97
    SM Active Cycles                    cycle                1,376.72
    Compute (SM) [%]                    %                    2.95
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    15.97
    Achieved Active Warps Per SM        warp                 7.67
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:46, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.19
    SM Frequency                        cycle/nsecond        1.28
    Elapsed Cycles                      cycle                8,411
    Memory [%]                          %                    3.42
    DRAM Throughput                     %                    1.90
    Duration                            usecond              6.59
    L1/TEX Cache Throughput             %                    25.17
    L2 Cache Throughput                 %                    3.42
    SM Active Cycles                    cycle                582.81
    Compute (SM) [%]                    %                    1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               128
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                8
    Registers Per Thread                register/thread      96
    Shared Memory Configuration Size    Kbyte                102.40
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     Kbyte/block          49.15
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               1,024
    Waves Per SM                                             0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                5
    Block Limit Shared Mem              block                2
    Block Limit Warps                   block                12
    Theoretical Active Warps per SM     warp                 8
    Theoretical Occupancy               %                    16.67
    Achieved Occupancy                  %                    8.32
    Achieved Active Warps Per SM        warp                 3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:47, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                      cycle/nsecond        8.70
    SM Frequency                        cycle/nsecond        1.35
    Elapsed Cycles                      cycle                4,925
    Memory [%]                          %                    9.09
    DRAM Throughput                     %                    9.09
    Duration                            usecond              3.65
    L1/TEX Cache Throughput             %                    10.54
    L2 Cache Throughput                 %                    8.08
    SM Active Cycles                    cycle                1,357.28
    Compute (SM) [%]                    %                    2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                               256
    Function Cache Configuration                             cudaFuncCachePreferNone
    Grid Size                                                32
    Registers Per Thread                register/thread      30
    Shared Memory Configuration Size    Kbyte                8.19
    Driver Shared Memory Per Block      Kbyte/block          1.02
    Dynamic Shared Memory Per Block     byte/block           0
    Static Shared Memory Per Block      byte/block           0
    Threads                             thread               8,192
    Waves Per SM                                             0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                      block                16
    Block Limit Registers               block                8
    Block Limit Shared Mem              block                8
    Block Limit Warps                   block                6
    Theoretical Active Warps per SM     warp                 48
    Theoretical Occupancy               %                    100
    Achieved Occupancy                  %                    16.15
    Achieved Active Warps Per SM        warp                 7.75
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
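  NOTE  Sketch relating the two achieved-occupancy metrics in these reports: achieved occupancy is achieved active warps per SM divided by the assumed 48-warp maximum, shown here for the GRU launch above.

    # Achieved Occupancy [%] ~= Achieved Active Warps Per SM / max warps per SM.
    WARPS_PER_SM_MAX = 48                      # assumed device limit, as elsewhere in these notes
    achieved_active_warps = 7.75               # from the report above
    achieved_occupancy = 100.0 * achieved_active_warps / WARPS_PER_SM_MAX
    print(round(achieved_occupancy, 2))        # ~16.15 %, matching the reported value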
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:47, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,337
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  23.63
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  620.59
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  7.73
    Achieved Active Warps Per SM  warp  3.71
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:47, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.96
    SM Frequency  cycle/nsecond  1.39
    Elapsed Cycles  cycle  4,913
    Memory [%]  %  9.16
    DRAM Throughput  %  9.16
    Duration  usecond  3.52
    L1/TEX Cache Throughput  %  10.75
    L2 Cache Throughput  %  8.04
    SM Active Cycles  cycle  1,330.78
    Compute (SM) [%]  %  3.00
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.43
    Achieved Active Warps Per SM  warp  7.89
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:47, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.17
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  7,130
    Memory [%]  %  4.91
    DRAM Throughput  %  1.57
    Duration  usecond  5.60
    L1/TEX Cache Throughput  %  24.06
    L2 Cache Throughput  %  4.91
    SM Active Cycles  cycle  944.53
    Compute (SM) [%]  %  2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.27
    Achieved Active Warps Per SM  warp  3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:48, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.23
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  7,252
    Memory [%]  %  4.84
    DRAM Throughput  %  1.54
    Duration  usecond  5.66
    L1/TEX Cache Throughput  %  23.71
    L2 Cache Throughput  %  4.84
    SM Active Cycles  cycle  952.57
    Compute (SM) [%]  %  2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:48, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.13
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  8,421
    Memory [%]  %  3.42
    DRAM Throughput  %  1.90
    Duration  usecond  6.62
    L1/TEX Cache Throughput  %  25.11
    L2 Cache Throughput  %  3.42
    SM Active Cycles  cycle  583.68
    Compute (SM) [%]  %  1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.32
    Achieved Active Warps Per SM  warp  3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:48, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.66
    SM Frequency  cycle/nsecond  1.50
    Elapsed Cycles  cycle  5,485
    Memory [%]  %  8.19
    DRAM Throughput  %  8.19
    Duration  usecond  3.65
    L1/TEX Cache Throughput  %  10.33
    L2 Cache Throughput  %  7.23
    SM Active Cycles  cycle  1,384.63
    Compute (SM) [%]  %  2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.65
    Achieved Active Warps Per SM  warp  7.99
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:49, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,345
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  25.33
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  578.88
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.29
    Achieved Active Warps Per SM  warp  3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:49, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.68
    SM Frequency  cycle/nsecond  1.34
    Elapsed Cycles  cycle  5,071
    Memory [%]  %  8.81
    DRAM Throughput  %  8.81
    Duration  usecond  3.78
    L1/TEX Cache Throughput  %  9.80
    L2 Cache Throughput  %  7.81
    SM Active Cycles  cycle  1,459.62
    Compute (SM) [%]  %  2.90
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  15.85
    Achieved Active Warps Per SM  warp  7.61
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:49, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.25
    SM Frequency  cycle/nsecond  1.29
    Elapsed Cycles  cycle  8,479
    Memory [%]  %  3.40
    DRAM Throughput  %  1.89
    Duration  usecond  6.59
    L1/TEX Cache Throughput  %  24.98
    L2 Cache Throughput  %  3.40
    SM Active Cycles  cycle  588.66
    Compute (SM) [%]  %  1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.74
    Achieved Active Warps Per SM  warp  4.20
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
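Note: every launch in this part of the trace runs 8-32 blocks on a 68-SM device, which is what triggers the repeated "grid too small" warning (for the 8-block cutlass launches, Waves Per SM 0.06 is roughly 8 / (68 SMs x 2 blocks per SM)). The launch-statistics warning suggests smaller blocks or a larger grid; for code under your control that choice can be automated. A minimal sketch under the assumption of a simple bounds-checked elementwise kernel (scale_kernel is hypothetical, not one of the kernels above):

    #include <cuda_runtime.h>

    __global__ void scale_kernel(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    void launch_scale(float* d_x, int n, float a, cudaStream_t stream) {
        int dev = 0, numSMs = 0;
        cudaGetDevice(&dev);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);

        // Start from a typical block size, then shrink it (down to one warp) until
        // the grid has at least one block per SM, when the problem is large enough.
        // This is the remedy the warning above proposes.
        int blockSize = 256;
        while (blockSize > 32 && (n + blockSize - 1) / blockSize < numSMs)
            blockSize /= 2;

        int grid = (n + blockSize - 1) / blockSize;
        scale_kernel<<<grid, blockSize, 0, stream>>>(d_x, n, a);
    }

For library kernels such as the cutlass GEMMs and cuDNN GRU steps above, the launch shape is chosen internally; the practical levers are usually larger batch or hidden sizes, or overlapping the small kernels across streams.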
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:49, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.71
    SM Frequency  cycle/nsecond  1.51
    Elapsed Cycles  cycle  5,522
    Memory [%]  %  8.15
    DRAM Throughput  %  8.15
    Duration  usecond  3.65
    L1/TEX Cache Throughput  %  10.60
    L2 Cache Throughput  %  7.21
    SM Active Cycles  cycle  1,349.65
    Compute (SM) [%]  %  2.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.29
    Achieved Active Warps Per SM  warp  7.82
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:50, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  7.96
    SM Frequency  cycle/nsecond  1.24
    Elapsed Cycles  cycle  8,301
    Memory [%]  %  3.47
    DRAM Throughput  %  1.93
    Duration  usecond  6.69
    L1/TEX Cache Throughput  %  25.48
    L2 Cache Throughput  %  3.47
    SM Active Cycles  cycle  575.99
    Compute (SM) [%]  %  1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.39
    Achieved Active Warps Per SM  warp  4.02
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:50, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.29
    SM Frequency  cycle/nsecond  1.44
    Elapsed Cycles  cycle  5,225
    Memory [%]  %  8.60
    DRAM Throughput  %  8.60
    Duration  usecond  3.62
    L1/TEX Cache Throughput  %  10.72
    L2 Cache Throughput  %  7.61
    SM Active Cycles  cycle  1,334.28
    Compute (SM) [%]  %  2.81
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.36
    Achieved Active Warps Per SM  warp  7.85
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
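Note: for these GRU_elementWise_fp launches, the gap between 100% theoretical and roughly 16% achieved occupancy is consistent with the launch shape rather than with warp-scheduling overhead alone. Achieved Active Warps Per SM stays near 8, i.e. one 256-thread block per SM, which is the most a 32-block grid can place on a 68-SM device during a kernel that lasts only a few microseconds.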
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:50, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.09
    SM Frequency  cycle/nsecond  1.26
    Elapsed Cycles  cycle  7,107
    Memory [%]  %  4.94
    DRAM Throughput  %  1.58
    Duration  usecond  5.63
    L1/TEX Cache Throughput  %  24.14
    L2 Cache Throughput  %  4.94
    SM Active Cycles  cycle  935.93
    Compute (SM) [%]  %  2.75
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.30
    Achieved Active Warps Per SM  warp  3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:51, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.19
    SM Frequency  cycle/nsecond  1.27
    Elapsed Cycles  cycle  7,225
    Memory [%]  %  4.86
    DRAM Throughput  %  1.55
    Duration  usecond  5.66
    L1/TEX Cache Throughput  %  23.86
    L2 Cache Throughput  %  4.86
    SM Active Cycles  cycle  951.41
    Compute (SM) [%]  %  2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  16
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  2,048
    Waves Per SM  0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:51, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.22
    SM Frequency  cycle/nsecond  1.28
    Elapsed Cycles  cycle  8,364
    Memory [%]  %  3.45
    DRAM Throughput  %  1.91
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  25.07
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  584.26
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.46
    Achieved Active Warps Per SM  warp  4.06
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:51, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.86
    SM Frequency  cycle/nsecond  1.37
    Elapsed Cycles  cycle  4,925
    Memory [%]  %  9.09
    DRAM Throughput  %  9.09
    Duration  usecond  3.58
    L1/TEX Cache Throughput  %  10.59
    L2 Cache Throughput  %  8.07
    SM Active Cycles  cycle  1,351.25
    Compute (SM) [%]  %  2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.57
    Achieved Active Warps Per SM  warp  7.95
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:52, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.12
    SM Frequency  cycle/nsecond  1.26
    Elapsed Cycles  cycle  8,341
    Memory [%]  %  3.45
    DRAM Throughput  %  1.92
    Duration  usecond  6.59
    L1/TEX Cache Throughput  %  25.32
    L2 Cache Throughput  %  3.45
    SM Active Cycles  cycle  578.46
    Compute (SM) [%]  %  1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.33
    Achieved Active Warps Per SM  warp  4.00
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:52, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.84
    SM Frequency  cycle/nsecond  1.37
    Elapsed Cycles  cycle  4,965
    Memory [%]  %  9.03
    DRAM Throughput  %  9.03
    Duration  usecond  3.62
    L1/TEX Cache Throughput  %  10.68
    L2 Cache Throughput  %  8.01
    SM Active Cycles  cycle  1,339.46
    Compute (SM) [%]  %  2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.14
    Achieved Active Warps Per SM  warp  7.75
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:52, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  8.35
    SM Frequency  cycle/nsecond  1.30
    Elapsed Cycles  cycle  8,485
    Memory [%]  %  3.40
    DRAM Throughput  %  1.88
    Duration  usecond  6.53
    L1/TEX Cache Throughput  %  24.83
    L2 Cache Throughput  %  3.40
    SM Active Cycles  cycle  588.57
    Compute (SM) [%]  %  1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  128
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  8
    Registers Per Thread  register/thread  96
    Shared Memory Configuration Size  Kbyte  102.40
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  Kbyte/block  49.15
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  1,024
    Waves Per SM  0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  5
    Block Limit Shared Mem  block  2
    Block Limit Warps  block  12
    Theoretical Active Warps per SM  warp  8
    Theoretical Occupancy  %  16.67
    Achieved Occupancy  %  8.48
    Achieved Active Warps Per SM  warp  4.07
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:52, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency  cycle/nsecond  9.05
    SM Frequency  cycle/nsecond  1.41
    Elapsed Cycles  cycle  5,240
    Memory [%]  %  8.59
    DRAM Throughput  %  8.59
    Duration  usecond  3.71
    L1/TEX Cache Throughput  %  10.42
    L2 Cache Throughput  %  7.60
    SM Active Cycles  cycle  1,372.66
    Compute (SM) [%]  %  2.81
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size  256
    Function Cache Configuration  cudaFuncCachePreferNone
    Grid Size  32
    Registers Per Thread  register/thread  30
    Shared Memory Configuration Size  Kbyte  8.19
    Driver Shared Memory Per Block  Kbyte/block  1.02
    Dynamic Shared Memory Per Block  byte/block  0
    Static Shared Memory Per Block  byte/block  0
    Threads  thread  8,192
    Waves Per SM  0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM  block  16
    Block Limit Registers  block  8
    Block Limit Shared Mem  block  8
    Block Limit Warps  block  6
    Theoretical Active Warps per SM  warp  48
    Theoretical Occupancy  %  100
    Achieved Occupancy  %  16.25
    Achieved Active Warps Per SM  warp  7.80
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.31 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 8,476 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.89 %
    Duration: 6.53 usecond
    L1/TEX Cache Throughput: 24.39 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 598.53 cycle
    Compute (SM) [%]: 1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.16 %
    Achieved Active Warps Per SM: 3.91 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:53, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.97 cycle/nsecond
    SM Frequency: 1.39 cycle/nsecond
    Elapsed Cycles: 4,955 cycle
    Memory [%]: 9.06
    DRAM Throughput: 9.06 %
    Duration: 3.55 usecond
    L1/TEX Cache Throughput: 10.74 %
    L2 Cache Throughput: 8.09 %
    SM Active Cycles: 1,331.51 cycle
    Compute (SM) [%]: 2.97
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.42 %
    Achieved Active Warps Per SM: 7.88 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 7,109 cycle
    Memory [%]: 4.94
    DRAM Throughput: 1.57 %
    Duration: 5.60 usecond
    L1/TEX Cache Throughput: 24.07 %
    L2 Cache Throughput: 4.94 %
    SM Active Cycles: 938.87 cycle
    Compute (SM) [%]: 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:53, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.19 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 7,249 cycle
    Memory [%]: 4.85
    DRAM Throughput: 1.55 %
    Duration: 5.66 usecond
    L1/TEX Cache Throughput: 23.40 %
    L2 Cache Throughput: 4.85 %
    SM Active Cycles: 955.46 cycle
    Compute (SM) [%]: 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:54, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.46 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 8,731 cycle
    Memory [%]: 3.34
    DRAM Throughput: 1.92 %
    Duration: 6.62 usecond
    L1/TEX Cache Throughput: 24.69 %
    L2 Cache Throughput: 3.34 %
    SM Active Cycles: 591.66 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.15 %
    Achieved Active Warps Per SM: 3.91 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:54, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.91 cycle/nsecond
    SM Frequency: 1.54 cycle/nsecond
    Elapsed Cycles: 5,538 cycle
    Memory [%]: 8.12
    DRAM Throughput: 8.12 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.62 %
    L2 Cache Throughput: 7.20 %
    SM Active Cycles: 1,346.88 cycle
    Compute (SM) [%]: 2.66
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.46 %
    Achieved Active Warps Per SM: 7.90 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:54, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.59 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 8,690 cycle
    Memory [%]: 3.31
    DRAM Throughput: 1.84 %
    Duration: 6.50 usecond
    L1/TEX Cache Throughput: 24.52 %
    L2 Cache Throughput: 3.31 %
    SM Active Cycles: 596.35 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.26 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:55, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.40 cycle/nsecond
    SM Frequency: 1.47 cycle/nsecond
    Elapsed Cycles: 5,360 cycle
    Memory [%]: 8.42
    DRAM Throughput: 8.42 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.58 %
    L2 Cache Throughput: 7.44 %
    SM Active Cycles: 1,352.68 cycle
    Compute (SM) [%]: 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.92 %
    Achieved Active Warps Per SM: 7.64 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:55, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 8,388 cycle
    Memory [%]: 3.43
    DRAM Throughput: 1.91 %
    Duration: 6.59 usecond
    L1/TEX Cache Throughput: 25.10 %
    L2 Cache Throughput: 3.43 %
    SM Active Cycles: 585.76 cycle
    Compute (SM) [%]: 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.28 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:55, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.68 cycle/nsecond
    SM Frequency: 1.50 cycle/nsecond
    Elapsed Cycles: 5,480 cycle
    Memory [%]: 8.31
    DRAM Throughput: 8.31 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.52 %
    L2 Cache Throughput: 7.26 %
    SM Active Cycles: 1,359.32 cycle
    Compute (SM) [%]: 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:55, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.67 cycle/nsecond
    SM Frequency: 1.35 cycle/nsecond
    Elapsed Cycles: 8,717 cycle
    Memory [%]: 3.30
    DRAM Throughput: 1.83 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 24.98 %
    L2 Cache Throughput: 3.30 %
    SM Active Cycles: 585.26 cycle
    Compute (SM) [%]: 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.28 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:56, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.99 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 4,926 cycle
    Memory [%]: 9.13
    DRAM Throughput: 9.13 %
    Duration: 3.52 usecond
    L1/TEX Cache Throughput: 10.75 %
    L2 Cache Throughput: 8.07 %
    SM Active Cycles: 1,331.12 cycle
    Compute (SM) [%]: 2.99
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 18.08 %
    Achieved Active Warps Per SM: 8.68 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (18.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:56, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.16 cycle/nsecond
    SM Frequency: 1.27 cycle/nsecond
    Elapsed Cycles: 7,143 cycle
    Memory [%]: 4.91
    DRAM Throughput: 1.56 %
    Duration: 5.63 usecond
    L1/TEX Cache Throughput: 24.06 %
    L2 Cache Throughput: 4.91 %
    SM Active Cycles: 941.01 cycle
    Compute (SM) [%]: 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.27 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:56, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.23 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 7,268 cycle
    Memory [%]: 4.84
    DRAM Throughput: 1.54 %
    Duration: 5.66 usecond
    L1/TEX Cache Throughput: 23.81 %
    L2 Cache Throughput: 4.84 %
    SM Active Cycles: 949.31 cycle
    Compute (SM) [%]: 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.30 %
    Achieved Active Warps Per SM: 3.98 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:57, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.20 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 8,385 cycle
    Memory [%]: 3.43
    DRAM Throughput: 1.91 %
    Duration: 6.56 usecond
    L1/TEX Cache Throughput: 24.81 %
    L2 Cache Throughput: 3.43 %
    SM Active Cycles: 591.34 cycle
    Compute (SM) [%]: 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors.
         If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.25 %
    Achieved Active Warps Per SM: 3.96 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:57, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.46 cycle/nsecond
    SM Frequency: 1.47 cycle/nsecond
    Elapsed Cycles: 5,375 cycle
    Memory [%]: 8.37
    DRAM Throughput: 8.37 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.43 %
    L2 Cache Throughput: 7.38 %
    SM Active Cycles: 1,372.13 cycle
    Compute (SM) [%]: 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.28 %
    Achieved Active Warps Per SM: 7.81 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:57, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.12 cycle/nsecond
    SM Frequency: 1.26 cycle/nsecond
    Elapsed Cycles: 8,358 cycle
    Memory [%]: 3.44
    DRAM Throughput: 1.91 %
    Duration: 6.62 usecond
    L1/TEX Cache Throughput: 25.34 %
    L2 Cache Throughput: 3.44 %
    SM Active Cycles: 578.15 cycle
    Compute (SM) [%]: 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.31 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:57, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.11 cycle/nsecond
    SM Frequency: 1.42 cycle/nsecond
    Elapsed Cycles: 5,085 cycle
    Memory [%]: 8.84
    DRAM Throughput: 8.84 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.78 %
    L2 Cache Throughput: 7.82 %
    SM Active Cycles: 1,327.40 cycle
    Compute (SM) [%]: 2.89
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.71 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit.
         The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:58, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.27 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 8,462 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.89 %
    Duration: 6.56 usecond
    L1/TEX Cache Throughput: 23.48 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 622.18 cycle
    Compute (SM) [%]: 1.46
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 96 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 7.79 %
    Achieved Active Warps Per SM: 3.74 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:58, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.69 cycle/nsecond
    SM Frequency: 1.35 cycle/nsecond
    Elapsed Cycles: 5,020 cycle
    Memory [%]: 8.95
    DRAM Throughput: 8.95 %
    Duration: 3.71 usecond
    L1/TEX Cache Throughput: 10.47 %
    L2 Cache Throughput: 7.95 %
    SM Active Cycles: 1,366.59 cycle
    Compute (SM) [%]: 2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:58, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.19
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,308
    Memory [%]                         %                 3.46
    DRAM Throughput                    %                 1.92
    Duration                           usecond           6.53
    L1/TEX Cache Throughput            %                 25.23
    L2 Cache Throughput                %                 3.46
    SM Active Cycles                   cycle             579.07
    Compute (SM) [%]                   %                 1.49
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.39
    Achieved Active Warps Per SM       warp              4.03
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
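The Duration and Elapsed Cycles rows in these reports are consistent with each other through the SM frequency. A small check, using the cutlass report above; the displayed frequency is rounded to two decimals, so the reconstruction is only approximate:

# Duration is roughly Elapsed Cycles / SM Frequency (sketch; rounded inputs).
elapsed_cycles = 8_308
sm_freq_cycles_per_ns = 1.27
duration_us = elapsed_cycles / sm_freq_cycles_per_ns / 1_000
print(round(duration_us, 2))   # 6.54, vs. 6.53 usecond reported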
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:56:58, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.80
    SM Frequency                       cycle/nsecond     1.53
    Elapsed Cycles                     cycle             5,485
    Memory [%]                         %                 8.22
    DRAM Throughput                    %                 8.22
    Duration                           usecond           3.58
    L1/TEX Cache Throughput            %                 10.29
    L2 Cache Throughput                %                 7.23
    SM Active Cycles                   cycle             1,390.19
    Compute (SM) [%]                   %                 2.68
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.59
    Achieved Active Warps Per SM       warp              7.48
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.17
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             7,104
    Memory [%]                         %                 4.94
    DRAM Throughput                    %                 1.57
    Duration                           usecond           5.60
    L1/TEX Cache Throughput            %                 23.87
    L2 Cache Throughput                %                 4.94
    SM Active Cycles                   cycle             945.35
    Compute (SM) [%]                   %                 2.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.27
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.23
    SM Frequency                       cycle/nsecond     1.28
    Elapsed Cycles                     cycle             7,260
    Memory [%]                         %                 4.84
    DRAM Throughput                    %                 1.54
    Duration                           usecond           5.66
    L1/TEX Cache Throughput            %                 23.53
    L2 Cache Throughput                %                 4.84
    SM Active Cycles                   cycle             959.60
    Compute (SM) [%]                   %                 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.30
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:56:59, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.13
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,410
    Memory [%]                         %                 3.42
    DRAM Throughput                    %                 1.90
    Duration                           usecond           6.62
    L1/TEX Cache Throughput            %                 24.92
    L2 Cache Throughput                %                 3.42
    SM Active Cycles                   cycle             587.07
    Compute (SM) [%]                   %                 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.24
    Achieved Active Warps Per SM       warp              3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:00, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.05
    SM Frequency                       cycle/nsecond     1.41
    Elapsed Cycles                     cycle             5,015
    Memory [%]                         %                 8.99
    DRAM Throughput                    %                 8.99
    Duration                           usecond           3.55
    L1/TEX Cache Throughput            %                 10.33
    L2 Cache Throughput                %                 7.94
    SM Active Cycles                   cycle             1,385.03
    Compute (SM) [%]                   %                 2.93
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.75
    Achieved Active Warps Per SM       warp              7.56
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:00, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.55
    SM Frequency                       cycle/nsecond     1.33
    Elapsed Cycles                     cycle             8,729
    Memory [%]                         %                 3.30
    DRAM Throughput                    %                 1.82
    Duration                           usecond           6.56
    L1/TEX Cache Throughput            %                 25.14
    L2 Cache Throughput                %                 3.30
    SM Active Cycles                   cycle             582.37
    Compute (SM) [%]                   %                 1.42
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:00, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.68
    SM Frequency                       cycle/nsecond     1.35
    Elapsed Cycles                     cycle             4,965
    Memory [%]                         %                 9.04
    DRAM Throughput                    %                 9.04
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.43
    L2 Cache Throughput                %                 7.99
    SM Active Cycles                   cycle             1,371.74
    Compute (SM) [%]                   %                 2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.91
    Achieved Active Warps Per SM       warp              7.64
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:01, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.16
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,371
    Memory [%]                         %                 3.44
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.59
    L1/TEX Cache Throughput            %                 24.85
    L2 Cache Throughput                %                 3.44
    SM Active Cycles                   cycle             588.25
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.23
    Achieved Active Warps Per SM       warp              3.95
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
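Because the same two kernels are profiled over and over in this capture, it can help to pull just the headline numbers out of the report text. A small sketch that scans the reflowed text above and collects Duration and Achieved Occupancy per kernel; the file name report.txt is hypothetical, and the regexes are tied to the layout used here (a kernel header line followed by "metric  unit  value" rows):

# Aggregate Duration and Achieved Occupancy per kernel from the reflowed report text (sketch).
import re
from collections import defaultdict

header = re.compile(r"^void (\S+?)[(<].*Stream (\d+)\s*$")
metric = re.compile(r"^\s+(Duration|Achieved Occupancy)\s+\S+\s+([\d.,]+)\s*$")

launches = defaultdict(list)
current = None
with open("report.txt") as f:          # hypothetical file containing the text above
    for line in f:
        m = header.match(line)
        if m:
            current = f"{m.group(1)} (stream {m.group(2)})"
            launches[current].append({})
            continue
        m = metric.match(line)
        if m and current:
            launches[current][-1][m.group(1)] = float(m.group(2).replace(",", ""))

for kernel, runs in launches.items():
    durations = [r["Duration"] for r in runs if "Duration" in r]
    print(kernel, len(runs), "launches,",
          round(sum(durations) / len(durations), 2), "usecond avg duration")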
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:01, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.41
    SM Frequency                       cycle/nsecond     1.47
    Elapsed Cycles                     cycle             5,447
    Memory [%]                         %                 8.27
    DRAM Throughput                    %                 8.27
    Duration                           usecond           3.71
    L1/TEX Cache Throughput            %                 10.55
    L2 Cache Throughput                %                 7.31
    SM Active Cycles                   cycle             1,356.51
    Compute (SM) [%]                   %                 2.70
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.16
    Achieved Active Warps Per SM       warp              7.76
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:01, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.03
    SM Frequency                       cycle/nsecond     1.25
    Elapsed Cycles                     cycle             8,328
    Memory [%]                         %                 3.45
    DRAM Throughput                    %                 1.92
    Duration                           usecond           6.66
    L1/TEX Cache Throughput            %                 25.25
    L2 Cache Throughput                %                 3.45
    SM Active Cycles                   cycle             578.56
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.30
    Achieved Active Warps Per SM       warp              3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:01, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     9.14
    SM Frequency                       cycle/nsecond     1.41
    Elapsed Cycles                     cycle             5,076
    Memory [%]                         %                 8.81
    DRAM Throughput                    %                 8.81
    Duration                           usecond           3.58
    L1/TEX Cache Throughput            %                 10.43
    L2 Cache Throughput                %                 7.82
    SM Active Cycles                   cycle             1,372.21
    Compute (SM) [%]                   %                 2.90
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 15.84
    Achieved Active Warps Per SM       warp              7.60
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
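The Achieved Occupancy row is just Achieved Active Warps Per SM divided by the 48-warp SM capacity. Both columns are rounded to two decimals in the report, so recomputing one from the other gives values that are close but not identical:

# Achieved Occupancy = Achieved Active Warps Per SM / 48 (sketch; inputs are rounded).
for warps, reported in [(7.60, 15.84), (7.70, 16.05), (3.99, 8.30)]:
    print(round(100 * warps / 48, 2), "vs reported", reported)
# 15.83 vs 15.84, 16.04 vs 16.05, 8.31 vs 8.3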
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 25
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.15
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             7,145
    Memory [%]                         %                 4.92
    DRAM Throughput                    %                 1.57
    Duration                           usecond           5.63
    L1/TEX Cache Throughput            %                 23.93
    L2 Cache Throughput                %                 4.92
    SM Active Cycles                   cycle             943.35
    Compute (SM) [%]                   %                 2.73
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.98
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 27
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.13
    SM Frequency                       cycle/nsecond     1.26
    Elapsed Cycles                     cycle             7,237
    Memory [%]                         %                 4.85
    DRAM Throughput                    %                 1.54
    Duration                           usecond           5.73
    L1/TEX Cache Throughput            %                 23.46
    L2 Cache Throughput                %                 4.85
    SM Active Cycles                   cycle             956.47
    Compute (SM) [%]                   %                 2.69
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            16
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            2,048
    Waves Per SM                                         0.12
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.27
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:02, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.16
    SM Frequency                       cycle/nsecond     1.27
    Elapsed Cycles                     cycle             8,375
    Memory [%]                         %                 3.43
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.59
    L1/TEX Cache Throughput            %                 25.14
    L2 Cache Throughput                %                 3.43
    SM Active Cycles                   cycle             583.21
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.31
    Achieved Active Warps Per SM       warp              3.99
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:02, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.71
    SM Frequency                       cycle/nsecond     1.35
    Elapsed Cycles                     cycle             4,961
    Memory [%]                         %                 9.01
    DRAM Throughput                    %                 9.01
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.23
    L2 Cache Throughput                %                 8.02
    SM Active Cycles                   cycle             1,398
    Compute (SM) [%]                   %                 2.96
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.00
    Achieved Active Warps Per SM       warp              7.68
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:03, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.28
    SM Frequency                       cycle/nsecond     1.29
    Elapsed Cycles                     cycle             8,377
    Memory [%]                         %                 3.44
    DRAM Throughput                    %                 1.91
    Duration                           usecond           6.50
    L1/TEX Cache Throughput            %                 25.18
    L2 Cache Throughput                %                 3.44
    SM Active Cycles                   cycle             581.99
    Compute (SM) [%]                   %                 1.48
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.28
    Achieved Active Warps Per SM       warp              3.97
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:03, Context 1, Stream 28
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.86
    SM Frequency                       cycle/nsecond     1.38
    Elapsed Cycles                     cycle             5,127
    Memory [%]                         %                 8.78
    DRAM Throughput                    %                 8.78
    Duration                           usecond           3.71
    L1/TEX Cache Throughput            %                 10.00
    L2 Cache Throughput                %                 7.75
    SM Active Cycles                   cycle             1,430.40
    Compute (SM) [%]                   %                 2.87
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.00
    Achieved Active Warps Per SM       warp              7.68
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:03, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.24
    SM Frequency                       cycle/nsecond     1.28
    Elapsed Cycles                     cycle             8,423
    Memory [%]                         %                 3.42
    DRAM Throughput                    %                 1.90
    Duration                           usecond           6.56
    L1/TEX Cache Throughput            %                 25.15
    L2 Cache Throughput                %                 3.42
    SM Active Cycles                   cycle             582.43
    Compute (SM) [%]                   %                 1.47
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           128
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            8
    Registers Per Thread               register/thread   96
    Shared Memory Configuration Size   Kbyte             102.40
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    Kbyte/block       49.15
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            1,024
    Waves Per SM                                         0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             5
    Block Limit Shared Mem             block             2
    Block Limit Warps                  block             12
    Theoretical Active Warps per SM    warp              8
    Theoretical Occupancy              %                 16.67
    Achieved Occupancy                 %                 8.36
    Achieved Active Warps Per SM       warp              4.01
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:04, Context 1, Stream 26
  Section: GPU Speed Of Light Throughput
    DRAM Frequency                     cycle/nsecond     8.65
    SM Frequency                       cycle/nsecond     1.34
    Elapsed Cycles                     cycle             4,947
    Memory [%]                         %                 9.06
    DRAM Throughput                    %                 9.06
    Duration                           usecond           3.68
    L1/TEX Cache Throughput            %                 10.64
    L2 Cache Throughput                %                 8.04
    SM Active Cycles                   cycle             1,344.26
    Compute (SM) [%]                   %                 2.97
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size                                           256
    Function Cache Configuration                         cudaFuncCachePreferNone
    Grid Size                                            32
    Registers Per Thread               register/thread   30
    Shared Memory Configuration Size   Kbyte             8.19
    Driver Shared Memory Per Block     Kbyte/block       1.02
    Dynamic Shared Memory Per Block    byte/block        0
    Static Shared Memory Per Block     byte/block        0
    Threads                            thread            8,192
    Waves Per SM                                         0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM                     block             16
    Block Limit Registers              block             8
    Block Limit Shared Mem             block             8
    Block Limit Warps                  block             6
    Theoretical Active Warps per SM    warp              48
    Theoretical Occupancy              %                 100
    Achieved Occupancy                 %                 16.33
    Achieved Active Warps Per SM       warp              7.84
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:04, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.21 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,353 Memory [%] % 3.45 DRAM Throughput % 1.92 Duration usecond 6.50 L1/TEX Cache Throughput % 25.33 L2 Cache Throughput % 3.45 SM Active Cycles cycle 579.53 Compute (SM) [%] % 1.48 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.32 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
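The Duration, Elapsed Cycles and SM Frequency columns are mutually consistent; a quick check for the cutlass launch above (values copied from the report; the frequencies are rounded, so the result only agrees to about two digits):

    # Duration ~= Elapsed Cycles / SM Frequency (cycle/nsecond is equivalent to GHz)
    elapsed_cycles = 8_353
    sm_freq_ghz    = 1.28
    duration_us    = elapsed_cycles / sm_freq_ghz / 1000.0
    print(round(duration_us, 2))   # ~6.53 us vs. the reported 6.50 us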
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:04, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.94 SM Frequency cycle/nsecond 1.39 Elapsed Cycles cycle 4,945 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.55 L1/TEX Cache Throughput % 10.70 L2 Cache Throughput % 8.04 SM Active Cycles cycle 1,336.56 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.05 Achieved Active Warps Per SM warp 7.70 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
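The hardware figures the warnings keep referring to (68 multiprocessors, plus the per-SM limits used in the sketches above) come from the device itself. Since the profiled process is a Python workload, they can be double-checked from PyTorch; a small sketch, where device index 0 is an assumption:

    import torch

    props = torch.cuda.get_device_properties(0)
    print(props.name)                          # GPU model
    print(props.multi_processor_count)         # expected to print 68 for this device
    print(props.total_memory // (1024 ** 2))   # device memory in MiB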
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:04, Context 1, Stream 25 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.19 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 7,158 Memory [%] % 4.91 DRAM Throughput % 1.57 Duration usecond 5.60 L1/TEX Cache Throughput % 23.86 L2 Cache Throughput % 4.91 SM Active Cycles cycle 947.25 Compute (SM) [%] % 2.73 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.27 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:05, Context 1, Stream 27 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.25 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 7,246 Memory [%] % 4.85 DRAM Throughput % 1.54 Duration usecond 5.63 L1/TEX Cache Throughput % 23.47 L2 Cache Throughput % 4.85 SM Active Cycles cycle 958.79 Compute (SM) [%] % 2.69 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.29 Achieved Active Warps Per SM warp 3.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:05, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.10 SM Frequency cycle/nsecond 1.26 Elapsed Cycles cycle 8,361 Memory [%] % 3.44 DRAM Throughput % 1.91 Duration usecond 6.62 L1/TEX Cache Throughput % 24.95 L2 Cache Throughput % 3.44 SM Active Cycles cycle 586.32 Compute (SM) [%] % 1.48 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. 
If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.26 Achieved Active Warps Per SM warp 3.96 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:05, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.81 SM Frequency cycle/nsecond 1.37 Elapsed Cycles cycle 4,958 Memory [%] % 9.07 DRAM Throughput % 9.07 Duration usecond 3.62 L1/TEX Cache Throughput % 10.66 L2 Cache Throughput % 8.03 SM Active Cycles cycle 1,341.53 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.32 Achieved Active Warps Per SM warp 7.83 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:06, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.38 SM Frequency cycle/nsecond 1.31 Elapsed Cycles cycle 8,599 Memory [%] % 3.35 DRAM Throughput % 1.86 Duration usecond 6.56 L1/TEX Cache Throughput % 25.03 L2 Cache Throughput % 3.35 SM Active Cycles cycle 586.69 Compute (SM) [%] % 1.44 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.35 Achieved Active Warps Per SM warp 4.01 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:06, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 10 SM Frequency cycle/nsecond 1.55 Elapsed Cycles cycle 5,566 Memory [%] % 8.05 DRAM Throughput % 8.05 Duration usecond 3.58 L1/TEX Cache Throughput % 10.78 L2 Cache Throughput % 7.15 SM Active Cycles cycle 1,326.66 Compute (SM) [%] % 2.64 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.33 Achieved Active Warps Per SM warp 7.84 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. 
The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:06, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.24 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,427 Memory [%] % 3.41 DRAM Throughput % 1.90 Duration usecond 6.56 L1/TEX Cache Throughput % 25.02 L2 Cache Throughput % 3.41 SM Active Cycles cycle 583.78 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.29 Achieved Active Warps Per SM warp 3.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
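The "0.1 full waves" warning is a direct consequence of the grid size: one full wave needs enough blocks to give every SM as many resident blocks as it can hold. A sketch for the cutlass launch above, using its shared-memory limit of 2 resident blocks per SM:

    # Blocks needed for one full wave vs. blocks actually launched.
    num_sms          = 68
    blocks_per_sm    = 2                         # cutlass kernel, limited by shared memory
    full_wave_blocks = num_sms * blocks_per_sm   # 136 blocks per wave
    launched_blocks  = 8
    print(launched_blocks / full_wave_blocks)    # ~0.06, the reported "Waves Per SM"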
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:06, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.41 SM Frequency cycle/nsecond 1.46 Elapsed Cycles cycle 5,571 Memory [%] % 8.05 DRAM Throughput % 8.05 Duration usecond 3.81 L1/TEX Cache Throughput % 10.41 L2 Cache Throughput % 7.14 SM Active Cycles cycle 1,374.63 Compute (SM) [%] % 2.64 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 15.82 Achieved Active Warps Per SM warp 7.59 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:07, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.24 SM Frequency cycle/nsecond 1.28 Elapsed Cycles cycle 8,431 Memory [%] % 3.41 DRAM Throughput % 1.90 Duration usecond 6.56 L1/TEX Cache Throughput % 24.70 L2 Cache Throughput % 3.41 SM Active Cycles cycle 593.01 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.26 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
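The "Block Limit Registers 5" entry for the cutlass kernel can be reproduced the same way, assuming a 65,536-register file per SM and ignoring allocation granularity:

    regs_per_thread   = 96
    threads_per_block = 128
    regs_per_block    = regs_per_thread * threads_per_block   # 12,288 registers
    print(65536 // regs_per_block)                            # -> 5 blocks per SM

It is not the binding limit here, though: the shared-memory limit of 2 blocks per SM bites first.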
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:07, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.78 SM Frequency cycle/nsecond 1.37 Elapsed Cycles cycle 4,948 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.62 L1/TEX Cache Throughput % 10.69 L2 Cache Throughput % 8.11 SM Active Cycles cycle 1,338.62 Compute (SM) [%] % 2.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 15.87 Achieved Active Warps Per SM warp 7.62 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:07, Context 1, Stream 25 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.08 SM Frequency cycle/nsecond 1.26 Elapsed Cycles cycle 7,126 Memory [%] % 4.93 DRAM Throughput % 1.57 Duration usecond 5.66 L1/TEX Cache Throughput % 24.01 L2 Cache Throughput % 4.93 SM Active Cycles cycle 942.69 Compute (SM) [%] % 2.74 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.28 Achieved Active Warps Per SM warp 3.97 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:08, Context 1, Stream 27 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.21 SM Frequency cycle/nsecond 1.27 Elapsed Cycles cycle 7,226 Memory [%] % 4.86 DRAM Throughput % 1.55 Duration usecond 5.66 L1/TEX Cache Throughput % 23.29 L2 Cache Throughput % 4.86 SM Active Cycles cycle 952.53 Compute (SM) [%] % 2.70 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 16 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.12 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.31 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:08, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.04 SM Frequency cycle/nsecond 1.25 Elapsed Cycles cycle 8,391 Memory [%] % 3.43 DRAM Throughput % 1.91 Duration usecond 6.69 L1/TEX Cache Throughput % 25.12 L2 Cache Throughput % 3.43 SM Active Cycles cycle 583.34 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. 
If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.31 Achieved Active Warps Per SM warp 3.99 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:08, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.08 SM Frequency cycle/nsecond 1.41 Elapsed Cycles cycle 4,980 Memory [%] % 9.04 DRAM Throughput % 9.04 Duration usecond 3.52 L1/TEX Cache Throughput % 10.55 L2 Cache Throughput % 8.03 SM Active Cycles cycle 1,355.75 Compute (SM) [%] % 2.95 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.15 Achieved Active Warps Per SM warp 7.75 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:09, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.29 SM Frequency cycle/nsecond 1.29 Elapsed Cycles cycle 8,418 Memory [%] % 3.42 DRAM Throughput % 1.90 Duration usecond 6.50 L1/TEX Cache Throughput % 25.11 L2 Cache Throughput % 3.42 SM Active Cycles cycle 582.22 Compute (SM) [%] % 1.47 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. 
Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.33 Achieved Active Warps Per SM warp 4.00 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:09, Context 1, Stream 28 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.70 SM Frequency cycle/nsecond 1.35 Elapsed Cycles cycle 4,937 Memory [%] % 9.09 DRAM Throughput % 9.09 Duration usecond 3.65 L1/TEX Cache Throughput % 10.43 L2 Cache Throughput % 8.08 SM Active Cycles cycle 1,371.62 Compute (SM) [%] % 2.98 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.16 Achieved Active Warps Per SM warp 7.76 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. 
The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:09, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.42 SM Frequency cycle/nsecond 1.31 Elapsed Cycles cycle 8,558 Memory [%] % 3.36 DRAM Throughput % 1.87 Duration usecond 6.53 L1/TEX Cache Throughput % 23.55 L2 Cache Throughput % 3.36 SM Active Cycles cycle 623.51 Compute (SM) [%] % 1.44 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 96 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 7.86 Achieved Active Warps Per SM warp 3.77 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:09, Context 1, Stream 26 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 9.12 SM Frequency cycle/nsecond 1.42 Elapsed Cycles cycle 5,175 Memory [%] % 8.67 DRAM Throughput % 8.67 Duration usecond 3.65 L1/TEX Cache Throughput % 10.53 L2 Cache Throughput % 7.67 SM Active Cycles cycle 1,358.85 Compute (SM) [%] % 2.84 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 256 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 30 Shared Memory Configuration Size Kbyte 8.19 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 8,192 Waves Per SM 0.08 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 8 Block Limit Shared Mem block 8 Block Limit Warps block 6 Theoretical Active Warps per SM warp 48 Theoretical Occupancy % 100 Achieved Occupancy % 16.26 Achieved Active Warps Per SM warp 7.81 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:10, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.19
  SM Frequency                       cycle/nsecond     1.27
  Elapsed Cycles                     cycle             8,330
  Memory [%]                         %                 3.46
  DRAM Throughput                    %                 1.92
  Duration                           usecond           6.53
  L1/TEX Cache Throughput            %                 25.21
  L2 Cache Throughput                %                 3.46
  SM Active Cycles                   cycle             580.34
  Compute (SM) [%]                   %                 1.48
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            8
  Registers Per Thread               register/thread   96
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    Kbyte/block       49.15
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            1,024
  Waves Per SM                                         0.06
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             2
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              8
  Theoretical Occupancy              %                 16.67
  Achieved Occupancy                 %                 8.30
  Achieved Active Warps Per SM       warp              3.98
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
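As a quick consistency check of the Speed-of-Light numbers, the reported Duration is just Elapsed Cycles divided by the SM clock ("cycle/nsecond" is simply GHz); the small residual comes from the rounded frequency value:

    # Duration vs. cycles for the cutlass kernel above.
    elapsed_cycles = 8330
    sm_freq_ghz    = 1.27                                   # "SM Frequency" in cycle/nsecond
    duration_us    = elapsed_cycles / sm_freq_ghz / 1000.0  # ≈ 6.56 us vs. the reported 6.53 us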
void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:10, Context 1, Stream 28

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     9.03
  SM Frequency                       cycle/nsecond     1.40
  Elapsed Cycles                     cycle             5,063
  Memory [%]                         %                 8.84
  DRAM Throughput                    %                 8.84
  Duration                           usecond           3.62
  L1/TEX Cache Throughput            %                 10.29
  L2 Cache Throughput                %                 7.87
  SM Active Cycles                   cycle             1,390.15
  Compute (SM) [%]                   %                 2.90
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           256
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   30
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            8,192
  Waves Per SM                                         0.08
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             8
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             6
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 16.06
  Achieved Active Warps Per SM       warp              7.71
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:10, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.82
  SM Frequency                       cycle/nsecond     1.22
  Elapsed Cycles                     cycle             5,105
  Memory [%]                         %                 3.10
  DRAM Throughput                    %                 2.51
  Duration                           usecond           4.19
  L1/TEX Cache Throughput            %                 11.63
  L2 Cache Throughput                %                 2.63
  SM Active Cycles                   cycle             1,360.07
  Compute (SM) [%]                   %                 10.68
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   18
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.7%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 74.28
  Achieved Active Warps Per SM       warp              35.65
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (74.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
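A rough sketch of the "partial wave" (tail effect) accounting in the launch warning above, assuming 68 SMs and the 3-blocks-per-SM warp limit from the Occupancy table; the tool's own bookkeeping differs by one block, so treat this as an approximation:

    # Tail-effect estimate for the 272-block CatArrayBatchedCopy launch.
    grid_size     = 272
    num_sms       = 68
    blocks_per_sm = 3                                  # 48 warps / 16 warps per 512-thread block

    blocks_per_wave = num_sms * blocks_per_sm          # 204 blocks execute concurrently
    full_waves      = grid_size // blocks_per_wave     # 1
    tail_blocks     = grid_size %  blocks_per_wave     # 68 (ncu reports 67)
    waves_per_sm    = grid_size / blocks_per_wave      # ≈ 1.33, matching "Waves Per SM"
    # During the tail only a fraction of the block slots are occupied, which is why
    # the warning attributes up to half the runtime to a lower-occupancy phase.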
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:10, Context 1, Stream 7 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 7.40 SM Frequency cycle/nsecond 1.15 Elapsed Cycles cycle 5,084 Memory [%] % 2.40 DRAM Throughput % 0.01 Duration usecond 4.42 L1/TEX Cache Throughput % 3.21 L2 Cache Throughput % 2.40 SM Active Cycles cycle 1,483.26 Compute (SM) [%] % 1.22 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 64 Function Cache Configuration cudaFuncCachePreferNone Grid Size 32 Registers Per Thread register/thread 20 Shared Memory Configuration Size Kbyte 16.38 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 2,048 Waves Per SM 0.03 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 42 Block Limit Shared Mem block 16 Block Limit Warps block 24 Theoretical Active Warps per SM warp 32 Theoretical Occupancy % 66.67 Achieved Occupancy % 4.19 Achieved Active Warps Per SM warp 2.01 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. 
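A minimal sketch of why a 64-thread block caps theoretical occupancy at 66.7% here, assuming a 48-warp, 16-block SM; the register and warp limits are taken directly from the table and do not bind:

    # Block-count-bound occupancy for the 64-thread elementwise kernel above.
    max_warps_sm    = 48            # assumption (GA10x-class SM)
    max_blocks_sm   = 16
    warps_per_block = 64 // 32      # 2

    limit_smem = int(16.38 // 1.02)                       # 16: only the 1.02 KB/block driver reservation is used
    blocks     = min(max_blocks_sm, limit_smem, 42, 24)   # 16 -> blocks-per-SM (and driver smem) bind
    occupancy  = blocks * warps_per_block / max_warps_sm  # 32 / 48 ≈ 66.7 %
    # Larger blocks (128+ threads) would let the 16-block cap cover all 48 warp slots.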
ampere_sgemm_32x32_sliced1x4_tn, 2023-Apr-06 16:57:11, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.65
  SM Frequency                       cycle/nsecond     1.35
  Elapsed Cycles                     cycle             13,743
  Memory [%]                         %                 2.55
  DRAM Throughput                    %                 1.86
  Duration                           usecond           10.18
  L1/TEX Cache Throughput            %                 25.54
  L2 Cache Throughput                %                 2.47
  SM Active Cycles                   cycle             1,374.06
  Compute (SM) [%]                   %                 2.25
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            8
  Registers Per Thread               register/thread   86
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     Kbyte/block       32.77
  Threads                            thread            1,024
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             3
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              12
  Theoretical Occupancy              %                 25
  Achieved Occupancy                 %                 8.30
  Achieved Active Warps Per SM       warp              3.98
  WRN  This kernel's theoretical occupancy (25.0%) is limited by the required amount of shared memory. The difference between calculated theoretical (25.0%) and measured achieved occupancy (8.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.53
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             3,622
  Memory [%]                         %                 2.03
  DRAM Throughput                    %                 1.78
  Duration                           usecond           3.07
  L1/TEX Cache Throughput            %                 2.51
  L2 Cache Throughput                %                 2.03
  SM Active Cycles                   cycle             753.94
  Compute (SM) [%]                   %                 0.46
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   19
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            2,048
  Waves Per SM                                         0.03
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 4.22
  Achieved Active Warps Per SM       warp              2.03
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (4.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
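A speculative sketch of why this tanh launch is so tiny: the first template argument of vectorized_elementwise_kernel is the per-thread work size, so in this PyTorch build (assumed to use 64-thread blocks and 4 elements per thread) a 32-block grid implies a tensor of at most roughly 8,192 elements, far too little work for 68 SMs:

    # Approximate grid sizing for PyTorch's vectorized elementwise launch (assumptions noted above).
    numel      = 8192                                    # inferred upper bound from the launch shape
    block_size = 64
    vec_elems  = 4                                       # the <(int)4, ...> template parameter
    grid_size  = -(-numel // (vec_elems * block_size))   # ceil division -> 32 blocks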
void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.08
  SM Frequency                       cycle/nsecond     1.26
  Elapsed Cycles                     cycle             5,074
  Memory [%]                         %                 0.77
  DRAM Throughput                    %                 0.04
  Duration                           usecond           4.03
  L1/TEX Cache Throughput            %                 1.20
  L2 Cache Throughput                %                 0.77
  SM Active Cycles                   cycle             1,254.99
  Compute (SM) [%]                   %                 1.36
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            32
  Registers Per Thread               register/thread   32
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            4,096
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             16
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 8.43
  Achieved Active Warps Per SM       warp              4.05
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:12, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     9.16
  SM Frequency                       cycle/nsecond     1.42
  Elapsed Cycles                     cycle             5,325
  Memory [%]                         %                 1.11
  DRAM Throughput                    %                 0.61
  Duration                           usecond           3.74
  L1/TEX Cache Throughput            %                 2.89
  L2 Cache Throughput                %                 1.11
  SM Active Cycles                   cycle             426.87
  Compute (SM) [%]                   %                 1.06
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           256
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            16
  Registers Per Thread               register/thread   28
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            4,096
  Waves Per SM                                         0.04
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             8
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             6
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 16.70
  Achieved Active Warps Per SM       warp              8.02
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.36
  SM Frequency                       cycle/nsecond     1.12
  Elapsed Cycles                     cycle             7,269
  Memory [%]                         %                 35.30
  DRAM Throughput                    %                 2.49
  Duration                           usecond           6.50
  L1/TEX Cache Throughput            %                 37.80
  L2 Cache Throughput                %                 35.30
  SM Active Cycles                   cycle             4,779.07
  Compute (SM) [%]                   %                 30.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            960
  Registers Per Thread               register/thread   20
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            61,440
  Waves Per SM                                         0.88

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 48.11
  Achieved Active Warps Per SM       warp              23.09
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.56
  SM Frequency                       cycle/nsecond     1.30
  Elapsed Cycles                     cycle             18,581
  Memory [%]                         %                 35.49
  DRAM Throughput                    %                 35.49
  Duration                           usecond           14.27
  L1/TEX Cache Throughput            %                 18.01
  L2 Cache Throughput                %                 25.87
  SM Active Cycles                   cycle             13,854.76
  Compute (SM) [%]                   %                 42.70
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   26
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.9%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             4
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 73.06
  Achieved Active Warps Per SM       warp              35.07
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.73
  SM Frequency                       cycle/nsecond     1.34
  Elapsed Cycles                     cycle             13,977
  Memory [%]                         %                 40.58
  DRAM Throughput                    %                 40.58
  Duration                           usecond           10.43
  L1/TEX Cache Throughput            %                 20.56
  L2 Cache Throughput                %                 20.42
  SM Active Cycles                   cycle             10,105.90
  Compute (SM) [%]                   %                 19.44
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            60
  Registers Per Thread               register/thread   90
  Shared Memory Configuration Size   Kbyte             102.40
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    Kbyte/block       49.15
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.44
  WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             2
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              8
  Theoretical Occupancy              %                 16.67
  Achieved Occupancy                 %                 8.35
  Achieved Active Warps Per SM       warp              4.01
  WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:13, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.32
  SM Frequency                       cycle/nsecond     1.13
  Elapsed Cycles                     cycle             6,479
  Memory [%]                         %                 9.95
  DRAM Throughput                    %                 3.69
  Duration                           usecond           5.70
  L1/TEX Cache Throughput            %                 9.11
  L2 Cache Throughput                %                 9.95
  SM Active Cycles                   cycle             4,206.78
  Compute (SM) [%]                   %                 3.90
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            120
  Registers Per Thread               register/thread   22
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.11
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 7.31
  Achieved Active Warps Per SM       warp              3.51
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.55
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             3,884
  Memory [%]                         %                 6.18
  DRAM Throughput                    %                 6.18
  Duration                           usecond           3.30
  L1/TEX Cache Throughput            %                 3.70
  L2 Cache Throughput                %                 6.05
  SM Active Cycles                   cycle             1,770.01
  Compute (SM) [%]                   %                 1.61
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           64
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            120
  Registers Per Thread               register/thread   19
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            7,680
  Waves Per SM                                         0.11
  WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             42
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             24
  Theoretical Active Warps per SM    warp              32
  Theoretical Occupancy              %                 66.67
  Achieved Occupancy                 %                 6.90
  Achieved Active Warps Per SM       warp              3.31
  WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM. The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     6.80
  SM Frequency                       cycle/nsecond     1.06
  Elapsed Cycles                     cycle             5,637
  Memory [%]                         %                 6.40
  DRAM Throughput                    %                 4.26
  Duration                           usecond           5.31
  L1/TEX Cache Throughput            %                 7.81
  L2 Cache Throughput                %                 6.40
  SM Active Cycles                   cycle             3,252.03
  Compute (SM) [%]                   %                 10.45
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            60
  Registers Per Thread               register/thread   28
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        16
  Threads                            thread            30,720
  Waves Per SM                                         0.29
  WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             4
  Block Limit Shared Mem             block             7
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 32.64
  Achieved Active Warps Per SM       warp              15.67
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (32.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
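A small sketch relating the two achieved-occupancy figures in the table above: Achieved Occupancy is simply Achieved Active Warps Per SM over the per-SM warp capacity (48 warps is an assumption about this GPU, consistent with the 100% theoretical value being 48 warps):

    # Achieved occupancy vs. achieved active warps for the reduce_kernel launch.
    achieved_active_warps = 15.67
    max_warps_sm          = 48
    achieved_occupancy    = achieved_active_warps / max_warps_sm   # ≈ 0.326 -> 32.6 %
    # With only 60 blocks on 68 SMs, several SMs never receive a block, so the
    # time-averaged warp count stays well below the 48-warp theoretical value.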
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:14, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.17
  SM Frequency                       cycle/nsecond     1.27
  Elapsed Cycles                     cycle             4,182
  Memory [%]                         %                 1.29
  DRAM Throughput                    %                 0.72
  Duration                           usecond           3.30
  L1/TEX Cache Throughput            %                 9.66
  L2 Cache Throughput                %                 1.29
  SM Active Cycles                   cycle             506.68
  Compute (SM) [%]                   %                 1.17
  WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            16
  Registers Per Thread               register/thread   21
  Shared Memory Configuration Size   Kbyte             16.38
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            2,048
  Waves Per SM                                         0.02
  WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             21
  Block Limit Shared Mem             block             16
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 8.37
  Achieved Active Warps Per SM       warp              4.02
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     8.88
  SM Frequency                       cycle/nsecond     1.36
  Elapsed Cycles                     cycle             11,881
  Memory [%]                         %                 32.04
  DRAM Throughput                    %                 32.04
  Duration                           usecond           8.70
  L1/TEX Cache Throughput            %                 22.51
  L2 Cache Throughput                %                 14.94
  SM Active Cycles                   cycle             9,631.32
  Compute (SM) [%]                   %                 25.22
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           128
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            1,024
  Registers Per Thread               register/thread   45
  Shared Memory Configuration Size   Kbyte             65.54
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     Kbyte/block       2.56
  Threads                            thread            131,072
  Waves Per SM                                         1.51
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.8%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             10
  Block Limit Shared Mem             block             18
  Block Limit Warps                  block             12
  Theoretical Active Warps per SM    warp              40
  Theoretical Occupancy              %                 83.33
  Achieved Occupancy                 %                 61.86
  Achieved Active Warps Per SM       warp              29.69
  WRN  This kernel's theoretical occupancy (83.3%) is limited by the number of required registers. The difference between calculated theoretical (83.3%) and measured achieved occupancy (61.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
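A sketch of the register-bound 83.3% theoretical occupancy reported above. It assumes 65,536 registers and 48 resident warps per SM, and a 256-register per-warp allocation granularity (typical for this class of GPU, but not stated anywhere in the report):

    # Reproduce the gemv2N_kernel occupancy limits from its Launch Statistics.
    import math

    regs_per_thread = 45
    warps_per_block = 128 // 32                                    # 4
    regs_per_warp   = math.ceil(regs_per_thread * 32 / 256) * 256  # 1440 rounded up to 1536 (assumed granularity)
    limit_regs      = 65536 // (regs_per_warp * warps_per_block)   # 10 blocks per SM -> "Block Limit Registers"
    limit_smem      = int(65.54 // (2.56 + 1.02))                  # 18 (static + driver shared memory per block)
    limit_warps     = 48 // warps_per_block                        # 12

    blocks    = min(16, limit_regs, limit_smem, limit_warps)       # 10: registers bind
    occupancy = blocks * warps_per_block / 48                      # 40 / 48 ≈ 83.3 %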
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:15, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  DRAM Frequency                     cycle/nsecond     7.61
  SM Frequency                       cycle/nsecond     1.18
  Elapsed Cycles                     cycle             5,189
  Memory [%]                         %                 3.20
  DRAM Throughput                    %                 3.08
  Duration                           usecond           4.38
  L1/TEX Cache Throughput            %                 9.85
  L2 Cache Throughput                %                 3.19
  SM Active Cycles                   cycle             1,681.49
  Compute (SM) [%]                   %                 11.21
  WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

  Section: Launch Statistics
  Block Size                                           512
  Function Cache Configuration                         cudaFuncCachePreferNone
  Grid Size                                            272
  Registers Per Thread               register/thread   24
  Shared Memory Configuration Size   Kbyte             8.19
  Driver Shared Memory Per Block     Kbyte/block       1.02
  Dynamic Shared Memory Per Block    byte/block        0
  Static Shared Memory Per Block     byte/block        0
  Threads                            thread            139,264
  Waves Per SM                                         1.33
  WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full wave and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 38.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

  Section: Occupancy
  Block Limit SM                     block             16
  Block Limit Registers              block             5
  Block Limit Shared Mem             block             8
  Block Limit Warps                  block             3
  Theoretical Active Warps per SM    warp              48
  Theoretical Occupancy              %                 100
  Achieved Occupancy                 %                 61.87
  Achieved Active Warps Per SM       warp              29.70
  WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (61.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:15, Context 1, Stream 29 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.54 SM Frequency cycle/nsecond 1.33 Elapsed Cycles cycle 11,764 Memory [%] % 4.49 DRAM Throughput % 3.39 Duration usecond 8.83 L1/TEX Cache Throughput % 25.76 L2 Cache Throughput % 4.49 SM Active Cycles cycle 890.47 Compute (SM) [%] % 1.92 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 8 Registers Per Thread register/thread 90 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 49.15 Static Shared Memory Per Block byte/block 0 Threads thread 1,024 Waves Per SM 0.06 ---------------------------------------------------------------------- --------------- ------------------------------ WRN The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations. Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 5 Block Limit Shared Mem block 2 Block Limit Warps block 12 Theoretical Active Warps per SM warp 8 Theoretical Occupancy % 16.67 Achieved Occupancy % 8.24 Achieved Active Warps Per SM warp 3.96 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:15, Context 1, Stream 30 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.33 SM Frequency cycle/nsecond 1.30 Elapsed Cycles cycle 8,415 Memory [%] % 3.39 DRAM Throughput % 1.90 Duration usecond 6.46 L1/TEX Cache Throughput % 26.87 L2 Cache Throughput % 3.39 SM Active Cycles cycle 588.82 Compute (SM) [%] % 1.40 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details. 
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.33 %
    Achieved Active Warps Per SM: 4.00 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:16, Context 1, Stream 30
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.43 cycle/nsecond
    SM Frequency: 1.46 cycle/nsecond
    Elapsed Cycles: 5,255 cycle
    Memory [%]: 8.54
    DRAM Throughput: 8.54 %
    Duration: 3.58 usecond
    L1/TEX Cache Throughput: 10.85 %
    L2 Cache Throughput: 7.60 %
    SM Active Cycles: 1,319.07 cycle
    Compute (SM) [%]: 2.80
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.94 %
    Achieved Active Warps Per SM: 7.65 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.78 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,102 cycle
    Memory [%]: 4.80
    DRAM Throughput: 4.37 %
    Duration: 4.22 usecond
    L1/TEX Cache Throughput: 12.18 %
    L2 Cache Throughput: 4.39 %
    SM Active Cycles: 2,009.10 cycle
    Compute (SM) [%]: 16.06
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 66.65 %
    Achieved Active Warps Per SM: 31.99 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (66.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.47 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 21,453 cycle
    Memory [%]: 72.91
    DRAM Throughput: 49.62 %
    Duration: 16.70 usecond
    L1/TEX Cache Throughput: 74.73 %
    L2 Cache Throughput: 72.91 %
    SM Active Cycles: 18,766.71 cycle
    Compute (SM) [%]: 45.24
    WRN: Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 5,419
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 346,816 thread
    Waves Per SM: 4.98
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 54.75 %
    Achieved Active Warps Per SM: 26.28 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:16, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.92 cycle/nsecond
    SM Frequency: 1.37 cycle/nsecond
    Elapsed Cycles: 96,602 cycle
    Memory [%]: 37.21
    DRAM Throughput: 37.21 %
    Duration: 70.18 usecond
    L1/TEX Cache Throughput: 29.19 %
    L2 Cache Throughput: 26.57 %
    SM Active Cycles: 61,162.43 cycle
    Compute (SM) [%]: 18.59
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 88
    Registers Per Thread: 250 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 98.30 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 11,264 thread
    Waves Per SM: 1.29
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor.
         This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 2 block
    Block Limit Shared Mem: 1 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 4 warp
    Theoretical Occupancy: 8.33 %
    Achieved Occupancy: 8.26 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN: This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.86 cycle/nsecond
    SM Frequency: 1.20 cycle/nsecond
    Elapsed Cycles: 20,023 cycle
    Memory [%]: 53.27
    DRAM Throughput: 53.27 %
    Duration: 16.58 usecond
    L1/TEX Cache Throughput: 15.11 %
    L2 Cache Throughput: 24.38 %
    SM Active Cycles: 16,883.25 cycle
    Compute (SM) [%]: 36.43
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.6 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 128
    Registers Per Thread: 40 register/thread
    Shared Memory Configuration Size: 32.77 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 8.19 Kbyte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 65,536 thread
    Waves Per SM: 0.63
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 3 block
    Block Limit Shared Mem: 3 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 60.89 %
    Achieved Active Warps Per SM: 29.23 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (60.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel.
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.52 cycle/nsecond
    SM Frequency: 1.18 cycle/nsecond
    Elapsed Cycles: 4,759 cycle
    Memory [%]: 0.93
    DRAM Throughput: 0.15 %
    Duration: 4.03 usecond
    L1/TEX Cache Throughput: 1.17 %
    L2 Cache Throughput: 0.93 %
    SM Active Cycles: 1,285.32 cycle
    Compute (SM) [%]: 1.45
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 32 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 16 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.18 %
    Achieved Active Warps Per SM: 3.93 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
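The 0.04 "Waves Per SM" reported for this indexSelectLargeIndex launch can be reproduced from the numbers in its own report, assuming the Nsight Compute hardware-model definition of a wave (grid size divided by SM count times the resident-block limit). A back-of-envelope check, with all inputs copied from the report:

    // Back-of-envelope check of "Waves Per SM" for the indexSelectLargeIndex launch above
    // (assumes waves = grid size / (SM count * resident blocks per SM)).
    int numSMs           = 68;   // multiprocessors reported for this GPU
    int gridSize         = 32;   // blocks launched
    int blockLimitWarps  = 12;   // tightest resident-block limit in the Occupancy section
    double blocksPerWave = (double)numSMs * blockLimitWarps;   // 816 blocks fill the GPU once
    double waves         = gridSize / blocksPerWave;           // 32 / 816 ~= 0.04, as reported

The same arithmetic explains the 1.33 waves of the CatArrayBatchedCopy launches (272 blocks / (68 SMs x 3 resident blocks)).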
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:17, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.03 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 5,299 cycle
    Memory [%]: 1.11
    DRAM Throughput: 0.61 %
    Duration: 3.78 usecond
    L1/TEX Cache Throughput: 3.00 %
    L2 Cache Throughput: 1.11 %
    SM Active Cycles: 411.87 cycle
    Compute (SM) [%]: 1.06
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN: The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.59 %
    Achieved Active Warps Per SM: 7.96 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.6%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
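Several launches in this trace (16 or 32 blocks on a 68-SM GPU) trigger the "grid is too small" warning. For kernels one controls, the runtime can both propose a block size and report the smallest grid that reaches full occupancy. A minimal sketch, assuming a hypothetical grid-stride kernel; scaleKernel and launchScale are illustrative names, not the dropout kernel above:

    // Hypothetical sketch: let the runtime suggest a block size, then never launch a grid
    // smaller than the size needed to put work on every SM.
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float* data, float alpha, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= alpha;
    }

    void launchScale(float* data, float alpha, int n) {
        int minGridSize = 0, blockSize = 0;
        // minGridSize = smallest grid that can reach full occupancy on this device
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
        int neededBlocks = (n + blockSize - 1) / blockSize;
        int gridSize = neededBlocks > minGridSize ? neededBlocks : minGridSize;
        scaleKernel<<<gridSize, blockSize>>>(data, alpha, n);
    }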
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.37 cycle/nsecond
    SM Frequency: 1.12 cycle/nsecond
    Elapsed Cycles: 7,242 cycle
    Memory [%]: 35.45
    DRAM Throughput: 2.24 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 37.96 %
    L2 Cache Throughput: 35.45 %
    SM Active Cycles: 4,840.43 cycle
    Compute (SM) [%]: 30.56
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 960
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 61,440 thread
    Waves Per SM: 0.88
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 48.14 %
    Achieved Active Warps Per SM: 23.11 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
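The memory-bound unrolled_elementwise_kernel launch earlier in the trace (L2 throughput around 73%, DRAM around 50%) draws the "do more work per memory access (kernel fusion)" suggestion. A minimal sketch of what that suggestion means, using illustrative operations rather than the actual PyTorch kernels: two elementwise passes each read and write every element, while the fused version touches memory once and needs no intermediate buffer.

    // Unfused: two kernels, two full passes over global memory, plus an intermediate array y.
    __global__ void scalePass(const float* x, float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i];
    }
    __global__ void addPass(const float* y, float* z, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = y[i] + b;
    }

    // Fused: one kernel, one read of x and one write of z per element.
    __global__ void scaleAddFused(const float* x, float* z, float a, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = a * x[i] + b;
    }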
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.51 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 18,660 cycle
    Memory [%]: 35.38
    DRAM Throughput: 35.38 %
    Duration: 14.43 usecond
    L1/TEX Cache Throughput: 17.91 %
    L2 Cache Throughput: 25.73 %
    SM Active Cycles: 13,814.72 cycle
    Compute (SM) [%]: 42.49
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 26 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 73.45 %
    Achieved Active Warps Per SM: 35.25 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.65 cycle/nsecond
    SM Frequency: 1.33 cycle/nsecond
    Elapsed Cycles: 14,013 cycle
    Memory [%]: 40.58
    DRAM Throughput: 40.58 %
    Duration: 10.53 usecond
    L1/TEX Cache Throughput: 20.21 %
    L2 Cache Throughput: 20.43 %
    SM Active Cycles: 10,277.59 cycle
    Compute (SM) [%]: 19.43
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.44
    WRN: The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.27 %
    Achieved Active Warps Per SM: 3.97 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
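The recurring 16.67% theoretical occupancy of these cutlass launches comes from their large dynamic shared-memory request: roughly 49 KB per block on a ~102 KB shared-memory configuration leaves room for only 2 resident blocks per SM. The same reasoning can be reproduced for one's own kernels with the occupancy API; the sketch below is hypothetical (smemHeavyKernel and the 48 KB figure are illustrative, not the cutlass kernel itself).

    // Hypothetical sketch: ask the runtime how many blocks fit per SM once a large
    // dynamic shared-memory allocation is taken into account.
    #include <cstdio>
    #include <cuda_runtime.h>

    extern __shared__ float tile[];                       // dynamic shared memory

    __global__ void smemHeavyKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // stage through shared memory
        if (i < n) out[i] = tile[threadIdx.x];
    }

    int main() {
        const int blockSize = 128;
        const size_t dynamicSmem = 48 * 1024;             // ~48 KB per block, illustrative
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smemHeavyKernel,
                                                      blockSize, dynamicSmem);
        printf("resident blocks per SM with %zu bytes of dynamic smem: %d\n",
               dynamicSmem, blocksPerSM);
        // Fewer resident blocks means fewer warps to hide latency; shrinking the per-block
        // tile (or enlarging the shared-memory carveout) raises this limit.
        return 0;
    }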
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:18, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.33 cycle/nsecond
    SM Frequency: 1.14 cycle/nsecond
    Elapsed Cycles: 6,413 cycle
    Memory [%]: 10.05
    DRAM Throughput: 3.72 %
    Duration: 5.63 usecond
    L1/TEX Cache Throughput: 9.20 %
    L2 Cache Throughput: 10.05 %
    SM Active Cycles: 4,185.32 cycle
    Compute (SM) [%]: 3.94
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 22 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 7.24 %
    Achieved Active Warps Per SM: 3.47 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.64 cycle/nsecond
    SM Frequency: 1.19 cycle/nsecond
    Elapsed Cycles: 3,918 cycle
    Memory [%]: 6.10
    DRAM Throughput: 6.10 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 3.61 %
    L2 Cache Throughput: 5.99 %
    SM Active Cycles: 1,816.96 cycle
    Compute (SM) [%]: 1.60
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 19 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN: If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 6.68 %
    Achieved Active Warps Per SM: 3.20 warp
    WRN: This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
         This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
         The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
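The 66.67% ceiling reported for the 64-thread elementwise launches above follows directly from the block limits in their Occupancy sections: a 64-thread block is only 2 warps, and at most 16 such blocks can be resident, so 32 of the 48 warp slots per SM can ever be filled. A back-of-envelope check, with all inputs taken from the report (the 48-warp maximum is implied by the 100%-occupancy kernels elsewhere in this trace):

    // Reproduce the reported theoretical occupancy for a 64-thread block.
    int warpsPerBlock   = 64 / 32;            // 2 warps per block
    int blockLimitSM    = 16;                 // "Block Limit SM"
    int blockLimitRegs  = 42;                 // "Block Limit Registers"
    int blockLimitSmem  = 16;                 // "Block Limit Shared Mem"
    int blockLimitWarps = 24;                 // "Block Limit Warps" (48 / 2)
    int residentBlocks  = blockLimitSM;       // take the minimum of the four limits
    if (blockLimitRegs  < residentBlocks) residentBlocks = blockLimitRegs;
    if (blockLimitSmem  < residentBlocks) residentBlocks = blockLimitSmem;
    if (blockLimitWarps < residentBlocks) residentBlocks = blockLimitWarps;   // stays 16
    int activeWarps     = residentBlocks * warpsPerBlock;                     // 32 warps
    double theoreticalOcc = 100.0 * activeWarps / 48;                         // 66.67%, as reported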
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 6.74 cycle/nsecond
    SM Frequency: 1.05 cycle/nsecond
    Elapsed Cycles: 5,603 cycle
    Memory [%]: 6.43
    DRAM Throughput: 4.27 %
    Duration: 5.34 usecond
    L1/TEX Cache Throughput: 8.07 %
    L2 Cache Throughput: 6.43 %
    SM Active Cycles: 3,150.84 cycle
    Compute (SM) [%]: 10.51
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 30,720 thread
    Waves Per SM: 0.29
    WRN: The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 7 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 33.02 %
    Achieved Active Warps Per SM: 15.85 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (33.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:19, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.98 cycle/nsecond
    SM Frequency: 1.24 cycle/nsecond
    Elapsed Cycles: 4,092 cycle
    Memory [%]: 1.32
    DRAM Throughput: 0.74 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 9.77 %
    L2 Cache Throughput: 1.32 %
    SM Active Cycles: 501.09 cycle
    Compute (SM) [%]: 1.20
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 21 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.02
    WRN: The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 21 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.38 %
    Achieved Active Warps Per SM: 4.02 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.22 cycle/nsecond
    SM Frequency: 1.41 cycle/nsecond
    Elapsed Cycles: 11,975 cycle
    Memory [%]: 31.79
    DRAM Throughput: 31.79 %
    Duration: 8.45 usecond
    L1/TEX Cache Throughput: 22.70 %
    L2 Cache Throughput: 14.82 %
    SM Active Cycles: 9,554.74 cycle
    Compute (SM) [%]: 25.02
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 1,024
    Registers Per Thread: 45 register/thread
    Shared Memory Configuration Size: 65.54 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 2.56 Kbyte/block
    Threads: 131,072 thread
    Waves Per SM: 1.51
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.0%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 10 block
    Block Limit Shared Mem: 18 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 40 warp
    Theoretical Occupancy: 83.33 %
    Achieved Occupancy: 62.46 %
    Achieved Active Warps Per SM: 29.98 warp
    WRN: This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
         The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
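This is the only report in the section whose occupancy is capped by register pressure (45 registers per thread limits residency to 10 blocks per SM). The gemv kernel belongs to cuBLAS and cannot be recompiled, but for one's own kernels the usual knob is __launch_bounds__; the sketch below is hypothetical and only illustrates the mechanism.

    // Hypothetical sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    // tells the compiler the intended launch shape so it keeps register usage within the
    // budget implied by that occupancy target.
    __global__ void __launch_bounds__(128, 12)
    boundedAxpyKernel(const float* x, float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

The trade-off is that forcing fewer registers can introduce spills to local memory, so any occupancy gain has to be weighed against the extra memory traffic; nvcc's -maxrregcount flag applies a similar cap per compilation unit.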
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:20, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.80 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,247 cycle
    Memory [%]: 3.16
    DRAM Throughput: 3.05 %
    Duration: 4.32 usecond
    L1/TEX Cache Throughput: 9.67 %
    L2 Cache Throughput: 3.14 %
    SM Active Cycles: 1,713.79 cycle
    Compute (SM) [%]: 11.12
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
  Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 24 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN: A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 37.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 62.92 %
    Achieved Active Warps Per SM: 30.20 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (62.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
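A pattern across this whole section is that most kernels run for only 3-10 microseconds at very low utilization, so per-launch CPU overhead and idle SMs dominate the timeline. Where the launch sequence is static, CUDA graph capture can replay many short launches with a single call; the sketch below is a generic, hypothetical example (tinyKernel and the 8-launch loop are illustrative), not something taken from this application.

    // Hypothetical sketch: capture a fixed sequence of short launches once, then replay it.
    #include <cuda_runtime.h>

    __global__ void tinyKernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    void runWithGraph(float* data, int n, int iterations) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int k = 0; k < 8; ++k)                   // the short, fixed launch sequence
            tinyKernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        for (int it = 0; it < iterations; ++it)       // one launch call per replay
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
    }

PyTorch exposes a comparable capture mechanism (torch.cuda.CUDAGraph); whether it applies here depends on how static the model's launch sequence is.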
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:20, Context 1, Stream 31
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.63 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 11,851 cycle
    Memory [%]: 4.46
    DRAM Throughput: 3.36 %
    Duration: 8.83 usecond
    L1/TEX Cache Throughput: 26.00 %
    L2 Cache Throughput: 4.46 %
    SM Active Cycles: 882.29 cycle
    Compute (SM) [%]: 1.91
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:21, Context 1, Stream 32
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.25 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 8,382 cycle
    Memory [%]: 3.40
    DRAM Throughput: 1.90 %
    Duration: 6.53 usecond
    L1/TEX Cache Throughput: 25.84 %
    L2 Cache Throughput: 3.40 %
    SM Active Cycles: 611.31 cycle
    Compute (SM) [%]: 1.40
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN: The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 7.92 %
    Achieved Active Warps Per SM: 3.80 warp
    WRN: This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
         See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:21, Context 1, Stream 32
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.62 cycle/nsecond
    SM Frequency: 1.34 cycle/nsecond
    Elapsed Cycles: 4,893 cycle
    Memory [%]: 9.18
    DRAM Throughput: 9.18 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 10.64 %
    L2 Cache Throughput: 8.16 %
    SM Active Cycles: 1,344.53 cycle
    Compute (SM) [%]: 3.00
    WRN: This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN: The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.
  Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 15.54 %
    Achieved Active Warps Per SM: 7.46 warp
    WRN: This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (15.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:21, Context 1, Stream 7
  Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.43 cycle/nsecond
    SM Frequency: 1.15 cycle/nsecond
    Elapsed Cycles: 4,885 cycle
    Memory [%]: 5.01
    DRAM Throughput: 4.54 %
    Duration: 4.26 usecond
    L1/TEX Cache Throughput: 12.67 %
    L2 Cache Throughput: 4.59 %
    SM Active Cycles: 1,931.32 cycle
    Compute (SM) [%]: 16.80
    WRN: This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 67.24 %
    Achieved Active Warps Per SM: 32.28 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (67.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:21, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.53 cycle/nsecond
    SM Frequency: 1.29 cycle/nsecond
    Elapsed Cycles: 21,655 cycle
    Memory [%]: 72.27
    DRAM Throughput: 48.57 %
    Duration: 16.74 usecond
    L1/TEX Cache Throughput: 74.03 %
    L2 Cache Throughput: 72.27 %
    SM Active Cycles: 18,808.60 cycle
    Compute (SM) [%]: 44.83
    WRN  Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 5,419
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 346,816 thread
    Waves Per SM: 4.98

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 54.75 %
    Achieved Active Warps Per SM: 26.28 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:21, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 9.07 cycle/nsecond
    SM Frequency: 1.40 cycle/nsecond
    Elapsed Cycles: 96,474 cycle
    Memory [%]: 37.23
    DRAM Throughput: 37.23 %
    Duration: 68.99 usecond
    L1/TEX Cache Throughput: 29.07 %
    L2 Cache Throughput: 26.60 %
    SM Active Cycles: 61,365.76 cycle
    Compute (SM) [%]: 18.61
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 88
    Registers Per Thread: 250 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 98.30 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 11,264 thread
    Waves Per SM: 1.29
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 2 block
    Block Limit Shared Mem: 1 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 4 warp
    Theoretical Occupancy: 8.33 %
    Achieved Occupancy: 8.32 %
    Achieved Active Warps Per SM: 3.99 warp
    WRN  This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.79 cycle/nsecond
    SM Frequency: 1.19 cycle/nsecond
    Elapsed Cycles: 19,703 cycle
    Memory [%]: 54.12
    DRAM Throughput: 54.12 %
    Duration: 16.45 usecond
    L1/TEX Cache Throughput: 15.03 %
    L2 Cache Throughput: 24.81 %
    SM Active Cycles: 16,976.29 cycle
    Compute (SM) [%]: 37.04
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.6 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 128
    Registers Per Thread: 40 register/thread
    Shared Memory Configuration Size: 32.77 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 8.19 Kbyte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 65,536 thread
    Waves Per SM: 0.63
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 3 block
    Block Limit Shared Mem: 3 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 60.75 %
    Achieved Active Warps Per SM: 29.16 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (60.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.04 cycle/nsecond
    SM Frequency: 1.25 cycle/nsecond
    Elapsed Cycles: 5,154 cycle
    Memory [%]: 0.96
    DRAM Throughput: 0.31 %
    Duration: 4.13 usecond
    L1/TEX Cache Throughput: 1.18 %
    L2 Cache Throughput: 0.96 %
    SM Active Cycles: 1,273.43 cycle
    Compute (SM) [%]: 1.34
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 32 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 16 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.10 %
    Achieved Active Warps Per SM: 3.89 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
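As a rough cross-check of the "Theoretical Occupancy" figures in these Occupancy sections: the value follows from the binding block limit and the warps per block. The helper below is a hypothetical sketch, not Nsight Compute output; the 48-warp-per-SM ceiling and warp size of 32 are taken from the 100% rows of this profile.

    # Hypothetical helper reproducing ncu's Theoretical Occupancy arithmetic (a sketch).
    def theoretical_occupancy(block_size, block_limits, max_warps_per_sm=48, warp_size=32):
        warps_per_block = block_size // warp_size
        active_blocks = min(block_limits)          # the binding Block Limit row
        active_warps = active_blocks * warps_per_block
        return active_warps, 100.0 * active_warps / max_warps_per_sm

    # indexSelectLargeIndex above: block size 128, limits SM=16, Registers=16, Shared Mem=16, Warps=12
    warps, occ = theoretical_occupancy(128, [16, 16, 16, 12])
    print(warps, round(occ, 2))                    # 48 warps -> 100.0 %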
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:22, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.26 cycle/nsecond
    SM Frequency: 1.28 cycle/nsecond
    Elapsed Cycles: 4,812 cycle
    Memory [%]: 1.22
    DRAM Throughput: 0.67 %
    Duration: 3.74 usecond
    L1/TEX Cache Throughput: 2.86 %
    L2 Cache Throughput: 1.22 %
    SM Active Cycles: 431.44 cycle
    Compute (SM) [%]: 1.17
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 4,096 thread
    Waves Per SM: 0.04
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.46 %
    Achieved Active Warps Per SM: 7.90 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
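The "Waves Per SM" figure in these Launch Statistics appears to follow grid size / (SM count x active blocks per SM). The sketch below is an assumption-labelled illustration (not part of the profiled application) using the fused_dropout_kernel_vec launch above; it queries the SM count through PyTorch, which is a real API, while the arithmetic itself is my reconstruction.

    import torch

    # Rough reconstruction of Waves Per SM for the launch above (a sketch; requires a CUDA device).
    props = torch.cuda.get_device_properties(0)
    num_sms = props.multi_processor_count      # 68 on the GPU profiled here
    grid_size = 16                             # Grid Size from the report
    blocks_per_sm = 6                          # binding block limit (Block Limit Warps)
    print(round(grid_size / (num_sms * blocks_per_sm), 2))   # ~0.04, matching the report when num_sms == 68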
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.19 cycle/nsecond
    SM Frequency: 1.09 cycle/nsecond
    Elapsed Cycles: 7,224 cycle
    Memory [%]: 35.55
    DRAM Throughput: 2.37 %
    Duration: 6.59 usecond
    L1/TEX Cache Throughput: 38.06 %
    L2 Cache Throughput: 35.55 %
    SM Active Cycles: 4,768.66 cycle
    Compute (SM) [%]: 30.74
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 960
    Registers Per Thread: 20 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 61,440 thread
    Waves Per SM: 0.88

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 48.86 %
    Achieved Active Warps Per SM: 23.45 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (48.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.59 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 18,607 cycle
    Memory [%]: 34.82
    DRAM Throughput: 34.82 %
    Duration: 14.27 usecond
    L1/TEX Cache Throughput: 17.97 %
    L2 Cache Throughput: 25.77 %
    SM Active Cycles: 13,854.60 cycle
    Compute (SM) [%]: 42.62
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 26 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 26.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 73.37 %
    Achieved Active Warps Per SM: 35.22 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.4%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
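The full/partial wave split in the tail-effect warning above can be approximated from the grid size and the binding block limit. This is a back-of-the-envelope sketch under those assumptions; Nsight Compute's own accounting can differ by a block (it reports 67 here).

    # Rough estimate of the wave split for the CatArrayBatchedCopy launch above (a sketch).
    num_sms = 68
    blocks_per_sm = 3                          # binding block limit (Block Limit Warps)
    grid_size = 272                            # Grid Size from the report

    blocks_per_wave = num_sms * blocks_per_sm  # 204 blocks in a full wave
    full_waves, partial = divmod(grid_size, blocks_per_wave)
    print(full_waves, partial)                 # 1 full wave plus a ~67-68 block partial wave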
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.58 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 14,003 cycle
    Memory [%]: 40.63
    DRAM Throughput: 40.63 %
    Duration: 10.59 usecond
    L1/TEX Cache Throughput: 20.20 %
    L2 Cache Throughput: 20.42 %
    SM Active Cycles: 10,282.13 cycle
    Compute (SM) [%]: 19.43
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.44
    WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.22 %
    Achieved Active Warps Per SM: 3.95 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
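The "Block Limit Shared Mem 2" row for this cutlass kernel follows directly from the shared-memory numbers in its Launch Statistics; the short sketch below just redoes that division (values copied from the report, the arithmetic is my reconstruction).

    # Shared-memory block limit for the cutlass launch above (a sketch).
    shared_config_kb = 102.40                  # Shared Memory Configuration Size
    per_block_kb = 1.02 + 49.15                # Driver + Dynamic Shared Memory Per Block
    print(int(shared_config_kb // per_block_kb))   # 2 blocks per SM -> 2 * 4 warps = 8 warps = 16.67 %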
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:23, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.13 cycle/nsecond
    SM Frequency: 1.11 cycle/nsecond
    Elapsed Cycles: 6,401 cycle
    Memory [%]: 10.01
    DRAM Throughput: 3.74 %
    Duration: 5.76 usecond
    L1/TEX Cache Throughput: 9.21 %
    L2 Cache Throughput: 10.01 %
    SM Active Cycles: 4,229.12 cycle
    Compute (SM) [%]: 3.95
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 22 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 7.18 %
    Achieved Active Warps Per SM: 3.45 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
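Several of the memory-bound warnings earlier suggest doing more work per memory access (kernel fusion). One hedged way to act on that for elementwise chains like the ones profiled here is to script them so PyTorch's fuser may combine them into a single kernel. This is an illustrative sketch, not code from the profiled application, and whether fusion actually happens depends on the PyTorch version and fuser in use.

    import torch

    # Sketch: a scripted chain of elementwise ops that the TorchScript fuser may emit as one kernel,
    # so each element loaded from DRAM is reused for the add, tanh, and scale.
    @torch.jit.script
    def fused_bias_tanh(x: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x + b) * 0.5

    x = torch.randn(4096, 1024, device="cuda")
    b = torch.randn(1024, device="cuda")
    y = fused_bias_tanh(x, b)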
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.37 cycle/nsecond
    SM Frequency: 1.15 cycle/nsecond
    Elapsed Cycles: 3,896 cycle
    Memory [%]: 6.16
    DRAM Throughput: 6.16 %
    Duration: 3.39 usecond
    L1/TEX Cache Throughput: 3.71 %
    L2 Cache Throughput: 5.80 %
    SM Active Cycles: 1,765.78 cycle
    Compute (SM) [%]: 1.61
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 64
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 120
    Registers Per Thread: 19 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 7,680 thread
    Waves Per SM: 0.11
    WRN  If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 42 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 24 block
    Theoretical Active Warps per SM: 32 warp
    Theoretical Occupancy: 66.67 %
    Achieved Occupancy: 6.93 %
    Achieved Active Warps Per SM: 3.33 warp
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory.
    WRN  This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM.
    WRN  The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 6.97 cycle/nsecond
    SM Frequency: 1.08 cycle/nsecond
    Elapsed Cycles: 5,815 cycle
    Memory [%]: 6.16
    DRAM Throughput: 4.11 %
    Duration: 5.38 usecond
    L1/TEX Cache Throughput: 7.85 %
    L2 Cache Throughput: 6.16 %
    SM Active Cycles: 3,235.85 cycle
    Compute (SM) [%]: 10.12
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.3 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 60
    Registers Per Thread: 28 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 16 byte/block
    Threads: 30,720 thread
    Waves Per SM: 0.29
    WRN  The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 4 block
    Block Limit Shared Mem: 7 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 33.16 %
    Achieved Active Warps Per SM: 15.92 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (33.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:24, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.08 cycle/nsecond
    SM Frequency: 1.26 cycle/nsecond
    Elapsed Cycles: 4,150 cycle
    Memory [%]: 1.30
    DRAM Throughput: 0.73 %
    Duration: 3.30 usecond
    L1/TEX Cache Throughput: 9.49 %
    L2 Cache Throughput: 1.30 %
    SM Active Cycles: 515.75 cycle
    Compute (SM) [%]: 1.18
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 16
    Registers Per Thread: 21 register/thread
    Shared Memory Configuration Size: 16.38 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 2,048 thread
    Waves Per SM: 0.02
    WRN  The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 21 block
    Block Limit Shared Mem: 16 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 8.18 %
    Achieved Active Warps Per SM: 3.93 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
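For tiny grids like this softmax_warp_forward launch, the gap between 100% theoretical and ~8% achieved occupancy is mostly just grid size: the whole grid only contains 64 warps. The arithmetic below is a rough sketch assuming at most one block ends up resident per active SM.

    # Why achieved occupancy sits near 8% for the 16-block softmax launch above (a sketch).
    grid_size, block_size, warp_size, max_warps_per_sm = 16, 128, 32, 48
    warps_per_block = block_size // warp_size            # 4 warps per block
    total_warps = grid_size * warps_per_block            # 64 warps in the whole grid
    occ_on_active_sm = 100.0 * warps_per_block / max_warps_per_sm
    print(total_warps, round(occ_on_active_sm, 2))       # 64 warps, ~8.33 % (report: 8.18 %, 3.93 warps)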
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.93 cycle/nsecond
    SM Frequency: 1.37 cycle/nsecond
    Elapsed Cycles: 11,735 cycle
    Memory [%]: 32.47
    DRAM Throughput: 32.47 %
    Duration: 8.54 usecond
    L1/TEX Cache Throughput: 22.77 %
    L2 Cache Throughput: 15.16 %
    SM Active Cycles: 9,522.75 cycle
    Compute (SM) [%]: 25.51
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 1,024
    Registers Per Thread: 45 register/thread
    Shared Memory Configuration Size: 65.54 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 2.56 Kbyte/block
    Threads: 131,072 thread
    Waves Per SM: 1.51
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 24.5%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 10 block
    Block Limit Shared Mem: 18 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 40 warp
    Theoretical Occupancy: 83.33 %
    Achieved Occupancy: 62.94 %
    Achieved Active Warps Per SM: 30.21 warp
    WRN  This kernel's theoretical occupancy (83.3%) is limited by the number of required registers.
    WRN  The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
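The register limit on this gemv2N launch can be reconstructed approximately if one assumes a 64K-register SM and a 256-register warp allocation granularity (both are assumptions about this GPU, not values printed in the report). A sketch:

    import math

    # Approximate "Block Limit Registers 10" for the gemv2N launch above (a sketch with assumed
    # register file size and allocation granularity).
    regs_per_thread, block_size, warp_size = 45, 128, 32
    regfile_per_sm, alloc_granularity = 64 * 1024, 256

    regs_per_warp = math.ceil(regs_per_thread * warp_size / alloc_granularity) * alloc_granularity
    regs_per_block = regs_per_warp * (block_size // warp_size)   # 1536 * 4 = 6144
    block_limit_regs = regfile_per_sm // regs_per_block          # 10 blocks
    warps = block_limit_regs * (block_size // warp_size)         # 40 warps
    print(block_limit_regs, warps, round(100.0 * warps / 48, 2)) # 10, 40, 83.33 %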
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:25, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.04 cycle/nsecond
    SM Frequency: 1.24 cycle/nsecond
    Elapsed Cycles: 5,377 cycle
    Memory [%]: 3.09
    DRAM Throughput: 2.96 %
    Duration: 4.32 usecond
    L1/TEX Cache Throughput: 9.85 %
    L2 Cache Throughput: 3.07 %
    SM Active Cycles: 1,681.04 cycle
    Compute (SM) [%]: 10.81
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 272
    Registers Per Thread: 24 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 139,264 thread
    Waves Per SM: 1.33
    WRN  A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 38.3%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 61.66 %
    Achieved Active Warps Per SM: 29.60 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (61.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:25, Context 1, Stream 33

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.89 cycle/nsecond
    SM Frequency: 1.39 cycle/nsecond
    Elapsed Cycles: 12,251 cycle
    Memory [%]: 4.31
    DRAM Throughput: 3.26 %
    Duration: 8.83 usecond
    L1/TEX Cache Throughput: 25.72 %
    L2 Cache Throughput: 4.31 %
    SM Active Cycles: 892.06 cycle
    Compute (SM) [%]: 1.84
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.

Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.34 %
    Achieved Active Warps Per SM: 4.00 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:26, Context 1, Stream 34

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.40 cycle/nsecond
    SM Frequency: 1.31 cycle/nsecond
    Elapsed Cycles: 8,448 cycle
    Memory [%]: 3.38
    DRAM Throughput: 1.89 %
    Duration: 6.46 usecond
    L1/TEX Cache Throughput: 26.81 %
    L2 Cache Throughput: 3.38 %
    SM Active Cycles: 588.91 cycle
    Compute (SM) [%]: 1.39
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    Block Size: 128
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 8
    Registers Per Thread: 90 register/thread
    Shared Memory Configuration Size: 102.40 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 49.15 Kbyte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 1,024 thread
    Waves Per SM: 0.06
    WRN  The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 2 block
    Block Limit Warps: 12 block
    Theoretical Active Warps per SM: 8 warp
    Theoretical Occupancy: 16.67 %
    Achieved Occupancy: 8.24 %
    Achieved Active Warps Per SM: 3.95 warp
    WRN  This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:26, Context 1, Stream 34

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.45 cycle/nsecond
    SM Frequency: 1.32 cycle/nsecond
    Elapsed Cycles: 4,819 cycle
    Memory [%]: 9.37
    DRAM Throughput: 9.37 %
    Duration: 3.65 usecond
    L1/TEX Cache Throughput: 11.16 %
    L2 Cache Throughput: 8.29 %
    SM Active Cycles: 1,282.44 cycle
    Compute (SM) [%]: 3.05
    WRN  This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
    Block Size: 256
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 32
    Registers Per Thread: 30 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 8,192 thread
    Waves Per SM: 0.08
    WRN  The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 8 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 6 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 16.05 %
    Achieved Active Warps Per SM: 7.70 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.0%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:26, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 7.81 cycle/nsecond
    SM Frequency: 1.21 cycle/nsecond
    Elapsed Cycles: 5,014 cycle
    Memory [%]: 4.88
    DRAM Throughput: 4.45 %
    Duration: 4.13 usecond
    L1/TEX Cache Throughput: 12.32 %
    L2 Cache Throughput: 4.50 %
    SM Active Cycles: 1,985.81 cycle
    Compute (SM) [%]: 16.32
    WRN  This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
    Block Size: 512
    Function Cache Configuration: cudaFuncCachePreferNone
    Grid Size: 408
    Registers Per Thread: 18 register/thread
    Shared Memory Configuration Size: 8.19 Kbyte
    Driver Shared Memory Per Block: 1.02 Kbyte/block
    Dynamic Shared Memory Per Block: 0 byte/block
    Static Shared Memory Per Block: 0 byte/block
    Threads: 208,896 thread
    Waves Per SM: 2

Section: Occupancy
    Block Limit SM: 16 block
    Block Limit Registers: 5 block
    Block Limit Shared Mem: 8 block
    Block Limit Warps: 3 block
    Theoretical Active Warps per SM: 48 warp
    Theoretical Occupancy: 100 %
    Achieved Occupancy: 68.05 %
    Achieved Active Warps Per SM: 32.66 warp
    WRN  This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (68.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:26, Context 1, Stream 7

Section: GPU Speed Of Light Throughput
    DRAM Frequency: 8.62 cycle/nsecond
    SM Frequency: 1.30 cycle/nsecond
    Elapsed Cycles: 21,565 cycle
    Memory [%]: 72.38
    DRAM Throughput: 48.70 %
    Duration: 16.51 usecond
    L1/TEX Cache Throughput: 74.29 %
    L2 Cache Throughput: 72.38 %
    SM Active Cycles: 18,796.43 cycle
    Compute (SM) [%]: 44.98
    WRN  Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 64 Function Cache Configuration cudaFuncCachePreferNone Grid Size 5,419 Registers Per Thread register/thread 20 Shared Memory Configuration Size Kbyte 16.38 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block byte/block 0 Static Shared Memory Per Block byte/block 0 Threads thread 346,816 Waves Per SM 4.98 ---------------------------------------------------------------------- --------------- ------------------------------ Section: Occupancy ---------------------------------------------------------------------- --------------- ------------------------------ Block Limit SM block 16 Block Limit Registers block 42 Block Limit Shared Mem block 16 Block Limit Warps block 24 Theoretical Active Warps per SM warp 32 Theoretical Occupancy % 66.67 Achieved Occupancy % 54.69 Achieved Active Warps Per SM warp 26.25 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.7%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy. void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:26, Context 1, Stream 7 Section: GPU Speed Of Light Throughput ---------------------------------------------------------------------- --------------- ------------------------------ DRAM Frequency cycle/nsecond 8.98 SM Frequency cycle/nsecond 1.38 Elapsed Cycles cycle 96,009 Memory [%] % 37.59 DRAM Throughput % 37.59 Duration usecond 69.34 L1/TEX Cache Throughput % 29.21 L2 Cache Throughput % 26.75 SM Active Cycles cycle 61,011.24 Compute (SM) [%] % 18.71 ---------------------------------------------------------------------- --------------- ------------------------------ WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons. Section: Launch Statistics ---------------------------------------------------------------------- --------------- ------------------------------ Block Size 128 Function Cache Configuration cudaFuncCachePreferNone Grid Size 88 Registers Per Thread register/thread 250 Shared Memory Configuration Size Kbyte 102.40 Driver Shared Memory Per Block Kbyte/block 1.02 Dynamic Shared Memory Per Block Kbyte/block 98.30 Static Shared Memory Per Block byte/block 0 Threads thread 11,264 Waves Per SM 1.29 ---------------------------------------------------------------------- --------------- ------------------------------ WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the achieved 1 blocks per multiprocessor. 
        This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          2
  Block Limit Shared Mem                   block                          1
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           4
  Theoretical Occupancy                    %                           8.33
  Achieved Occupancy                       %                           8.34
  Achieved Active Warps Per SM             warp                        4.00

  WRN   This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.72
  SM Frequency                             cycle/nsecond               1.18
  Elapsed Cycles                           cycle                     19,602
  Memory [%]                               %                          54.45
  DRAM Throughput                          %                          54.45
  Duration                                 usecond                    16.51
  L1/TEX Cache Throughput                  %                          15.07
  L2 Cache Throughput                      %                          24.94
  SM Active Cycles                         cycle                  16,925.59
  Compute (SM) [%]                         %                          37.24

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.6
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            128
  Registers Per Thread                     register/thread               40
  Shared Memory Configuration Size         Kbyte                      32.77
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                 8.19
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    65,536
  Waves Per SM                                                         0.63

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          3
  Block Limit Shared Mem                   block                          3
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          60.95
  Achieved Active Warps Per SM             warp                       29.25

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (60.9%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel.
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.69
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                      4,780
  Memory [%]                               %                           1.05
  DRAM Throughput                          %                           0.40
  Duration                                 usecond                     3.97
  L1/TEX Cache Throughput                  %                           1.16
  L2 Cache Throughput                      %                           1.05
  SM Active Cycles                         cycle                   1,293.65
  Compute (SM) [%]                         %                           1.45

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             32
  Registers Per Thread                     register/thread               32
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     4,096
  Waves Per SM                                                         0.04

  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         16
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                           8.14
  Achieved Active Warps Per SM             warp                        3.91

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (8.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
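  [Editor's note] The launch-configuration warning above (32 blocks on a 68-SM GPU) is about grids smaller
  than the SM count. A hedged sketch, not taken from the profiled code: a grid-stride loop lets the grid be
  sized from the device's multiprocessor count rather than from the element count, so every SM can get at
  least one block even when n is small. The kernel, block size, and 4x cap below are illustrative choices.

  #include <algorithm>
  #include <cuda_runtime.h>

  __global__ void copy_kernel(const float* __restrict__ in, float* __restrict__ out, int n)
  {
      // Grid-stride loop: correctness does not depend on the grid size,
      // so the launch can be sized for the hardware instead of for n.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
          out[i] = in[i];
  }

  void launch_copy(const float* in, float* out, int n)
  {
      int device = 0, num_sms = 0;
      cudaGetDevice(&device);
      cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

      const int block = 128;
      int blocks_for_n = (n + block - 1) / block;
      // At least one block per SM (e.g. 68 on this GPU); cap at a few waves so
      // very large n does not create an enormous grid (design choice, not required).
      int grid = std::min(std::max(blocks_for_n, num_sms), num_sms * 4);
      copy_kernel<<<grid, block>>>(in, out, n);
  }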
void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:27, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.33
  SM Frequency                             cycle/nsecond               1.45
  Elapsed Cycles                           cycle                      5,480
  Memory [%]                               %                           1.08
  DRAM Throughput                          %                           0.59
  Duration                                 usecond                     3.78
  L1/TEX Cache Throughput                  %                           3.00
  L2 Cache Throughput                      %                           1.08
  SM Active Cycles                         cycle                     412.06
  Compute (SM) [%]                         %                           1.03

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           256
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             16
  Registers Per Thread                     register/thread               28
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     4,096
  Waves Per SM                                                         0.04

  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          8
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          6
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          16.65
  Achieved Active Warps Per SM             warp                        7.99

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (16.7%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.50
  SM Frequency                             cycle/nsecond               1.14
  Elapsed Cycles                           cycle                      7,346
  Memory [%]                               %                          34.80
  DRAM Throughput                          %                           2.29
  Duration                                 usecond                     6.43
  L1/TEX Cache Throughput                  %                          37.36
  L2 Cache Throughput                      %                          34.80
  SM Active Cycles                         cycle                   4,835.15
  Compute (SM) [%]                         %                          30.09

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.9
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            960
  Registers Per Thread                     register/thread               20
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                    61,440
  Waves Per SM                                                         0.88

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                          47.86
  Achieved Active Warps Per SM             warp                       22.97

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (47.9%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.68
  SM Frequency                             cycle/nsecond               1.31
  Elapsed Cycles                           cycle                     18,747
  Memory [%]                               %                          34.72
  DRAM Throughput                          %                          34.72
  Duration                                 usecond                    14.21
  L1/TEX Cache Throughput                  %                          17.82
  L2 Cache Throughput                      %                          25.63
  SM Active Cycles                         cycle                  13,817.29
  Compute (SM) [%]                         %                          42.30

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            272
  Registers Per Thread                     register/thread               26
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   139,264
  Waves Per SM                                                         1.33

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 27.1%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          4
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          72.91
  Achieved Active Warps Per SM             warp                       35.00

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (72.9%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
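  [Editor's note] The tail-effect warning above defines a wave as the number of blocks that can run
  concurrently. The host-side arithmetic below illustrates that definition using the figures reported for
  this launch (272 blocks of 512 threads, a 3-block-per-SM occupancy limit, 68 SMs). It is a simplified
  model; the exact partial-wave count ncu prints can differ slightly from this calculation.

  #include <cstdio>

  int main()
  {
      // Figures taken from the CatArrayBatchedCopy launch above.
      const int grid_size     = 272;  // blocks in the launch
      const int blocks_per_sm = 3;    // 3 blocks x 16 warps/block = 48 resident warps (the warp limit)
      const int num_sms       = 68;   // multiprocessors on this GPU

      const int blocks_per_wave  = blocks_per_sm * num_sms;             // 204
      const int full_waves       = grid_size / blocks_per_wave;         // 1
      const int tail_blocks      = grid_size % blocks_per_wave;         // 68 in this simplified model
      const double waves_per_sm  = (double)grid_size / blocks_per_wave; // ~1.33, matching "Waves Per SM"

      printf("blocks/wave=%d full=%d tail=%d waves/SM=%.2f\n",
             blocks_per_wave, full_waves, tail_blocks, waves_per_sm);
      return 0;
  }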
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.71
  SM Frequency                             cycle/nsecond               1.33
  Elapsed Cycles                           cycle                     13,942
  Memory [%]                               %                          40.67
  DRAM Throughput                          %                          40.67
  Duration                                 usecond                    10.43
  L1/TEX Cache Throughput                  %                          20.53
  L2 Cache Throughput                      %                          20.47
  SM Active Cycles                         cycle                  10,126.29
  Compute (SM) [%]                         %                          19.51

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.4
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             60
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.44

  WRN   The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.26
  Achieved Active Warps Per SM             warp                        3.97

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
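  [Editor's note] The cutlass entry above is capped at 2 resident blocks per SM by its ~49 KB of dynamic
  shared memory per block. For kernels you control, the occupancy API can show that limit before profiling.
  The sketch below is hedged: my_kernel and the 48 KB request are placeholders of a similar order, not the
  cutlass kernel itself, and the "2 blocks" expectation assumes a ~100 KB shared-memory carveout per SM as
  reported above.

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void my_kernel(float* data)
  {
      extern __shared__ float tile[];          // dynamic shared memory
      tile[threadIdx.x] = data[threadIdx.x];
      __syncthreads();
      data[threadIdx.x] = tile[threadIdx.x];
  }

  int main()
  {
      const int block_size   = 128;
      const size_t dyn_smem  = 48 * 1024;      // ~48 KB/block, similar order to the launch above

      int max_blocks = 0;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, block_size, dyn_smem);

      // With ~100 KB of shared memory per SM and ~49 KB per block (plus ~1 KB driver overhead),
      // only 2 blocks fit, matching the "Block Limit Shared Mem  block  2" row above.
      printf("max resident blocks per SM: %d\n", max_blocks);
      return 0;
  }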
void at::native::unrolled_elementwise_kernel, at::detail::Array, OffsetCalculator<(int)2, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:28, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.19
  SM Frequency                             cycle/nsecond               1.12
  Elapsed Cycles                           cycle                      6,405
  Memory [%]                               %                          10.07
  DRAM Throughput                          %                           3.73
  Duration                                 usecond                     5.73
  L1/TEX Cache Throughput                  %                           9.20
  L2 Cache Throughput                      %                          10.07
  SM Active Cycles                         cycle                   4,169.43
  Compute (SM) [%]                         %                           3.94

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            120
  Registers Per Thread                     register/thread               22
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.11

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                           7.23
  Achieved Active Warps Per SM             warp                        3.47

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (7.2%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array>(int, T2, T3), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.90
  SM Frequency                             cycle/nsecond               1.23
  Elapsed Cycles                           cycle                      3,934
  Memory [%]                               %                           6.09
  DRAM Throughput                          %                           6.09
  Duration                                 usecond                     3.20
  L1/TEX Cache Throughput                  %                           3.74
  L2 Cache Throughput                      %                           5.77
  SM Active Cycles                         cycle                   1,751.49
  Compute (SM) [%]                         %                           1.60

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            120
  Registers Per Thread                     register/thread               19
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     7,680
  Waves Per SM                                                         0.11

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                           6.87
  Achieved Active Warps Per SM             warp                        3.30

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.9%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
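  [Editor's note] Several of the entries above launch 64-thread blocks; with the 16-blocks-per-SM hardware
  limit that allows at most 32 resident warps (66.7% theoretical occupancy). For kernels you author (not the
  PyTorch kernels profiled here), the occupancy API can suggest a block size that lifts that ceiling. A
  hedged sketch with a placeholder kernel:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(float* x, float s, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= s;
  }

  int main()
  {
      int min_grid = 0, block = 0;
      // Suggests a block size that maximizes theoretical occupancy for this kernel
      // (no dynamic shared memory, no block-size limit).
      cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale_kernel, 0, 0);

      // With 64-thread blocks the 16-block-per-SM cap leaves only 32 resident warps;
      // the suggested size here is typically 128 or larger.
      printf("suggested block size: %d, minimum grid size: %d\n", block, min_grid);
      return 0;
  }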
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               6.80
  SM Frequency                             cycle/nsecond               1.05
  Elapsed Cycles                           cycle                      5,627
  Memory [%]                               %                           6.37
  DRAM Throughput                          %                           4.23
  Duration                                 usecond                     5.34
  L1/TEX Cache Throughput                  %                           8.10
  L2 Cache Throughput                      %                           6.37
  SM Active Cycles                         cycle                   3,136.24
  Compute (SM) [%]                         %                          10.47

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.3
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             60
  Registers Per Thread                     register/thread               28
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    30,720
  Waves Per SM                                                         0.29

  WRN   The grid for this launch is configured to execute only 60 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          4
  Block Limit Shared Mem                   block                          7
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          33.06
  Achieved Active Warps Per SM             warp                       15.87

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (33.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void ::softmax_warp_forward(T2 *, const T1 *, int, int, int), 2023-Apr-06 16:57:29, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.30
  SM Frequency                             cycle/nsecond               1.29
  Elapsed Cycles                           cycle                      4,266
  Memory [%]                               %                           1.27
  DRAM Throughput                          %                           0.71
  Duration                                 usecond                     3.30
  L1/TEX Cache Throughput                  %                           9.67
  L2 Cache Throughput                      %                           1.27
  SM Active Cycles                         cycle                     506.19
  Compute (SM) [%]                         %                           1.15

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             16
  Registers Per Thread                     register/thread               21
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     2,048
  Waves Per SM                                                         0.02

  WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         21
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                           8.22
  Achieved Active Warps Per SM             warp                        3.94

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (8.2%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void gemv2N_kernel, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float>>(T13), 2023-Apr-06 16:57:30, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.14
  SM Frequency                             cycle/nsecond               1.40
  Elapsed Cycles                           cycle                     11,988
  Memory [%]                               %                          31.83
  DRAM Throughput                          %                          31.83
  Duration                                 usecond                     8.51
  L1/TEX Cache Throughput                  %                          22.47
  L2 Cache Throughput                      %                          14.81
  SM Active Cycles                         cycle                      9,649
  Compute (SM) [%]                         %                          24.98

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                          1,024
  Registers Per Thread                     register/thread               45
  Shared Memory Configuration Size         Kbyte                      65.54
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           Kbyte/block                 2.56
  Threads                                  thread                   131,072
  Waves Per SM                                                         1.51

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        344 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 25.5%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         10
  Block Limit Shared Mem                   block                         18
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                          40
  Theoretical Occupancy                    %                          83.33
  Achieved Occupancy                       %                          62.06
  Achieved Active Warps Per SM             warp                       29.79

  WRN   This kernel's theoretical occupancy (83.3%) is limited by the number of required registers
        The difference between calculated theoretical (83.3%) and measured achieved occupancy (62.1%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.
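  [Editor's note] The gemv entry above is register-limited: 45 registers per thread cap theoretical occupancy
  at 83.3%. That kernel lives inside cuBLAS and cannot be changed, but for kernels you author the same limit
  can be nudged with __launch_bounds__, which asks the compiler to keep register use low enough for a target
  number of resident blocks. The sketch below is purely illustrative (my_gemv_like_kernel is not the cuBLAS
  kernel) and assumes a 64K-register SM.

  // Asking for 128-thread blocks with at least 12 resident blocks per SM tells the
  // compiler to aim for <= 65536 / (128 * 12) = 42 registers per thread. It may spill
  // to local memory to get there, so this is a trade-off to measure, not a guaranteed win.
  __global__ void __launch_bounds__(128, 12)
  my_gemv_like_kernel(const float* __restrict__ A,
                      const float* __restrict__ x,
                      float* __restrict__ y, int rows, int cols)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= rows) return;

      float acc = 0.f;
      for (int c = 0; c < cols; ++c)
          acc += A[row * cols + c] * x[c];   // row-major A, one row per thread (illustrative only)
      y[row] = acc;
  }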
void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:30, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.76
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                      5,227
  Memory [%]                               %                           3.17
  DRAM Throughput                          %                           3.04
  Duration                                 usecond                     4.35
  L1/TEX Cache Throughput                  %                           9.62
  L2 Cache Throughput                      %                           3.13
  SM Active Cycles                         cycle                   1,722.03
  Compute (SM) [%]                         %                          11.15

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            272
  Registers Per Thread                     register/thread               24
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   139,264
  Waves Per SM                                                         1.33

  WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on
        the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the
        theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of
        67 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the
        partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 37.9%.
        Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the
        number of full waves executed for a grid. See the Hardware Model
        (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
        more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          62.14
  Achieved Active Warps Per SM             warp                       29.83

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (62.1%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.
void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:30, Context 1, Stream 35

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.66
  SM Frequency                             cycle/nsecond               1.35
  Elapsed Cycles                           cycle                     11,965
  Memory [%]                               %                           4.42
  DRAM Throughput                          %                           3.34
  Duration                                 usecond                     8.86
  L1/TEX Cache Throughput                  %                          25.55
  L2 Cache Throughput                      %                           4.42
  SM Active Cycles                         cycle                     897.94
  Compute (SM) [%]                         %                           1.89

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                              8
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     1,024
  Waves Per SM                                                         0.06

  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.30
  Achieved Active Warps Per SM             warp                        3.98

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:30, Context 1, Stream 36

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.27
  SM Frequency                             cycle/nsecond               1.28
  Elapsed Cycles                           cycle                      8,307
  Memory [%]                               %                           3.43
  DRAM Throughput                          %                           1.92
  Duration                                 usecond                     6.46
  L1/TEX Cache Throughput                  %                          27.38
  L2 Cache Throughput                      %                           3.43
  SM Active Cycles                         cycle                     577.85
  Compute (SM) [%]                         %                           1.42

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                              8
  Registers Per Thread                     register/thread               90
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                49.15
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     1,024
  Waves Per SM                                                         0.06

  WRN   The grid for this launch is configured to execute only 8 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          2
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           8
  Theoretical Occupancy                    %                          16.67
  Achieved Occupancy                       %                           8.34
  Achieved Active Warps Per SM             warp                        4.00

  WRN   This kernel's theoretical occupancy (16.7%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void GRU_elementWise_fp(int, int, int, int, const T1 *, const T1 *, const T1 *, const T1 *, cudnn::reduced_divisor, T1 *, const T2 *, T2 *, T1 *, bool, bool, int), 2023-Apr-06 16:57:31, Context 1, Stream 36

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.19
  SM Frequency                             cycle/nsecond               1.43
  Elapsed Cycles                           cycle                      4,979
  Memory [%]                               %                           9.01
  DRAM Throughput                          %                           9.01
  Duration                                 usecond                     3.49
  L1/TEX Cache Throughput                  %                          10.79
  L2 Cache Throughput                      %                           8.04
  SM Active Cycles                         cycle                   1,325.50
  Compute (SM) [%]                         %                           2.95

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1
        full waves across all SMs. Look at Launch Statistics for more details.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           256
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             32
  Registers Per Thread                     register/thread               30
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                     8,192
  Waves Per SM                                                         0.08

  WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
        multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this
        kernel concurrently with other workloads, consider reducing the block size to have at least one block
        per multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
        See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
        description for more details on launch configurations.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          8
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          6
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          16.21
  Achieved Active Warps Per SM             warp                        7.78

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (16.2%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.

void at::native::::CatArrayBatchedCopy(T1 *, at::native::::CatArrInputTensorMetadata, at::native::::TensorSizeStride, int, T2), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.42
  SM Frequency                             cycle/nsecond               1.30
  Elapsed Cycles                           cycle                      5,537
  Memory [%]                               %                           4.43
  DRAM Throughput                          %                           4.01
  Duration                                 usecond                     4.26
  L1/TEX Cache Throughput                  %                          12.45
  L2 Cache Throughput                      %                           4.04
  SM Active Cycles                         cycle                   1,965.16
  Compute (SM) [%]                         %                          14.82

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            408
  Registers Per Thread                     register/thread               18
  Shared Memory Configuration Size         Kbyte                       8.19
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   208,896
  Waves Per SM                                                            2

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          5
  Block Limit Shared Mem                   block                          8
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          66.38
  Achieved Active Warps Per SM             warp                       31.86

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (66.4%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
        Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details
        on optimizing occupancy.

void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               8.48
  SM Frequency                             cycle/nsecond               1.29
  Elapsed Cycles                           cycle                     21,556
  Memory [%]                               %                          73.22
  DRAM Throughput                          %                          49.62
  Duration                                 usecond                    16.61
  L1/TEX Cache Throughput                  %                          74.44
  L2 Cache Throughput                      %                          73.22
  SM Active Cycles                         cycle                  18,925.29
  Compute (SM) [%]                         %                          45.04

  WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify
        the L2 bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing
        the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel
        fusion) or whether there are values you can (re)compute.
  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                            64
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                          5,419
  Registers Per Thread                     register/thread               20
  Shared Memory Configuration Size         Kbyte                      16.38
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          byte/block                     0
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                   346,816
  Waves Per SM                                                         4.98

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                         42
  Block Limit Shared Mem                   block                         16
  Block Limit Warps                        block                         24
  Theoretical Active Warps per SM          warp                          32
  Theoretical Occupancy                    %                          66.67
  Achieved Occupancy                       %                          54.33
  Achieved Active Warps Per SM             warp                       26.08

  WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
        This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
        The difference between calculated theoretical (66.7%) and measured achieved occupancy (54.3%) can be
        the result of warp scheduling overheads or workload imbalances during the kernel execution. Load
        imbalances can occur between warps within a block as well as across blocks of the same kernel. See the
        CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void cutlass::Kernel(T1::Params), 2023-Apr-06 16:57:31, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               9.10
  SM Frequency                             cycle/nsecond               1.40
  Elapsed Cycles                           cycle                     96,961
  Memory [%]                               %                          37.15
  DRAM Throughput                          %                          37.15
  Duration                                 usecond                    69.12
  L1/TEX Cache Throughput                  %                          28.94
  L2 Cache Throughput                      %                          26.52
  SM Active Cycles                         cycle                  61,675.54
  Compute (SM) [%]                         %                          18.52

  WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak
        performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak
        typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential
        reasons.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           128
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                             88
  Registers Per Thread                     register/thread              250
  Shared Memory Configuration Size         Kbyte                     102.40
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                98.30
  Static Shared Memory Per Block           byte/block                     0
  Threads                                  thread                    11,264
  Waves Per SM                                                         1.29

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor.
        This way, blocks that aren't waiting for __syncthreads() can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          2
  Block Limit Shared Mem                   block                          1
  Block Limit Warps                        block                         12
  Theoretical Active Warps per SM          warp                           4
  Theoretical Occupancy                    %                           8.33
  Achieved Occupancy                       %                           8.24
  Achieved Active Warps Per SM             warp                        3.96

  WRN   This kernel's theoretical occupancy (8.3%) is limited by the required amount of shared memory
        See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
        for more details on optimizing occupancy.

void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp, unsigned int, float, (int)4>>(T3), 2023-Apr-06 16:57:32, Context 1, Stream 7

  Section: GPU Speed Of Light Throughput
  ------------------------------------------------------------------------
  DRAM Frequency                           cycle/nsecond               7.88
  SM Frequency                             cycle/nsecond               1.20
  Elapsed Cycles                           cycle                     19,839
  Memory [%]                               %                          53.74
  DRAM Throughput                          %                          53.74
  Duration                                 usecond                    16.38
  L1/TEX Cache Throughput                  %                          15.01
  L2 Cache Throughput                      %                          24.61
  SM Active Cycles                         cycle                  17,002.96
  Compute (SM) [%]                         %                          36.82

  WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.6
        full waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  ------------------------------------------------------------------------
  Block Size                                                           512
  Function Cache Configuration                     cudaFuncCachePreferNone
  Grid Size                                                            128
  Registers Per Thread                     register/thread               40
  Shared Memory Configuration Size         Kbyte                      32.77
  Driver Shared Memory Per Block           Kbyte/block                 1.02
  Dynamic Shared Memory Per Block          Kbyte/block                 8.19
  Static Shared Memory Per Block           byte/block                    16
  Threads                                  thread                    65,536
  Waves Per SM                                                         0.63

  WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more
        than the achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads()
        can keep the hardware busy.

  Section: Occupancy
  ------------------------------------------------------------------------
  Block Limit SM                           block                         16
  Block Limit Registers                    block                          3
  Block Limit Shared Mem                   block                          3
  Block Limit Warps                        block                          3
  Theoretical Active Warps per SM          warp                          48
  Theoretical Occupancy                    %                            100
  Achieved Occupancy                       %                          60.83
  Achieved Active Warps Per SM             warp                       29.20

  WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between
        calculated theoretical (100.0%) and measured achieved occupancy (60.8%) can be the result of warp
        scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur
        between warps within a block as well as across blocks of the same kernel.
          See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
          for more details on optimizing occupancy.

  void at::native::::indexSelectLargeIndex(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, int, int, T3, T3, long), 2023-Apr-06 16:57:32, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    8.29
    SM Frequency                                            cycle/nsecond                    1.29
    Elapsed Cycles                                                  cycle                   5,240
    Memory [%]                                                          %                    0.98
    DRAM Throughput                                                     %                    0.62
    Duration                                                      usecond                    4.06
    L1/TEX Cache Throughput                                             %                    1.19
    L2 Cache Throughput                                                 %                    0.98
    SM Active Cycles                                                cycle                1,268.40
    Compute (SM) [%]                                                    %                    1.32
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                128
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                  32
    Registers Per Thread                                  register/thread                      32
    Shared Memory Configuration Size                                Kbyte                   16.38
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                   4,096
    Waves Per SM                                                                             0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 32 blocks, which is less than the GPU's 68
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                      16
    Block Limit Shared Mem                                          block                      16
    Block Limit Warps                                               block                      12
    Theoretical Active Warps per SM                                  warp                      48
    Theoretical Occupancy                                               %                     100
    Achieved Occupancy                                                  %                    8.26
    Achieved Active Warps Per SM                                     warp                    3.96
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (8.3%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
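The warning above flags a 32-block grid on a 68-SM GPU. As a rough illustration (not taken from the profiled PyTorch code), the host-side CUDA sketch below queries the multiprocessor count and pads the grid so at least one block can land on every SM; the kernel name, data, and problem size are placeholders, and a grid-stride loop keeps any extra blocks harmless.

#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder element-wise kernel; a grid-stride loop lets any grid size
// cover all n elements, so oversizing the grid is safe.
__global__ void scale_kernel(float* data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= 2.0f;
    }
}

int main() {
    int device = 0, smCount = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);  // 68 on the GPU above

    const int n = 4096;                                   // mirrors the 4,096 threads in the report
    const int blockSize = 128;
    int gridFromWork = (n + blockSize - 1) / blockSize;   // 32 blocks -> fewer blocks than SMs
    int grid = std::max(gridFromWork, smCount);           // at least one block per SM

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    scale_kernel<<<grid, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("launched %d blocks across %d SMs\n", grid, smCount);
    return 0;
}

For a tiny workload like this one the launch-overhead savings are negligible; the pattern mainly matters when the kernel is launched many times or when the grid can be grown to cover real work.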
  void at::native::::fused_dropout_kernel_vec(at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, at::cuda::detail::TensorInfo, T3, T2, at::PhiloxCudaState), 2023-Apr-06 16:57:32, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    8.38
    SM Frequency                                            cycle/nsecond                    1.30
    Elapsed Cycles                                                  cycle                   4,791
    Memory [%]                                                          %                    1.23
    DRAM Throughput                                                     %                    0.84
    Duration                                                      usecond                    3.68
    L1/TEX Cache Throughput                                             %                    2.88
    L2 Cache Throughput                                                 %                    1.23
    SM Active Cycles                                                cycle                  428.28
    Compute (SM) [%]                                                    %                    1.18
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                256
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                  16
    Registers Per Thread                                  register/thread                      28
    Shared Memory Configuration Size                                Kbyte                    8.19
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                   4,096
    Waves Per SM                                                                             0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 16 blocks, which is less than the GPU's 68
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                       8
    Block Limit Shared Mem                                          block                       8
    Block Limit Warps                                               block                       6
    Theoretical Active Warps per SM                                  warp                      48
    Theoretical Occupancy                                               %                     100
    Achieved Occupancy                                                  %                   16.54
    Achieved Active Warps Per SM                                     warp                    7.94
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit.
    WRN   The difference between calculated theoretical (100.0%) and measured achieved occupancy (16.5%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
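Several reports above show per-SM block limits set by registers, shared memory, and warps. Those "Block Limit" rows can be approximated outside the profiler with the CUDA occupancy API; the sketch below is a generic, hedged example against a placeholder kernel, not one of the PyTorch kernels listed here.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose register/shared-memory footprint the occupancy
// calculator will inspect.
__global__ void copy_half_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

int main() {
    // How many 256-thread blocks of this kernel can be resident per SM,
    // analogous to the smallest of the "Block Limit ..." rows above.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copy_half_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    // Block size the occupancy calculator itself would suggest for this kernel.
    int minGridSize = 0, suggestedBlock = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlock, copy_half_kernel, 0, 0);

    printf("resident blocks/SM at 256 threads: %d, suggested block size: %d\n",
           blocksPerSM, suggestedBlock);
    return 0;
}

Note that the API reports limits for the kernel as compiled in this program; the exact figures in the reports above depend on the PyTorch kernels' own register and shared-memory usage.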
  void at::native::unrolled_elementwise_kernel, OffsetCalculator<(int)1, unsigned int>, OffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T3, T4, T5, T6), 2023-Apr-06 16:57:33, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                          cycle/nsecond                    7.36
    SM Frequency                                            cycle/nsecond                    1.12
    Elapsed Cycles                                                  cycle                   7,276
    Memory [%]                                                          %                   35.26
    DRAM Throughput                                                     %                    2.57
    Duration                                                      usecond                    6.50
    L1/TEX Cache Throughput                                             %                   37.75
    L2 Cache Throughput                                                 %                   35.26
    SM Active Cycles                                                cycle                4,770.31
    Compute (SM) [%]                                                    %                   30.43
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.9 full
          waves across all SMs. Look at Launch Statistics for more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                 64
    Function Cache Configuration                                          cudaFuncCachePreferNone
    Grid Size                                                                                 960
    Registers Per Thread                                  register/thread                      20
    Shared Memory Configuration Size                                Kbyte                   16.38
    Driver Shared Memory Per Block                            Kbyte/block                    1.02
    Dynamic Shared Memory Per Block                            byte/block                       0
    Static Shared Memory Per Block                             byte/block                       0
    Threads                                                        thread                  61,440
    Waves Per SM                                                                             0.88
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                  block                      16
    Block Limit Registers                                           block                      42
    Block Limit Shared Mem                                          block                      16
    Block Limit Warps                                               block                      24
    Theoretical Active Warps per SM                                  warp                      32
    Theoretical Occupancy                                               %                   66.67
    Achieved Occupancy                                                  %                   47.21
    Achieved Active Warps Per SM                                     warp                   22.66
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared memory
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that can fit on the SM
    WRN   The difference between calculated theoretical (66.7%) and measured achieved occupancy (47.2%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
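The "Waves Per SM" value in this last report appears consistent with grid size divided by (resident blocks per SM x SM count): 960 / (16 x 68) is roughly 0.88. The small host-side check below redoes that arithmetic with the numbers taken from the report; the formula itself is an inference from the figures, not something the profiler output states, and the 68-SM count comes from the launch warnings earlier in the log.

#include <cstdio>

int main() {
    // Values copied from the report for the 64-thread-block kernel above.
    const double gridSize   = 960.0;  // "Grid Size"
    const double blockLimit = 16.0;   // smallest of the "Block Limit ..." rows
    const double smCount    = 68.0;   // multiprocessor count reported for this GPU

    // Assumed relation: one "wave" is one full set of resident blocks across all SMs.
    const double wavesPerSM = gridSize / (blockLimit * smCount);
    printf("waves per SM ~= %.2f\n", wavesPerSM);  // ~0.88, matching the report
    return 0;
}

The same relation also matches the earlier reports (for example, 5,419 / (16 x 68) gives about 4.98), which is why a sub-1.0 value here lines up with the "grid is too small" warning.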