I have written a sample CUDA program that launches a single thread and reads data from an array based on an offset size, and I have profiled the code using nvprof. I am interested in L2 events such as l2_subpk_total_read_sector_queries, where k is the slice number. By definition, this event gives the aggregated number of L2 read queries for that particular slice. However, if I turn aggregate mode off, each event is split into 16 sub-events. For example, l2_subp0_total_read_sector_queries is split into l2_subp0_total_read_sector_queries(n), where n ranges from 0 to 15.
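For reference, here is a minimal sketch of the kind of program described above. It is not my exact code: the kernel name `strided_read`, the array size, and the stride of 8 elements are placeholder assumptions chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread walks the array, reading every
// `stride`-th element. The sum is written out so the compiler cannot
// optimize the loads away.
__global__ void strided_read(const int *data, size_t n, size_t stride,
                             long long *out)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i += stride)
        sum += data[i];
    *out = sum;
}

int main()
{
    const size_t n = 1 << 20;   // number of ints in the array (assumed)
    const size_t stride = 8;    // offset between reads, in elements (assumed)

    int *d_data = nullptr;
    long long *d_out = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(long long));
    cudaMemset(d_data, 0, n * sizeof(int));

    // Launch exactly one block containing one thread.
    strided_read<<<1, 1>>>(d_data, n, stride, d_out);
    cudaDeviceSynchronize();

    long long h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("sum = %lld\n", h_out);

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```

Assuming the binary is built as `./a.out`, the per-unit values can be collected with something like `nvprof --aggregate-mode-off --events l2_subp0_total_read_sector_queries ./a.out`. The `--aggregate-mode-off` flag is placed before `--events` because it applies to the events listed after it.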
I am confused about what n signifies. The description of aggregate-mode off says that it reports the event values for every unit available for that particular event. Does that mean that slices 0 and 1 of the L2 are each further divided into 16 sub-slices, for a total of 32 sub-slices? I am currently using a P100 GPU for my work. Please let me know if I need to provide any further details or clarification. Thank you.