I am new to nvprof. I am using kernel profiling to see where my kernel spends most of its time. About 33% of the time is attributed to “other”. Can someone help me understand what this category means (e.g., give me some hints about what it could be)?
To give some background, I have modified a sample matrix multiplication to use a persistent thread model. I have basically introduced some atomicAdd() calls and a few writes to host pinned memory. Due to these changes, I see a 30% overhead compared to the baseline simple matrix multiplication. The “other” category is 17% in the baseline, whereas in the persistent thread model it is 33%. Can someone tell me how to figure out where this overhead comes from?
Question: Can atomicAdd() take up to 5 µs under heavy contention? (At most 30 threads will be calling atomicAdd() at the same time across the whole GPU.)
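To make the atomicAdd() question concrete, here is a minimal sketch of how the latency could be measured in isolation. It is not my actual kernel. The names (atomicLatency, CHECK, the buffer names) are made up for illustration, and it assumes the worst case where all 30 threads hit the same counter in global memory. It launches 30 blocks, lets thread 0 of each block call atomicAdd() in a loop, and times the loop with clock64(). The result is an average per call (including a little loop overhead), so it only gives a rough estimate, and the conversion to microseconds assumes the GPU runs at the clock rate reported by the runtime.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Thread 0 of every block hammers one shared counter and records the
// average number of SM clock cycles per atomicAdd() call.
__global__ void atomicLatency(unsigned int *counter, long long *cyclesPerOp,
                              unsigned int *sinkOut, int iters)
{
    if (threadIdx.x != 0) return;

    unsigned int sink = 0;
    long long start = clock64();
#pragma unroll 1  // keep one atomic per iteration so calls are not batched
    for (int i = 0; i < iters; ++i) {
        // Consuming the return value makes the thread wait for the result.
        sink += atomicAdd(counter, 1u);
    }
    long long stop = clock64();

    cyclesPerOp[blockIdx.x] = (stop - start) / iters;
    sinkOut[blockIdx.x] = sink;  // keep the returned values live
}

int main()
{
    const int blocks = 30;     // assumption: one contending thread per block
    const int iters  = 10000;  // arbitrary repetition count

    unsigned int *dCounter, *dSink;
    long long *dCycles;
    CHECK(cudaMalloc(&dCounter, sizeof(unsigned int)));
    CHECK(cudaMalloc(&dSink, blocks * sizeof(unsigned int)));
    CHECK(cudaMalloc(&dCycles, blocks * sizeof(long long)));
    CHECK(cudaMemset(dCounter, 0, sizeof(unsigned int)));

    atomicLatency<<<blocks, 32>>>(dCounter, dCycles, dSink, iters);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    long long hCycles[blocks];
    unsigned int hCounter = 0;
    CHECK(cudaMemcpy(hCycles, dCycles, sizeof(hCycles), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&hCounter, dCounter, sizeof(hCounter), cudaMemcpyDeviceToHost));

    // Clock rate in kHz as reported by the runtime; the actual clock may differ.
    int clockKHz = 0;
    CHECK(cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, 0));

    // Report the slowest block's average.
    long long worst = 0;
    for (int b = 0; b < blocks; ++b)
        if (hCycles[b] > worst) worst = hCycles[b];

    printf("counter = %u (expected %d)\n", hCounter, blocks * iters);
    printf("worst average: %lld cycles per atomicAdd (~%.3f us at %d kHz)\n",
           worst, worst * 1000.0 / clockKHz, clockKHz);

    CHECK(cudaFree(dCounter));
    CHECK(cudaFree(dSink));
    CHECK(cudaFree(dCycles));
    return 0;
}
```

Compiling with nvcc and running this once with blocks = 1 and once with blocks = 30 should show how much of the per-call time comes from contention rather than from the atomic itself. The 30 blocks are not guaranteed to start at exactly the same moment, so the contention is only approximate.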