What about the "broadcast" mechanism in shared memory?

Besides avoiding bank conflicts, can it also notify the other threads reading an address when one thread writes to that address? For example, say all 16 threads are reading "from an address within the same 32-bit word." When thread 0 changes the value at that address, will threads 1~15 get notified? I guess not? If I want to detect the value change, is there any way other than checking it in a busy-wait loop?

I don’t think the “broadcasting” notifies threads. It just allows all threads in a half-warp to read the same 32-bit word in shared memory without a bank conflict. You can set some sort of flag in shared memory which the other threads can see, but there is no notification scheme built into CUDA.

Keep in mind that the entire warp executes the same instruction (some threads possibly masked out), so within a warp no notification is needed when an smem location is modified. For the entire threadblock, you’d have to use __syncthreads() to make sure that changes written by one thread are visible in all other warps.
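
To make the __syncthreads() point concrete, here is a minimal sketch of one thread writing a value to shared memory and the rest of the block reading it after a barrier. The names publish_and_read and count_mismatches, the value 42, and the block size of 64 (two warps) are all made up for illustration; this is just one way to do it, not something taken from the CUDA docs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 publishes a value through shared memory,
// then every thread in the block reads it back.
__global__ void publish_and_read(int value, int *out)
{
    __shared__ int shared_value;

    if (threadIdx.x == 0) {
        shared_value = value;  // single writer
    }

    // Barrier: every thread in the block waits here, and the write above
    // is visible to all warps in the block once the barrier is passed.
    __syncthreads();

    // All threads read the same 32-bit word, so the read is served by the
    // broadcast mechanism instead of being serialized as a bank conflict.
    out[blockIdx.x * blockDim.x + threadIdx.x] = shared_value;
}

// Host helper (illustrative): launch one block and count the threads that
// did not see the published value. Returns -1 on a CUDA error.
int count_mismatches(int threads_per_block)
{
    const int value = 42;  // arbitrary test value
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, threads_per_block * sizeof(int)) != cudaSuccess) {
        return -1;
    }

    publish_and_read<<<1, threads_per_block>>>(value, d_out);

    std::vector<int> h_out(threads_per_block);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 threads_per_block * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        return -1;
    }

    int mismatches = 0;
    for (int v : h_out) {
        if (v != value) {
            ++mismatches;
        }
    }
    return mismatches;
}

int main()
{
    // 64 threads = two warps, so the second warp depends on the barrier.
    int mismatches = count_mismatches(64);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Note that there is still no notification here: the readers simply wait at the barrier until the writer has arrived, and only then read the value.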