How to communicate between blocks?

202476410arsmart · March 1, 2024, 4:06pm

Suppose I have one SM and two blocks can run on it at the same time. Will the intermediate result of one block be written into the L1 cache (not shared memory), or will it be written directly back to L2? And can the other block read this intermediate result from L1?

I ask because I read somewhere that different blocks can only communicate using global memory…

striker159 · March 1, 2024, 4:37pm

Different blocks can communicate via global memory. On Hopper, blocks in the same cluster can additionally use distributed shared memory.
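As a rough illustration of communication through global memory, here is a minimal sketch in which every block publishes a partial sum and the last block to finish reads all of them. All names (`done_count`, `sum_with_handoff`, `run_handoff_example`, `partial`) are hypothetical and made up for this example. It assumes the usual pattern of `__threadfence()` plus an atomic counter to order the writes, and it reads the partial results through a `volatile` pointer so they are not served from a stale cached copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical counter: how many blocks have published their partial result.
__device__ unsigned int done_count = 0;

__global__ void sum_with_handoff(const int* in, int n, int* partial, int* out)
{
    __shared__ int  block_sum;
    __shared__ bool is_last;
    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    // Grid-stride loop: each thread accumulates its share of the input.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += in[i];
    atomicAdd(&block_sum, local);
    __syncthreads();

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = block_sum;  // publish to global memory
        __threadfence();                  // order the write before the signal
        unsigned int ticket = atomicAdd(&done_count, 1u);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block to arrive reads what the other blocks wrote.
    if (is_last && threadIdx.x == 0) {
        const volatile int* p = partial;  // avoid reading a stale cached value
        int total = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += p[b];
        *out = total;
        done_count = 0;                   // reset so the kernel can be relaunched
    }
}

// Hypothetical host helper: sums n ones on the GPU and returns the result.
int run_handoff_example(int n)
{
    const int threads = 256, blocks = 8;
    int *in, *partial, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&partial, blocks * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;

    sum_with_handoff<<<blocks, threads>>>(in, n, partial, out);
    cudaDeviceSynchronize();

    int result = *out;
    cudaFree(in); cudaFree(partial); cudaFree(out);
    return result;
}
```

Note that this sketch never makes one block wait for another; it only lets whichever block finishes last consume the results, which avoids any assumption about which blocks are resident at the same time.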

CU_Steve · March 1, 2024, 6:31pm

I think the answer is the write goes to both L1 and L2.

See Robert’s answer:

202476410arsmart · March 2, 2024, 2:25am

So, if block 1 writes a value back to global memory, and this value is also written into L1, then block 2, which happens to be on the same SM, can hit this intermediate result in L1, right?

202476410arsmart · March 2, 2024, 2:28am

Also, I am wondering: can two blocks from different streams run on the same SM? Can they reuse data within one L1?

CU_Steve · March 2, 2024, 3:05am

I wonder the same things, but I do not know the answer to these questions.

This is about all I rely on:

Robert_Crovella · March 4, 2024, 2:11pm

Yes, blocks residing on the same SM share the same L1. If block A, on SM X, writes to global memory, and block B, on SM X, later reads from that same location in global memory, my expectation is that block B will hit in the L1, on the value that was written by block A.

Yes, two blocks from the same host process can be coresident on the same SM. They will share the L1 as already indicated.
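If you want a rough idea of how many blocks of a given kernel could be coresident on one SM, the runtime's occupancy query can tell you. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the kernel `dummy_kernel` and the helper `max_coresident_blocks` are hypothetical names for illustration. The number is an upper bound for that one kernel, not a promise about where the scheduler actually places any particular block.

```cuda
#include <cuda_runtime.h>

// Hypothetical trivial kernel, used only as the subject of the occupancy query.
__global__ void dummy_kernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Hypothetical helper: how many blocks of dummy_kernel with this block size
// can be resident on one SM at the same time (no dynamic shared memory).
int max_coresident_blocks(int block_size)
{
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, dummy_kernel, block_size, 0);
    return blocks_per_sm;
}
```

A result of 2 or more means that two blocks of this kernel can share an SM, and therefore its L1, at least in principle.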

Curefab · March 8, 2024, 10:38pm

BTW, for better sync options between two blocks of the same kernel or of different kernels, you could put the code of both into one block of a single kernel and branch on the thread index, e.g.

// run for example with 1024 threads per block

if (threadIdx.x < 512) {
    // code 1
    int myidx = threadIdx.x; // 0..511
} else {
    // code 2
    int myidx = threadIdx.x - 512; // 0..511
}
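To make that concrete, here is a minimal sketch of a single 1024-thread block whose first half produces values and whose second half consumes them through shared memory. The kernel name `producer_consumer`, the buffer `stage`, and the helper `run_split_block_example` are hypothetical, and the work done in each half is just a placeholder.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: launch with exactly 1024 threads in one block.
__global__ void producer_consumer(int* out)
{
    constexpr int HALF = 512;
    __shared__ int stage[HALF];         // hand-off buffer between the two halves

    if (threadIdx.x < HALF) {
        // code 1: threads 0..511 produce
        int myidx = threadIdx.x;        // 0..511
        stage[myidx] = myidx * myidx;
    }

    // The barrier sits outside the branch so that all 1024 threads reach it.
    __syncthreads();

    if (threadIdx.x >= HALF) {
        // code 2: threads 512..1023 consume
        int myidx = threadIdx.x - HALF; // 0..511
        out[myidx] = stage[myidx] + 1;
    }
}

// Hypothetical host helper: runs the kernel and returns one output element.
int run_split_block_example(int idx)
{
    int* out;
    cudaMallocManaged(&out, 512 * sizeof(int));
    producer_consumer<<<1, 1024>>>(out);
    cudaDeviceSynchronize();
    int value = out[idx];
    cudaFree(out);
    return value;
}
```

The point of the design is that `__syncthreads()` and shared memory give the two halves a cheap, well-defined hand-off, which two separate blocks do not have.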

202476410arsmart · March 9, 2024, 1:40am

Exactly! I am thinking about matmul. If we have enough work, maybe it is better to have one huge block that fills a whole SM!
