How to use discard_memory?

#include <cuda/annotated_ptr>
__device__ int compute(int* scratch, size_t N);

__global__ void kernel(int const* in, int* out, int* scratch, size_t N) {
    // Each thread reads N elements into the scratch pad:
    for (size_t i = 0; i < N; ++i) {
        size_t idx = threadIdx.x + i * blockDim.x;
        scratch[idx] = in[idx];
    }
    __syncthreads(); // the scratch pad must be fully written before any thread computes on it

    // All threads compute on the scratch pad:
    int result = compute(scratch, N);

    // All threads discard the scratch pad memory to _hint_ that it does not need to be flushed from the cache:
    cuda::discard_memory(scratch + threadIdx.x * N, N * sizeof(int));

    out[threadIdx.x] = result;
}

Well, if different blocks share the same region of data, and one block finishes computing and discards it, then the other blocks will have no data left to use! I guess we would need something like a sync across all blocks before discarding, but that sounds wasteful… So how can I actually benefit from discard_memory?

Thank you!!!

You cannot discard a memory range used by multiple blocks unless all those blocks are done working with the memory. It would be a race condition. The same applies for threads within the same block.

Effects: equivalent to memset(ptr, _indeterminate_, nbytes)

The sample kernel above assumes only 1 block.
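To make the same pattern safe with multiple blocks, one option (a sketch, not a definitive recipe) is to give each block its own disjoint slice of the scratch buffer, so a block only ever discards memory that no other block touches. The slicing scheme below is an assumption for illustration; the kernel must be launched with a scratch allocation of at least `gridDim.x * blockDim.x * N` ints.

```cuda
#include <cuda/annotated_ptr>
__device__ int compute(int* scratch, size_t N);

__global__ void kernel(int const* in, int* out, int* scratch, size_t N) {
    // Each block works on its own disjoint slice of scratch, so
    // discarding that slice cannot race with other blocks.
    size_t block_offset   = (size_t)blockIdx.x * blockDim.x * N;
    int*       block_scratch = scratch + block_offset;
    int const* block_in      = in + block_offset;

    // Each thread reads N elements into this block's slice:
    for (size_t i = 0; i < N; ++i) {
        size_t idx = threadIdx.x + i * blockDim.x;
        block_scratch[idx] = block_in[idx];
    }
    __syncthreads(); // slice fully written before compute reads it

    int result = compute(block_scratch, N);
    __syncthreads(); // all threads in this block are done reading the slice

    // Safe: no thread outside this block ever touches this slice.
    cuda::discard_memory(block_scratch + threadIdx.x * N, N * sizeof(int));

    out[blockIdx.x * blockDim.x + threadIdx.x] = result;
}
```

If the blocks genuinely must share one region, you would need a grid-wide synchronization (e.g. cooperative groups' grid sync, or a separate kernel launch) before the discard, which usually costs more than the hint saves.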

Emmm… So discard_memory is only useful when the data is used within a single block (or when all users are known to be finished)… that sounds strange…