Can L2 cache persistant policy be changed when kernel is running?

Hi! I am trying L2 cache persistance policy, and have two questions:

  1. Why in the official guidance (in 3.2.3.6 section) only provide “stream” or “graph” pattern? I am writing a matrix multiply kernel and actually, I only need the Default Stream. Do I need to especially create a new “stream”?

  2. Can I change the L2 cache persistance policy during my kernel running? I mean, firstly I multiply A and B, because B is small, I can cache B in L2 cache. Later I want to multiply C still in this kernel! But the previous B is useless now. Can I write some code within kernel now to throw away B?

Thank you!!!

You can link directly to a section in the programming guide. Section 3.6 doesn’t seem to be relevant. Maybe you meant 3.2.3.6? You can link to that section! Make it easy for the rest of us.

The L2 cache persistence depends on a named/created stream.

You can’t do anything about it directly from kernel code. However the cache carveout will naturally evict things you are no longer accessing, as you replace them with things you are accessing. This is typical cache behavior. So allocate B and C using the same pointer. Then use this pointer in the example given in section 3.2.3.4 (notice how that link takes you directly to section 3.2.3.4)

1 Like

Yes! Thank you! I changed my link.

So for my question, I guess, after initially set B as persistant L2 cache, later when I access C, B will automatically be evicted…I see something in official guidance said: automatic evict is not reliable…Anyway, maybe I can not do anything here.

For another question, yes, I understand now, maybe just need to create a “stream”. NV company may be just want to use one grammar to cover both default stream case and multiple stream case.

Thank you!!!

wait, you said, allocate B and C using the same pointer…? Well, interesting idea!
But actually I am creating the data in pytorch. I get the B and C’s pointer, and then access the data…In your suggestion, seems you can create B, and do B’s calculation, and then create C exactly start from B’s pointer! Interesting! But I not sure whether can I use that magic!

Thank you!!!

Suppose my B matrix is 1024x1024. And suppose my C matrix is 1024x1024. Allocate both with the same pointer:

cudaMalloc(&d_BC, 1024*1024*2*sizeof(B[0]));

Then when you need to refer to B, just use d_BC

When you need to refer to C, use d_BC+(1024*1024). i.e. pointer arithmetic.

Now both the B and C data are referred to via single pointer.

I don’t know of any reason you cannot create a torch tensor of twice the size, and refer to a specific point in a torch tensor. But if you are asking questions about pytorch you are in the wrong place.

1 Like

Well, maybe we can set two data section in L2 cache as persistant? Is it necessary to allocate them continuous like your way? (Well, maybe you are right, we can only use one persistant memory location???)
Also, even your way, the first B will still take up L2 cache space…Chances are, when we are reading C, data B still take L2’s space and do not be evicted…

Thank you!!

Sure. Then you would have a carveout for B. and a separate carveout for C. And as we have already covered, there is no way to “dump” a carveout from kernel code. So you are stuck with the carveout for B, even when you don’t need it any more within your monolithic kernel.

No, it’s not necessary. I was trying to offer an idea as to how data could be naturally aged out of a carveout during the processing of a single monolithic kernel. Combining them this way allows a single carveout with a single policy to cover both.

I don’t follow your concern. I don’t know of any cache that works that way. If an area of memory is cacheable, and you retrieve data from that region, the cache will find a place for it, evicting “older” cachelines as needed. So it seems like you are imagining that I would read data from C, and an older cacheline from B is resident in the cache, and so somehow that older cachline would be kept and the read from C would not be cached?

That doesn’t sound like cache behavior to me.

If B is taking space in L2 (lets say in a carveout), then the only reason for that is because it has not been evicted yet. And if it has not been evicted yet, then the cache mechanism has decided there is no need to evict it. But there would be a need to evict it if the cache is full, and a new request is being made. So I don’t follow your concern.

If this all is a concern for you, maybe your approach of a single monolithic kernel is not so great. Break it into a kernel that processes B, then another that processes C. You can separate your cache management that way.

1 Like

Thank you!!!