I need a big chunk of temporary working memory for each resident thread

Hello,

I have a situation where I need about 10 MB of working memory for each thread, and after execution the space can be reused. The number of threads is on the order of 1 million. The size of the array is the same for each thread and known at compile time.

If the amount were only about 1 KB, I would just have a fixed-size array in the code and it could go on the stack, but that’s not an option for 10 MB.

If the number of threads were much smaller, I would pre-allocate an array of (number of threads) * (10 MB), and each thread would have its own space, but of course with a million threads such an array would be about 10 TB, so that’s no good either.

What I’m doing now is pre-allocating a pool of about 2 * (maximum number of resident threads) * (10 MB), and each thread uses the slot at byte offset (10 MB) * ((thread index) % (2 * (maximum number of resident threads))). However, I understand that this is a bad idea, since the order in which threads begin and end is not strictly guaranteed, so in principle two active threads might get the same slot and corrupt each other’s part of the buffer.
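In code, the scheme I have now looks roughly like this (a sketch with assumed names and sizes, not my actual code):

```
#include <cuda_runtime.h>

constexpr size_t SLOT_FLOATS  = (10u * 1024u * 1024u) / sizeof(float); // ~10 MB slot
constexpr int    MAX_RESIDENT = 1024;              // assumed bound on resident threads
constexpr int    NUM_SLOTS    = 2 * MAX_RESIDENT;  // pool = NUM_SLOTS * 10 MB

__global__ void kernel(float *scratch_pool, size_t num_items)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid >= num_items) return;

    // The unsafe step: slot reuse is not synchronized with thread retirement,
    // so two threads that are resident at the same time can map to one slot.
    float *my_scratch = scratch_pool + (tid % NUM_SLOTS) * SLOT_FLOATS;

    my_scratch[0] = (float)tid;  // placeholder for steps 1 and 2
}
```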

Is there a proper way to solve this problem? Thanks.

Crunching through TBs of data would be an issue even with pure host-side processing. That suggests that other people in your field may already have created software for the necessary out-of-core processing of such data, and you may be able to augment such existing solutions with GPU acceleration.

(1) Can you compress the data? A first step would be to choose the smallest data type(s) needed to store it. If the data is naturally sparse, research which of the common sparse data storage schemes might be the most appropriate for this use case (a generic CSR sketch follows after this list).

(2) Can you significantly reduce the number of threads per kernel launch and trade that off against a commensurate increase in the number of kernel launches? (A chunked-launch sketch follows after this list.)

(3) Can you implement some simple memory management (a buffering scheme, basically) that pulls data off SSDs and stages it in system memory for subsequent use by the GPU code? (A double-buffering sketch follows after this list.)
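For point (1), one common sparse scheme is CSR (compressed sparse row); the layout below is a generic sketch with illustrative names, not tied to any particular library:

```
// A dense rows x cols matrix with nnz nonzeros stored as three arrays:
// the nonzero values, their column indices, and per-row offsets.
struct CsrMatrix {
    int    rows, cols, nnz;
    float *values;       // nnz nonzero values
    int   *col_indices;  // nnz column indices, one per stored value
    int   *row_offsets;  // rows + 1 entries; row r occupies
                         // [row_offsets[r], row_offsets[r + 1])
};
```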
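For point (2), a minimal sketch of the trade-off (all names and sizes are assumptions): size the scratch pool for a single launch’s worth of threads, then loop on the host so each launch covers the next chunk of work items.

```
#include <cuda_runtime.h>

constexpr size_t SLOT_FLOATS        = (10u * 1024u * 1024u) / sizeof(float);
constexpr int    THREADS_PER_LAUNCH = 1024;  // 1024 x 10 MB = 10 GB pool (assumed to fit)

__global__ void process_chunk(float *scratch_pool, size_t base, size_t total)
{
    size_t local = blockIdx.x * (size_t)blockDim.x + threadIdx.x; // launch-local index
    size_t item  = base + local;                                  // global work item
    if (item >= total) return;

    // At most THREADS_PER_LAUNCH slots are live at once, and each thread in
    // a launch gets a distinct one, so no two live threads can share a slot.
    float *my_scratch = scratch_pool + local * SLOT_FLOATS;

    my_scratch[0] = (float)item;  // placeholder for steps 1 and 2
}

int main()
{
    const size_t total = 1000000;  // ~1M work items, as in the question
    float *pool;
    cudaMalloc(&pool, (size_t)THREADS_PER_LAUNCH * SLOT_FLOATS * sizeof(float));
    for (size_t base = 0; base < total; base += THREADS_PER_LAUNCH)
        process_chunk<<<THREADS_PER_LAUNCH / 256, 256>>>(pool, base, total);
    cudaDeviceSynchronize();
    cudaFree(pool);
    return 0;
}
```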
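For point (3), a minimal double-buffering sketch (the file name, chunk size, and kernel are all placeholders): while the GPU consumes one chunk, the host reads the next chunk from disk into the other pinned staging buffer, overlapping I/O with compute.

```
#include <cstdio>
#include <cuda_runtime.h>

constexpr size_t CHUNK_BYTES = 256u * 1024u * 1024u;  // assumed staging chunk size

__global__ void consume(const char *chunk, size_t n) { (void)chunk; (void)n; /* process one chunk */ }

int main()
{
    char *stage[2], *dev;
    cudaHostAlloc(&stage[0], CHUNK_BYTES, cudaHostAllocDefault); // pinned host memory,
    cudaHostAlloc(&stage[1], CHUNK_BYTES, cudaHostAllocDefault); // required for async copies
    cudaMalloc(&dev, CHUNK_BYTES);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    FILE *f = fopen("data.bin", "rb");  // hypothetical input file
    if (!f) return 1;
    size_t n;
    for (int k = 0; (n = fread(stage[k & 1], 1, CHUNK_BYTES, f)) > 0; ++k) {
        // Wait for the previous chunk's copy + kernel before overwriting the
        // device buffer (the previous iteration's sync already made this
        // iteration's staging buffer safe to refill); the next fread then
        // overlaps with this chunk's GPU work.
        cudaStreamSynchronize(stream);
        cudaMemcpyAsync(dev, stage[k & 1], n, cudaMemcpyHostToDevice, stream);
        consume<<<64, 256, 0, stream>>>(dev, n);
    }
    cudaStreamSynchronize(stream);
    fclose(f);
    return 0;
}
```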

Hi njuffa,

Thanks for the reply, but I think I explained the problem badly. There aren’t actually terabytes of data that I need to read in or store at once. Basically, the kernel looks like this (sketched in code after the list):

  1. Compute about 10 MB of intermediate data and store it in the temporary array.
  2. Do some computations that use the 10 MB of data. This part loops through the 10 MB array many times, so it’s not practical to just compute the data from step 1 on the fly.
  3. The 10 MB array is no longer needed and the space can be reused by another thread.
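In code, that shape is roughly the following (the names and placeholder computations are illustrative); my_scratch points at this thread’s private ~10 MB slot, however the pool ends up being carved up:

```
constexpr size_t SLOT_FLOATS = (10u * 1024u * 1024u) / sizeof(float);

__device__ float do_work(float *my_scratch, int item)
{
    // Step 1: compute ~10 MB of intermediate values into the scratch slot.
    for (size_t i = 0; i < SLOT_FLOATS; ++i)
        my_scratch[i] = (float)item + (float)i;  // placeholder computation

    // Step 2: many passes over the scratch data (too costly to recompute).
    float acc = 0.0f;
    for (int pass = 0; pass < 100; ++pass)
        for (size_t i = 0; i < SLOT_FLOATS; ++i)
            acc += my_scratch[i];

    // Step 3: the slot is no longer needed; once this thread is done, the
    // same slot can safely be handed to another work item.
    return acc;
}
```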

A possibly similar question was discussed here. In the response I provided there, the links to the SO topics discuss one possible method to avoid the hazard you mention.

This looks pretty close to what I am trying to do. Thanks!

I don’t remember whether I mentioned it in the previous postings or not, but using a grid-stride-loop kernel design can also be a fairly simple way to limit the number of active threads and provide a dedicated scratch block for each (see the sketch below).
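To illustrate (my sketch, with assumed sizes): fix the grid at NUM_SLOTS threads, give every thread a private slot keyed by its global thread index (which is stable for the whole kernel), and let each thread walk the ~1M work items with a grid-sized stride, reusing its slot on every iteration.

```
#include <cuda_runtime.h>

constexpr size_t SLOT_FLOATS = (10u * 1024u * 1024u) / sizeof(float);
constexpr int    NUM_SLOTS   = 1024;  // grid size; 1024 x 10 MB = 10 GB pool (assumed to fit)

__global__ void grid_stride_kernel(float *scratch_pool, size_t total_items)
{
    size_t tid    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;        // total threads in the grid
    float *my_scratch = scratch_pool + tid * SLOT_FLOATS;  // private, never shared

    // The grid is deliberately small, so each thread loops over many work
    // items, reusing its own scratch slot each time; no aliasing hazard.
    for (size_t item = tid; item < total_items; item += stride) {
        my_scratch[0] = (float)item;  // placeholder: fill the slot (step 1),
                                      // loop over it many times (step 2);
                                      // the slot recycles on the next iteration (step 3)
    }
}

int main()
{
    float *pool;
    cudaMalloc(&pool, (size_t)NUM_SLOTS * SLOT_FLOATS * sizeof(float));
    grid_stride_kernel<<<NUM_SLOTS / 256, 256>>>(pool, 1000000);
    cudaDeviceSynchronize();
    cudaFree(pool);
    return 0;
}
```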