Extension of local memory into global memory

Hi

I was wondering if there’s any extension to retrieve the ID/number of the streaming multiprocessor a kernel is running on.
Why?

Well, I was thinking that perhaps it would be possible to implement an extension of local memory into global memory, for those problems that require more storage for temporary data.

But allocating appropriate storage in global memory for every work-item might be very inefficient, because only a few work-groups can run at the same time, and once they have run, they are done. So it would make sense, as with local memory, to have number_of_streaming_multiprocessors memory blocks (conceptually) in global memory for the work-groups that are currently running.

Does that make sense?
Yes, I know it would be slower than local memory, but it might still be faster than running the problem on the CPU, also thanks to the much higher bandwidth of graphics RAM, right?

Well, what’s the problem? You can address a preallocated global memory buffer with your local ID, and the global ID tells you which chunk belongs to your work-group.
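
For illustration, a minimal sketch of this suggestion in OpenCL C, where every work-item simply carves its own slice out of one big preallocated buffer (SCRATCH_PER_ITEM is an illustrative placeholder for the per-item scratch size):

    // Each work-item indexes its own slice of one preallocated buffer.
    // SCRATCH_PER_ITEM is illustrative, e.g. 65 KB worth of floats.
    #define SCRATCH_PER_ITEM (65536 / sizeof(float))

    __kernel void naive_scratch(__global float *scratch)
    {
        __global float *mine =
            scratch + (size_t)get_global_id(0) * SCRATCH_PER_ITEM;
        // ... use mine[0 .. SCRATCH_PER_ITEM-1] as temporary storage ...
    }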

Perhaps I didn’t express myself clearly enough.

Say you have a total of 1 000 000 work-items (kernel instances). You group them into work-groups of 256 work-items each.
Say each work-item needs 65 KB of temporary memory, and local memory is not enough, so you need to go global.

Then you do as you say, meaning that you preallocate 65 KB per work-item you are going to enqueue, and each work-group addresses the beginning of its “chunk” like get_group_id(0) * get_local_size(0) * 65 KB.

This way you need a buffer of 1 000 000 * 65 KB on the GPU, which is 65 gigabytes.

If you instead allocate (65 KB * group_size) * number_of_streaming_multiprocessors (in a GT200 that is 30, in a GF1xx it is 16), then you need at most about 500 megabytes, which would be feasible.

And of course each work-group that gets scheduled on an SM reuses the chunk assigned to that SM.
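
As a sketch of what this could look like in OpenCL C, assuming a hypothetical get_smid() built-in that returns the multiprocessor ID (no such function exists in OpenCL; it stands for exactly the extension I am asking about):

    // Hypothetical per-SM scratch addressing. get_smid() does NOT exist;
    // it represents the requested extension. The buffer holds
    // 65 KB * group_size * number_of_SMs bytes in total.
    #define SCRATCH_PER_ITEM (65536 / sizeof(float))

    __kernel void per_sm_scratch(__global float *scratch)
    {
        uint sm  = get_smid();                     // hypothetical built-in
        uint lid = get_local_id(0);
        __global float *mine = scratch
            + ((size_t)sm * get_local_size(0) + lid) * SCRATCH_PER_ITEM;
        // ... use mine[0 .. SCRATCH_PER_ITEM-1] as temporary storage ...
    }

(For what it’s worth, CUDA code on NVIDIA hardware can read the SM ID from the %smid PTX special register via inline assembly, but as far as I know nothing equivalent is exposed in OpenCL.)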

Great, you mentioned some numbers; now it is clearer what you are up to.

I suppose you forgot that a single multiprocessor doesn’t run only one work-group at a time. (Read more about occupancy in the programming guide.) So even the small local memory accessible to one multiprocessor is divided (where possible) between several work-groups. So for your idea you would need to find out how many work-groups are running concurrently, which is not possible in my opinion.

Or buy a Tesla GPU, where you can select whether to have more local memory or a bigger cache.

Hmm, well, that might be true; when I read the NVIDIA docs I didn’t notice that. But I assume you can force only one work-group per multiprocessor at a time, perhaps by simply allocating all the available local memory for each group. That way the driver cannot schedule another work-group on the same SM until the previous one has finished. Or am I wrong?
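
A minimal sketch of that trick in OpenCL C, assuming a device with 48 KB of local memory per multiprocessor (16 KB on GT200; query CL_DEVICE_LOCAL_MEM_SIZE for the real figure). A __local array that consumes nearly all of it should prevent a second work-group from becoming resident on the same SM:

    // Occupy (nearly) all local memory so that at most one work-group
    // fits per multiprocessor. 48 KB is assumed; a little headroom is
    // left for any local memory the runtime uses internally.
    __kernel void one_group_per_sm(__global char *out)
    {
        __local volatile char hog[48 * 1024 - 256];
        hog[get_local_id(0)] = 1;     // touch it so it is not optimized away
        /* ... kernel body ... */
        out[get_global_id(0)] = hog[get_local_id(0)];
    }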

You are right (or schedule only the number of work-groups that covers the multiprocessors exactly once). However, coding like this will ruin the speed, as memory latency is normally hidden by work-group switching. Use it only for testing purposes (e.g. to see how memory-latency-bound a kernel is…).

That’s true; I think you are absolutely right that such code would not deliver great performance.
However, for a problem that is inherently parallel (i.e. one that cannot benefit from a clever sequential formulation), it might still be faster than doing it on the CPU…
It might be worth trying at least, but again, it’s not possible to know, from within a kernel, on which SM you are executing.

On the other hand, it might be possible to enqueue only number_of_multiprocessors work-groups at a time, wait for them to finish, and go further? Perhaps? But in that case performance would really be horrible.
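
For what it’s worth, a rough host-side sketch of that batching idea (the function name and the kernel argument index are illustrative; num_sms would come from CL_DEVICE_MAX_COMPUTE_UNITS). Each batch launches exactly num_sms work-groups, so get_group_id(0) inside the kernel runs from 0 to num_sms-1 and can index one of the num_sms scratch chunks directly:

    // Launch num_sms work-groups per batch and wait before the scratch
    // chunks are reused. Only standard OpenCL host calls are used; the
    // surrounding names are illustrative.
    #include <CL/cl.h>

    void run_in_batches(cl_command_queue queue, cl_kernel kernel,
                        size_t group_size, size_t total_groups, size_t num_sms)
    {
        for (size_t first = 0; first < total_groups; first += num_sms) {
            size_t batch = total_groups - first;
            if (batch > num_sms) batch = num_sms;

            /* Tell the kernel which logical group the batch starts at
               (argument index 1 is assumed to be reserved for this). */
            cl_uint first_group = (cl_uint)first;
            clSetKernelArg(kernel, 1, sizeof(cl_uint), &first_group);

            size_t global = batch * group_size;
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global, &group_size, 0, NULL, NULL);
            clFinish(queue);  /* wait before the chunks are reused */
        }
    }

The clFinish() after every tiny launch is exactly what would make this so slow.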

I suppose you are right. However, it would be faster to partition the input data so that each part fits in local memory and process it in parts.
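
A small sketch of that tiling approach in OpenCL C, assuming the work-group size equals the tile size (TILE and the processing step are illustrative placeholders):

    // Stream the input through local memory one tile at a time instead of
    // giving every work-item a huge global scratch area.
    // Assumes get_local_size(0) == TILE.
    #define TILE 256

    __kernel void process_in_tiles(__global const float *in,
                                   __global float *out, uint n)
    {
        __local float tile[TILE];
        uint lid = get_local_id(0);

        for (uint base = get_group_id(0) * TILE; base < n;
             base += get_num_groups(0) * TILE) {
            if (base + lid < n)
                tile[lid] = in[base + lid];    // stage one tile locally
            barrier(CLK_LOCAL_MEM_FENCE);

            /* ... process tile[] here (placeholder) ... */

            if (base + lid < n)
                out[base + lid] = tile[lid];   // write the result back
            barrier(CLK_LOCAL_MEM_FENCE);      // before the tile is reused
        }
    }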
