per block scratch mem - more than provided by shared mem

My kernel design needs 36 million bytes of scratch memory per thread block,
but my device's sharedMemPerBlock is only 49,152 bytes.

The number of blocks is likely to be large, so I can't just pass in a device memory
structure indexed by block number, because a device memory structure indexed by block and thread
would be enormous.

The scratch memory is needed only within the kernel doing each block.
At the end of the block processing, the kernel will data reduce the scratch memory
to only 1440 x 8 x 3= 34,560 bytes as final output for that thread block.
This yield per block is then stored by block id into a separate device memory structure
indexed by thread block.

After the data reduction for a block, any scratch memory can be reused. It's as if I want
shared-memory semantics, in that the memory exists only for the block at hand. Perfect, except there is
not nearly enough of it.

What other options do I have?

  1. I can't pass in a giant device memory structure indexed by block, because that's
    too much memory. With 1024 blocks, it would take 1024 x 1024 x 1440 x 8 x 3 = 36,238,786,560 bytes.
  2. The scratch memory I need per block is 1024 x 1440 x 3 x 8 = 35,389,440 bytes,
    and shared memory is only 49,152 bytes.

I could do 1) if the number of blocks were severely reduced, from 1024 down to just a few, but
wouldn't that kill multiprocessor occupancy?

What compute capability is your device? On 2.x, you can just use device-side malloc() and free(). On any device you could either run your own allocation scheme using atomic operations, or use %smid to index into previously allocated buffers. Note though that %smid is not necessarily contiguous.
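The "own allocation scheme using atomic operations" suggestion could look roughly like the sketch below: a fixed pool of scratch slots sized to the maximum number of *resident* blocks rather than the grid size, with each block claiming a slot on entry and releasing it on exit. The pool size, slot layout, and kernel shape are all assumptions for illustration, not a definitive implementation.

```cuda
// Sketch (untested): a fixed pool of scratch slots claimed with atomics.
// This is safe from deadlock as long as NUM_SLOTS >= the maximum number of
// blocks resident on the device at once, since only resident blocks hold slots.

#define NUM_SLOTS 32                      // assumption: >= max resident blocks
#define SLOT_ELEMS (1024 * 1440 * 3)      // per-block scratch, in doubles

__device__ int slot_flags[NUM_SLOTS];     // 0 = free, 1 = in use (zero-initialized)

__global__ void kernel(double *pool)      // pool holds NUM_SLOTS * SLOT_ELEMS doubles
{
    __shared__ int my_slot;

    if (threadIdx.x == 0) {
        // Spin over the flag array until a free slot is claimed.
        int s = 0;
        while (atomicCAS(&slot_flags[s], 0, 1) != 0)
            s = (s + 1) % NUM_SLOTS;
        my_slot = s;
    }
    __syncthreads();

    double *scratch = pool + (size_t)my_slot * SLOT_ELEMS;

    // ... compute phase and per-block reduction using scratch ...

    __syncthreads();
    if (threadIdx.x == 0)
        atomicExch(&slot_flags[my_slot], 0);  // release the slot for another block
}
```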

my device

name=Tesla C2050
major=2 minor=0

The problem is that I need to do a memory-hungry reduction, after the compute phase, for each thread block, and:

  1. Shared memory is too limited in size.
  2. Passing the kernel a structure indexed by thread block would work fine, but
    I have been in the habit of assuming I can do kernel launches with 1024 thread blocks, and an externally
    malloc'd array (even in pinned host memory) indexed by thread block would be way too big. If instead I did
    1024 kernel launches, the calling program could malloc the temporary memory once and the kernel could use it:
    with synchronized, single-file launches, I malloc the memory before the kernel launch, the kernel uses it to
    do its reduction and passes the result back to the caller when it finishes, and the temporary memory is then
    reused for the next launch. But this is a real departure from the single 1024-thread-block launches I have
    done in other applications; this application would have me do 1024 separate synchronous kernel launches.
    Even if I could malloc host memory from inside the kernel to get the storage for my reduction (and I don't
    think I can; it needs to be passed in), it would not help if I launched this thing with 1024 thread blocks
    like I am used to doing. Will performance croak if I do 1024 synchronous kernel launches? That would fix my
    problem, but will I have a performance disaster?
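The chunked-launch idea in point 2 can be sketched on the host side. The kernel name, the block-id offset parameter, and BLOCKS_PER_LAUNCH are placeholders I've assumed, not names from the original code:

```cuda
// Sketch: launch the 1024 blocks in smaller chunks, reusing one device
// scratch buffer across launches. Launches on the default stream execute
// in order, so no explicit synchronization is needed between chunks.

const int TOTAL_BLOCKS      = 1024;
const int BLOCKS_PER_LAUNCH = 16;   // e.g. one block per SM
const size_t SCRATCH_BYTES  =
    (size_t)BLOCKS_PER_LAUNCH * 1024 * 1440 * 3 * sizeof(double);

double *d_scratch;
cudaMalloc(&d_scratch, SCRATCH_BYTES);              // allocated once, reused

for (int first = 0; first < TOTAL_BLOCKS; first += BLOCKS_PER_LAUNCH) {
    // 'first' tells the kernel which logical block ids this chunk covers.
    myKernel<<<BLOCKS_PER_LAUNCH, 1024>>>(d_scratch, first);
}
cudaDeviceSynchronize();                            // wait for the last chunk
cudaFree(d_scratch);
```

This trades one big launch for a few dozen, which is far cheaper than 1024 single-block launches while keeping the scratch footprint bounded.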

Actually, the information in all those thread blocks is itself reduced further, outside the scope of this issue.
The unusual (for a beginner like me) feature of this problem is that inside the kernel, the object produced by a given thread block is an array of 1440 structs, where each struct is three doubles. The output, after the thread-block reduction, is a single such object, but the threads in the thread block each produce a separate such array, and the kernel's reduction step adds them element by element, reducing the output of a single thread block to one 1440-element array object: the yield for the thread block. All other thread blocks do the same thing, and I eventually want to reduce the output from all thread blocks the same way, so the output across all thread blocks is a single 1440-element array of double[3]. For now, this final block reduction is done neither by the CPU nor by any GPU CUDA work.
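A minimal sketch of the per-block reduction described above, assuming the per-thread arrays live in a global-memory scratch area laid out as [blockDim.x][1440][3] doubles (the layout and names are my assumptions):

```cuda
// Each of the blockDim.x threads owns a 1440-entry array of double[3] in
// scratch; the arrays are summed element by element into thread 0's copy,
// which then becomes the block's single 1440 x double[3] yield.

#define NENTRIES 1440

__device__ void block_reduce(double (*scratch)[NENTRIES][3],  // [blockDim.x][1440][3]
                             double (*yield)[3])              // [1440][3] output
{
    // Tree reduction over the per-thread arrays
    // (assumes blockDim.x is a power of two, e.g. 1024).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            for (int i = 0; i < NENTRIES; ++i)
                for (int c = 0; c < 3; ++c)
                    scratch[threadIdx.x][i][c] += scratch[threadIdx.x + stride][i][c];
        }
        __syncthreads();
    }

    // Thread 0's array now holds the block's yield; copy it out.
    if (threadIdx.x == 0)
        for (int i = 0; i < NENTRIES; ++i)
            for (int c = 0; c < 3; ++c)
                yield[i][c] = scratch[0][i][c];
}
```

The same element-by-element addition then applies one level up, reducing the per-block yields down to the final single 1440-element array.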

I am about to try solving my memory problem by doing 1440 separate synchronous kernel launches with one thread block each, instead of a single kernel launch with 1440 thread blocks. I am grateful for any insights, and I will let you know whether this solution of 1440 synchronous single-block kernel launches has awful performance or not.

Thank you for posting your device info.
Other than that, I’m not quite sure how your reply relates to my previous post. Given that your device is of compute capability 2.x, I’d suggest to just go ahead with my first proposal using device side malloc() and free(). Appendix B.16.2.3 of the Programming Guide has sample code that lays out the structure of such a kernel.
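For reference, a minimal sketch of the device-side malloc()/free() pattern being suggested (the output-array layout is my assumption). Note that the device heap defaults to 8 MB, so the host must enlarge it with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) before the launch:

```cuda
#include <cstdio>

// One thread per block allocates the block's scratch from the device heap,
// shares the pointer via shared memory, and frees it when the block is done.
__global__ void kernel_with_scratch(double (*yields)[1440][3])
{
    __shared__ double *scratch;

    if (threadIdx.x == 0) {
        // blockDim.x threads x 1440 entries x 3 doubles of scratch per block.
        scratch = (double *)malloc((size_t)blockDim.x * 1440 * 3 * sizeof(double));
        if (scratch == NULL)
            printf("block %d: device malloc failed\n", blockIdx.x);
    }
    __syncthreads();
    if (scratch == NULL) return;   // all threads see the shared pointer

    // ... compute phase and reduction into yields[blockIdx.x] ...

    __syncthreads();
    if (threadIdx.x == 0)
        free(scratch);             // scratch lives only as long as the block
}
```

Because the scratch is allocated and freed per block, the heap only ever holds scratch for the blocks that are actually resident, which is exactly the shared-memory-like lifetime being asked for.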

Just try fewer blocks - 1024 is a lot. The maximum number of resident blocks per multiprocessor is 8, which gives a total of 128 blocks for 16 multiprocessors (GeForce 580), and the maximum number of resident warps per multiprocessor is 48 for sm 2.0 (i.e., at most 1536 threads per multiprocessor). How many threads do you have per block? It is possible (depending on the algorithm) to achieve good performance with just 50% occupancy.

1024 threads per block. Thanks, I will find out today. Maybe I will try 10 blocks per kernel launch; that keeps the volume of scratch memory for the reduction down to 10 x 1024 x 1440 x 3 x 8 = 353,894,400 bytes. Big, but maybe OK. I will see.

10 blocks probably won't fill out the entire GPU, as one block cannot be split across SMs. Therefore it's good to have a multiple of the SM count (16 for the GeForce 580).

If possible, use fewer threads per block (512, for example, with 32 blocks?).