I’m trying to do a lot of thread intercommunication to distribute workloads,
things like having a warp do a compaction for just that warp alone. It’s possible by doing various read/writes into shared memory. But I have a few questions, these may not even be documented, but maybe they’re familiar to some CUDA hackers.
Is there a good NOOP function? Sometimes I want to delay a beat to allow shared memory writes to be recorded without doing a block-wide __syncthreads(). I can use some kind of useless compute as a manual no-op, but that’s always dangerous: if the compiler gets smart, it optimizes the filler away. (And yes, depending on shared memory without __syncthreads() is dangerous, but if it’s all inside the same warp you can abuse your knowledge of which threads are reading and writing… you just sometimes need a one-op delay for a thread in one half-warp to “see” the memory write from a thread in the other half-warp.)
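For what it’s worth, one common way around the manual-delay hack is to declare the shared buffer volatile, so the compiler can’t cache the value in a register or elide the access. This is a hedged sketch, not the poster’s code; all names (exchange, buf) and the 256-thread block size are invented:

```cuda
// Hedged sketch: warp-synchronous exchange through shared memory.
// Declaring the buffer volatile forces the compiler to emit the real
// load/store instructions in program order, so no dummy no-op delay
// is needed for a lane to see the other half-warp's write.
__global__ void exchange(int *out)
{
    __shared__ volatile int buf[256];   // one slot per thread of the block

    unsigned lane = threadIdx.x & 31u;  // lane id within the warp
    buf[threadIdx.x] = threadIdx.x;     // write my slot

    // No __syncthreads(): within one warp the store above retires
    // before the warp's next instruction issues, and volatile stops
    // the compiler from short-circuiting the read.
    int neighbor = buf[threadIdx.x ^ 16];   // read from the other half-warp

    out[blockIdx.x * blockDim.x + threadIdx.x] = neighbor + (int)lane;
}
```

This still relies on the whole warp executing in lockstep, which is exactly the warp-synchronous assumption being discussed in this thread.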
Is there any way to retrieve the “live thread” mask for a warp? CUDA must have this or something similar handy internally since it’s using it to remember which threads are alive or not. [in fact it must have a stack of these per warp!]. I want it because when doing a reduction, you have to know which threads are DEAD so you can patch up their non-involvement. I wrote my own “live thread bitmask tester” but it takes 64 ops and uses shared memory.
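As a hedged sketch of a cheaper live-mask tester than a 64-op version: if shared-memory atomics are available (compute capability 1.2+, e.g. GT200), every live thread can OR its own bit into one shared word. The function name and calling convention are invented:

```cuda
// Hedged sketch: build the "live thread" bitmask for one warp by having
// every live thread OR its own bit into a single shared word.
// Requires shared-memory atomics (compute capability 1.2+, e.g. GT200).
// warpWord must point at a per-warp slot in shared memory that was
// zeroed while the whole warp was still alive.
__device__ unsigned liveMask(volatile unsigned *warpWord)
{
    unsigned lane = threadIdx.x & 31u;
    atomicOr((unsigned *)warpWord, 1u << lane);  // each live thread sets its bit
    return *warpWord;                            // all live threads read the mask back
}
```

The clearing has to happen before the divergent region begins; if you clear it from inside the divergent code, the one lane elected to do the clear might itself be dead.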
Is there a way to figure out what SM the current block is running on? Maybe I have a resource (like a global memory scratchpad) that I want to use, and it’d be cool if I just had a unique SM identifier to index into that scratchpad and not have to worry about other blocks accessing it. Otherwise I need the overhead of global atomics and lots of extra annoying harness code.
Inline assembly? I am pretty sure this is “no,” but do we even have any hints about possible updates? What do people do now to tweak PTX, a Perl postprocess to manually paste ops in or something?
Since it is used in the reduction example, I think it can’t really be that dangerous. You have to be very careful with branches/predicates around code like
int local = shared;
shared = global;
since you can’t really know in which order the compiler will put them.
What exactly do you need the delay for? Since the whole warp always executes as one block, I cannot see how it would help. Maybe you just got lucky, and whatever you changed made the compiler put the two code paths of a branch in a different order? Or made it split a branch into two branches?
I would think it only has some kind of “enabled” mask, which is not what you want, since it will also change with branches…
64 ops for doing what exactly? Using an array with one byte for each thread should not need that many instructions, though it would take quite a bit of shared memory.
I do not understand that question at all: even on old cards there can be up to 8 blocks resident on the same SM, and the SM can switch between them at any time, so how does knowing the SM let you avoid atomics?
The best approach I can think of is reducing the number of blocks to about 8 × the number of SMs and then using one separate buffer for each block.
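That per-block-buffer idea might look like the following sketch; the scratch size, kernel name, and layout are all made up for illustration:

```cuda
// Hedged sketch: give each block its own private slice of a global
// scratchpad, indexed by blockIdx rather than by SM id. With a
// dedicated region per block, no atomics are needed to keep blocks
// from stepping on each other. SCRATCH_PER_BLOCK is illustrative.
#define SCRATCH_PER_BLOCK 1024

__global__ void work(float *scratch)  // gridDim.x * SCRATCH_PER_BLOCK floats
{
    float *mine = scratch + blockIdx.x * SCRATCH_PER_BLOCK;  // my region
    mine[threadIdx.x] = 0.0f;   // ... use the private region freely ...
}
```

Launching with the grid capped at roughly 8 × the SM count, as suggested above, keeps the scratchpad allocation bounded.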
I did not get that desperate yet (and mostly I had the impression that the specific assembler code does not make that much of a difference, as long as you keep the compiler from using local memory), but given your questions I have to wonder: would it be less painful to just write everything (well, the parts where you really need those special optimizations) in PTX?
I do not know how well ptxas does, but the regular assemblers like yasm are very powerful, making assembler programming far less of a pain than it used to be.
There are read-after-write issues if there isn’t some delay. This is explicitly documented in the programming guide for REGISTERS but I think I’ve seen it in shared memory too. But thanks for the pointer about branch ordering. I don’t think that’s what I’m seeing, but it could be a gotcha in other cases.
Yep, it’s that “enabled” mask which would be useful. The problem: you have a warp. Some of the threads in that warp may have a work result to report, some don’t. You want to do a compaction on those results so you can output them without gaps. A reduction is a standard parallel operation, but we’re complicated by the fact that some threads may not even be active! Those inactive threads can’t participate in the compaction compute, which makes a simple operation much more complicated.
Some pseudocode for a warp like:
If your thread has a job to do:
----Work on the job.
----If your job has produced a result to report,
--------Use compaction over the warp threads to figure out an output index to write to
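The compaction step in that pseudocode can be sketched as a warp-level prefix sum over a has-result flag in shared memory (warp-synchronous, so no __syncthreads()). A hedged sketch; the function name and flag layout are invented:

```cuda
// Hedged sketch: warp-local stream compaction. Each thread with a
// result computes an exclusive prefix sum of the "hasResult" flags to
// find a gap-free output index. Warp-synchronous: correctness relies
// on all 32 lanes running in lockstep, hence the volatile buffer.
// flags points at this warp's 32-slot region of shared memory, and it
// must be zeroed while the whole warp is still alive, because lanes
// that have already exited the kernel will never write their flag.
__device__ int compactIndex(volatile int *flags, int hasResult)
{
    unsigned lane = threadIdx.x & 31u;
    flags[lane] = hasResult ? 1 : 0;

    // Naive exclusive scan over 32 flags (a log-step scan also works).
    int index = 0;
    for (unsigned i = 0; i < lane; ++i)
        index += flags[i];

    return hasResult ? index : -1;   // -1: this thread has nothing to write
}
```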
Hmm, I really can’t think of a reason why writing-then-reading shared memory from the same warp will break anything. Can you post an example where this happens? It just shouldn’t (if writing-then-reading works on one thread, it should work on all of them).
To do per-warp compaction you could use smem atomics (GT200). You just do an atomic increment by the threads that have results, and get an index for each one. I believe threads in the same warp will maintain the same ordering.
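A sketch of that atomic-increment compaction (assumes shared-memory atomics, compute capability 1.2+; the function name is invented):

```cuda
// Hedged sketch: per-warp compaction via a shared-memory counter
// (requires shared-memory atomics, e.g. GT200). Each thread holding a
// result atomically grabs the next free output slot. warpCounter is
// this warp's counter in shared memory, zeroed beforehand.
__device__ int claimSlot(unsigned *warpCounter, int hasResult)
{
    int slot = -1;
    if (hasResult)
        slot = (int)atomicAdd(warpCounter, 1u);  // unique, gap-free index
    return slot;
}
```

One caveat: the atomic guarantees unique, dense indices, but the lane ordering mentioned above is the poster’s belief rather than something guaranteed by the documentation of that era.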
To figure out the current SM: I’ve searched and found an answer. First, I remember reading something about interesting variables being accessible at the hardware level, some tucked in a secret portion of smem, some elsewhere. That brought me to the decuda README file, which has a lot of interesting information. It then sent me to this thread: http://forums.nvidia.com/index.php?showtopic=38393&hl=
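For reference, PTX does expose an %smid special register, and on toolkits that later gained inline-PTX support it can be read directly. That support did not exist at the time of this thread (hence the decuda route), so treat this as a hedged sketch:

```cuda
// Hedged sketch: read the PTX %smid special register via inline PTX.
// asm() in device code only exists in later CUDA toolkits. Note that
// the PTX documentation warns %smid is volatile: it reports where the
// thread is running at the moment of the read, and the value can
// change if the block is rescheduled, so it is a hint, not a key.
__device__ unsigned currentSM(void)
{
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```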
Of course, this brings us back to inline PTX… which is not supported and has never been promised to us, despite our asking many times. If you write some scripts that can automatically insert instructions, that would be very cool.