Why don’t you just use block IDs? No block is going to run on multiple MPs, so all threads in the block can access the same part of the array. Unless for some odd reason you really want to limit yourself to using between 2 and 30 parts of an array, I see no reason to want an MP ID.
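To make that concrete, here is a minimal sketch of one structure per block, indexed with the built-in blockIdx (JobState and its fields are made-up names for illustration):

```cuda
// Sketch: one structure per block, selected with the built-in blockIdx.
// JobState and its fields are hypothetical.
struct JobState {
    int   counter;
    float data[32];
};

__global__ void work(JobState *states)
{
    // Every thread in the block addresses the same array element.
    JobState *s = &states[blockIdx.x];

    if (threadIdx.x == 0)
        s->counter = 0;      // one thread initializes the block's struct
    __syncthreads();         // then the whole block can safely use *s
}
```

Launched with N blocks, this needs an array of N structures, and no thread ever touches another block’s element.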
I know this is quite strange. I am trying to use the GPU as a general-purpose multiprocessor, so the synchronization may look somewhat unusual for CUDA.
Actually I am trying to build an array of structures in which each structure corresponds to a block. But as the number of blocks grows, I am afraid the array will become too large. So I decided to share a structure among the blocks that execute on the same MP.
Well, there seems to be no function to get the processor ID, at least not one documented in the reference manual. If there is absolutely no way to do it, I might have to figure out another approach.
The CUDA model aims to abstract away the hardware implementation (since the number of MPs varies between cards) and to have you think about task partitioning in terms of blocks, not the MPs they will execute on. You could try to force a one-block-per-MP situation, but it’s not advisable: the scheduler expects more blocks than MPs for automatic latency hiding and pipelining, and you will most likely lose performance if you fight it.
Still, even if you knew the MP ID for a given block, you wouldn’t really have any mechanism for synchronization and/or cooperation between blocks (except perhaps global-memory mutexes). In the CUDA model, blocks should be independent. Making them cooperate is technically possible but likely slow. And even if you programmed them to cooperate (using global memory for communication), there would be no bonus for running on the same MP. Global memory isn’t “closer” to any given MP; shared memory is, but you’re not supposed to assume anything about shared memory outside of a given block (even on the same MP).
You can’t make two or more blocks work on the same piece of shared memory (even if they happen to execute on the same MP), so if you really need inter-block communication you’ll need global memory, and then it makes no difference which MP accesses it. My advice: try to come up with a different way of partitioning your problem.
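For completeness, the global-memory mutex idea mentioned above can be sketched roughly like this, assuming hardware with global atomics (compute capability 1.1 or later); `lock` and `shared_counter` are illustrative names, not an established API:

```cuda
// Rough sketch of a global-memory mutex between blocks, assuming
// compute capability >= 1.1 (global atomics). Names are illustrative.
// Spinning blocks occupy an MP and can hurt performance badly.
__device__ int lock = 0;             // 0 = free, 1 = held

__global__ void increment(int *shared_counter)
{
    if (threadIdx.x == 0) {          // one thread per block takes the lock
        while (atomicCAS(&lock, 0, 1) != 0)
            ;                        // spin until acquired
        *shared_counter += 1;        // critical section in global memory
        atomicExch(&lock, 0);        // release
    }
}
```

In practice you would also need to worry about memory-visibility ordering around the critical section, which is part of why the advice is to avoid this pattern and repartition the problem instead.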
By the way, how many of those big structures will you have in this array?
I couldn’t find it with a quick search, but if I remember right, some clever hacker did extract the multiprocessor number once. It was hidden inside shared memory, like threadIdx is. Maybe it can still be obtained by similarly ugly hacks, such as taking the block’s shared-memory pointer and reading at negative indices.
There is a thread about it somewhere on this forum. It may no longer work in CUDA 2, but it might be possible.
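A possibly less fragile route than poking at shared memory is inline PTX, if your toolchain supports it: the PTX ISA documents a special register `%smid` that holds the ID of the multiprocessor a thread is currently running on (the docs caution that the value is volatile, so treat it as a hint rather than a stable identifier). A minimal sketch:

```cuda
// Sketch: read the multiprocessor ID via the PTX special register %smid.
// The PTX ISA notes this value may change during execution, so use it
// only as a hint, never as a stable block identifier.
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```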
There are a couple of advanced cases where the information could be useful. If you have a stream of job results you want to append to in global memory without atomics (especially on 1.0 hardware), you could create output arrays per multiprocessor rather than per block, since two blocks that ran sequentially on the same multiprocessor cannot interfere with each other. This assumes only one block runs on an MP at a time, of course.
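A sketch of that per-multiprocessor append, under the stated assumptions (at most one resident block per MP, a known upper bound on the MP count, and the `%smid` PTX register being readable); `NUM_SM`, `SLOTS_PER_SM`, and the kernel name are all made up for illustration:

```cuda
#define NUM_SM       30     // assumed upper bound on multiprocessors
#define SLOTS_PER_SM 1024   // assumed capacity of each per-MP stream

// Per-MP append without atomics: each MP owns one output segment and one
// write cursor. Safe ONLY if at most one block is resident per MP, so
// blocks on the same MP run strictly one after another.
__global__ void append_results(float *out,     // NUM_SM * SLOTS_PER_SM floats
                               int   *cursor)  // NUM_SM counters, zeroed first
{
    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));     // which MP are we on?

    if (threadIdx.x == 0) {
        float result = (float)blockIdx.x;      // placeholder job result
        int pos = cursor[sm]++;                // no race: one block per MP
        if (pos < SLOTS_PER_SM)
            out[sm * SLOTS_PER_SM + pos] = result;
    }
}
```

If the one-block-per-MP assumption is ever violated (e.g., the occupancy allows two resident blocks), the unsynchronized cursor increment races, so this really is a special-case trick rather than a general pattern.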