You can have different blocks do different things. However, what you are trying above is to synchronize these blocks, which is not a good idea, and busy waiting is a particularly bad idea under CUDA.
Further, using just two blocks will leave all but the lowest-end mobile GPUs underutilized.
If you want to implement a producer-consumer scheme, do it with different warps inside the same block, which can easily be synchronized with __syncthreads() or the slightly more versatile bar.sync PTX instruction. In your example above, however, I can’t see the point at all: why not just consume the data in the same block where it is produced, instead of trying to make a second block busy-wait?
It seems like your kernel is trying to fill an array with random numbers wherever an element is not yet set (-1).
When an array position holds a random number, the consumer reads it and resets the position to “not set”.
Your attempt used a single flag to indicate this and a while loop to spin on the flag; all in all not such a bad idea, but it needs a bit more work.
Why you want to do this I don’t know exactly, but you could try this idea:
Create two arrays:
RandomNumberGenerated[i]
RandomNumberValue[i]
Each random number (value) has its own “generated” flag.
When the number is generated, this flag is set to 1 (true); after the number is consumed, it is set back to 0 (false).
The idea is to have each code section spin on this flag until it changes state.
Doing that requires two threads running the same code in alternating roles:
one thread executes the upper block of code, the other executes the lower block.
So each [i] needs two threads.
Perhaps the two roles can be switched with a branch based on the thread index.
For example:
CodeBlock = threadIdx.x % 2;
RandomNumberIndex = threadIdx.x / 2;
if (CodeBlock == 0)
{
    while (RandomNumberGenerated[RandomNumberIndex] == 1)
    {
        // spin and do nothing, wait for random number to be consumed
    }
    // generate number (curand() stands for a cuRAND call with its state argument)
    RandomNumberValue[RandomNumberIndex] = curand();
    // set flag to generated
    RandomNumberGenerated[RandomNumberIndex] = 1;
}
else
{
    while (RandomNumberGenerated[RandomNumberIndex] == 0)
    {
        // spin and do nothing, wait for random number to be generated
    }
    // consume random number
    r = RandomNumberValue[RandomNumberIndex];
    // set flag to consumed
    RandomNumberGenerated[RandomNumberIndex] = 0;
}
You could also wrap the code section above in a loop to keep doing this the whole time:
Running = 1;
while (Running == 1)
{
    … code section above …
}
This scheme is finer-grained than your idea: yours had one block do this and another block do that, while here one thread does this and the other thread does that. Perhaps that’s better/easier to code.
I do see a little drawback though: serialization within a warp.
Perhaps CodeBlock can be derived from threadIdx.x in a different way, so that the first warp executes CodeBlock 0 and the second warp executes CodeBlock 1.
That might be better for performance. My first try might even deadlock: the two sides of a diverged warp execute serially, so a spin loop in one branch can starve the branch that would release it.
Let’s see: CodeBlock should be 0 for the first 32 threads, 1 for the next 32 threads, and so on. So some formula is needed for that.
It’s probably pretty simple:
CodeBlock = (threadIdx.x / 32) % 2;
Threads 0 to 31: /32 gives 0.
Threads 32 to 63: /32 gives 1.
Threads 64 to 95: /32 gives 2, but %2 brings it back to 0.
And so forth, so this formula should work. It’s a bit slow, but so be it.
RandomNumberIndex should also be calculated differently.
Let’s see: the first 32 threads (producers) should map to 0 to 31, the next 32 threads (their consumers) should also map to 0 to 31, the next 32 to 32 to 63, and the next 32 again to 32 to 63. Since a producer warp and the consumer warp right after it share the same slots, the division has to be by 64, not 32:
RandomNumberIndex = (threadIdx.x % 32) + ((threadIdx.x / 64) * 32);
In recent months I have started some topics because I have a project on CUDA. The problem for my project is an algorithm for stochastic simulation of chemical reactions. In this algorithm I must write pairs of (reaction, min time).
Each thread:
1. calculate propensity a
2. if (a > 0)
   2.1 r = curand
   2.2 t = 1/a * ln(1/r)
All threads:
3. if (threadIdx.x == blockDim.x - 1) write (min_t, reaction) to the buffer
4. execute the reaction.
In my first test I called the kernel many times, until time > time_final, and the kernel ran only one step of the algorithm. Now I have a buffer, a struct array with 500000 positions; the kernel runs this loop, then ends, the host writes the buffer to a file (I use streams for this), and the kernel starts again until the simulation time ends. The loop is helpful because of the use of shared memory for the species and reactions arrays. The chemical models aren’t powers of 2; for example I use a model with 316 reactions and 1 thread per reaction. If I used one block for both the random numbers and the algorithm I would need more than 316*2 threads, and for that reason I thought to use 2 blocks: one for the algorithm and the other to generate the random numbers, so that when the first block wants a random number it is already waiting in the array and doesn’t have to be calculated, since at each step that cost depends on how many threads have a > 0.
Sorry for my English. Ask me if something is unclear.
Ok, I see, so you want code for blocks and not for warps… my bad; when I read “block” I thought you meant a warp. I guess tera’s advice, which suggested using warps, also had something to do with it.
So I took that advice and used it for a first version.
But I guess my technique can be adapted to blocks as well… probably best to use two kernels and have them execute concurrently.
Perhaps later I’ll look closer into this; then I could also test these ideas myself, as I’m getting close to being able to compile and run my own kernels the way I like.
(It will probably still take a few days before I can run them.)
The first version is about 2-3 times slower than the version with the buffer, maybe because of the use of shared memory. The problem with the other code block is that the code never ends.
My thought was: when finding the min, half of the threads at every step are idle anyway, so use them to scan the array for -1 and generate the random numbers for the next turn, hiding the curand cost of the current turn.
Two kernels may be better, as I said at the start, but my card is only compute capability 1.1.
Anyway, making the kernels end probably requires jumping/breaking out of the nested while loops.
So the nested while loops could check an additional condition, for example:
while ((RandomNumberGenerated[RandomNumberIndex] == 1) && (Running == 1))
And for the second code block:
while ((RandomNumberGenerated[RandomNumberIndex] == 0) && (Running == 1))
So by setting Running to 0 anywhere inside the kernel, the kernel should end…
But I would prefer a “goto + label” to jump out of the while loop directly to the end of the kernel, so as not to upset any other calculations.
(I don’t know offhand whether goto is allowed in CUDA device code.)
Otherwise additional checks/branches are needed.
So either
solution 1:
while (…)
{
    // spinning
    if (Running == 0) goto KernelExit;
}
solution 2:
while (…)
{
    // spinning
}
// if still running
if (Running == 1)
{
    // do stuff
}
Ok, entire code updated for the “stop running” requirement and the “leave calculations intact” requirement.
This is how it should look (the flag arrays should be declared volatile, or accessed atomically, so the spinning threads actually see updates):
CodeBlock = (threadIdx.x / 32) % 2;
RandomNumberIndex = (threadIdx.x % 32) + ((threadIdx.x / 64) * 32);
Running = 1;
while (Running == 1)
{
    if (CodeBlock == 0)
    {
        while ((RandomNumberGenerated[RandomNumberIndex] == 1) && (Running == 1))
        {
            // spin and do nothing, wait for random number to be consumed
        }
        if (Running == 1)
        {
            // generate number
            RandomNumberValue[RandomNumberIndex] = curand();
            // set flag to generated
            RandomNumberGenerated[RandomNumberIndex] = 1;
        }
    }
    else
    {
        while ((RandomNumberGenerated[RandomNumberIndex] == 0) && (Running == 1))
        {
            // spin and do nothing, wait for random number to be generated
        }
        if (Running == 1)
        {
            // consume random number
            r = RandomNumberValue[RandomNumberIndex];
            // set flag to consumed
            RandomNumberGenerated[RandomNumberIndex] = 0;
            // stop running
            Running = 0;
        }
    }
}
Give it a try and let me know how it goes.
(I haven’t run any of this code yet, so I have absolutely no idea how it actually behaves for real!)
I’ll try this later, but is this code for one block or two? Also, the kernel runs the loop of the algorithm while the other block just updates the rand array. When the loop ends, a flag must be changed to end the other block too.
Currently, because of the threadIdx.x usage, it is limited to the X dimension of the block.
So it will use however many threads you can fit into the x dimension, but it should be a multiple of 64.
So the minimum width (minimum dimension x) is 64.
For example dimBlock(64, 1, 1) (block dimensions must be at least 1, not 0).
Anyway, try 64, 128, 192, 256, 320, 384, 448, 512, 576 and so forth until your GPU cannot handle any more in the x direction.
The maximum is probably 512 or 1024 threads per block.
There can be as many blocks as you like, but then the code needs to change a little:
the indexes need to be adjusted if more than one block is used.
I’ll try to put the formulas here:
CodeBlock = (((blockIdx.x * 64) + threadIdx.x) / 32) % 2;
RandomNumberIndex = (((blockIdx.x * 64) + threadIdx.x) % 32) + ((((blockIdx.x * 64) + threadIdx.x) / 64) * 32);
^ The 64 could be replaced with blockDim.x; but see below for general formulas, they should be much better for maximum scalability!
This would allow lots of blocks in the x direction.
The maximum number of blocks is probably 65535 or so; check your GPU specs.
This allows a maximum of ThreadsPerBlock * MaximumBlocksInDirectionX random numbers.
This could be increased further by using blockIdx.y, but then the formulas need further adjustment.
(The CodeBlock adjustment is probably not even necessary, but it is there just in case.)
RandomNumberIndex is the data index, so this is what must be adjusted if even more is needed.
Of course all of this code is untested… but give it a try and see what happens.
Actually this code can probably be written as follows: