One kernel, two codes - CUDA arch 1.1

Can I have a kernel with 2 blocks where each block runs different code?

Basically, I want one block to update an array that the other block then uses.

__device__ int flag;

__global__ void mykernel(...)
{
    flag = 1;
    ...
    if (blockIdx.x == 0)
    {
        while (flag == 1)
            if (array[threadIdx.x] == -1)
                array[threadIdx.x] = curand; // the first time this is done in another kernel
    }
    else
    {
        for (...)
        {
            if (a > 0)
            {
                r = array[threadIdx.x];
                array[threadIdx.x] = -1;
            }
            ...
        } // end loop

        if (threadIdx.x == 0)
            flag = 0;
    }
}

In the code above I use the flag so that when the primary code ends, the secondary code ends too, but my program never stops.

Can I do this at all, or is this only acceptable on cards with CUDA arch 2.x using multiple concurrent kernels?


You can have different blocks do different things. However, what you are trying to do above is to synchronize these blocks, which is not a good idea. And busy waiting is a particularly bad idea under CUDA.

Further, using just two blocks will leave all but the lowest-end mobile GPUs underutilized.

If you want to implement a producer-consumer scheme, do it with different warps inside the same block, which can easily be synchronized with __syncthreads(), or the slightly more versatile bar.sync PTX instruction. In your example above, however, I can't see the point at all - why not just consume the data in the same block where it is produced, instead of trying to make a second block busy-wait?
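As a rough illustration, here is a minimal sketch (untested; the kernel name, the precomputed randoms buffer, and the 64-thread block size are assumptions for the example) where warp 0 produces into shared memory and warp 1 consumes, synchronized with __syncthreads():

__global__ void produce_consume(const float *randoms, float *out, int n)
{
    __shared__ float buffer[32];       // staging area filled by the producer warp

    // assumes blockDim.x == 64 and n a multiple of 32: warp 0 produces, warp 1 consumes
    for (int i = 0; i < n; i += 32)
    {
        if (threadIdx.x < 32)          // producer warp
            buffer[threadIdx.x] = randoms[i + threadIdx.x];

        __syncthreads();               // producing finished before consuming starts

        if (threadIdx.x >= 32)         // consumer warp
            out[i + threadIdx.x - 32] = buffer[threadIdx.x - 32] * 2.0f; // "use" the value

        __syncthreads();               // consuming finished before the next refill
    }
}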


It seems like you/your kernel are trying to create random numbers in an array wherever an element of the array is not set (== -1).

When an array position holds a random number, that number is to be read and the position reset to "not set".

Your attempt used a single flag to indicate this, and a while loop to spin on the flag - all in all not such a bad idea, but it needs a bit more work.

Why you want to do this I don't know exactly, but you could try this idea:

Create two arrays:

RandomNumberGenerated[i]

RandomNumberValue[i]

Each random number (value) has its own "generated" flag.

When the number is generated this flag is set to 1 (= true); when/after the number is consumed this flag is set to 0 (= false).

The idea is to have each code section spin on this flag until it changes state.

To do that requires two threads running the same code but in alternating fashion.

One thread would need to execute the upper block of code, the other thread would need to execute the lower block of code.

So each [i] needs two threads.

Perhaps the threads can be switched with a conditional based on the thread index.

For example:

CodeBlock = threadIdx.x % 2;         // mod
RandomNumberIndex = threadIdx.x / 2; // div

if (CodeBlock == 0)
{
    while (RandomNumberGenerated[RandomNumberIndex] == 1)
    {
        // spin and do nothing, wait for random number to be consumed
    }

    // generate number
    RandomNumberValue[RandomNumberIndex] = curand;

    // set flag to generated
    RandomNumberGenerated[RandomNumberIndex] = 1;
}
else
{
    while (RandomNumberGenerated[RandomNumberIndex] == 0)
    {
        // spin and do nothing, wait for random number to be generated
    }

    // consume random number
    r = RandomNumberValue[RandomNumberIndex];

    // set flag to consumed
    RandomNumberGenerated[RandomNumberIndex] = 0;
}


You could also wrap the code section above in a loop to keep doing this the whole time:

Running = 1;

while (Running == 1)
{
    // ... code section above ...
}

This code is more fine-grained than your idea: your idea was to have one block do this and another block do that.

In this idea of mine, one thread does this and the other thread does that.

Perhaps that's better/easier to code.

I do see a little drawback though: serialization within a warp.

Perhaps the CodeBlock based on threadIdx.x can be chosen in a different way, to make the first warp execute CodeBlock 0 and the second warp CodeBlock 1.

That might be better for performance… perhaps my first try would even deadlock because of the serialization?!

Let's see: CodeBlock should be 0 for the first 32 threads, 1 for the next 32 threads, and so on. So some formula is needed for that.

It's probably pretty simple:

CodeBlock = (threadIdx.x / 32) % 2;

0 to 31 divided by 32 becomes 0,

32 to 63 divided by 32 becomes 1,

64 to 95 divided by 32 becomes 2, but the mod 2 turns it into 0 again.

And so forth, so this formula should work; it's a bit slow but so be it.

RandomNumberIndex should probably also be calculated differently.

Let's see: the first 32 threads should map to 0 to 31, the next 32 threads should also map to 0 to 31, then the next 32 threads to 32 to 63, and the next 32 again to 32 to 63.

RandomNumberIndex = (threadIdx.x % 32) + ((threadIdx.x / 64) * 32);

(Checking: thread 0 gives 0, thread 32 gives 0, thread 64 gives 32, thread 96 gives 32 - so each producer warp and its consumer warp share the same indices.)

That should probably do it!

So that probably answers your question.

Theoretically this code could work, but what happens in practice I don't know…

Try it out and let us know what happens!

I guess this idea could be used for a producer/consumer model.

One code block would "produce", while the other would "consume".

So then the CUDA kernel becomes self-feeding, until perhaps a signal from the host is given to quit.

However, this idea might be silly, because you might as well let a single thread perform both blocks of code in alternating fashion.

The idea gets interesting if there is a one-to-many relation:

one thread/producer which produces a lot, and many threads/consumers trying to solve/consume it.

For such a situation the variables/indexes need to be changed a bit, so that one thread executes the producer and the other threads the consumers.

Perhaps multiple kernels are better for that…

They could be spinning on global memory and such, so it might be possible.

Question is if this spinning will actually work… it might for aligned 32-bit quantities.
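A minimal sketch of what such a spin on global memory might look like (untested; the volatile qualifiers are my assumption for keeping the compiler from caching the flag in a register, and I put producer and consumer in separate warps to avoid intra-warp branch serialization):

__device__ volatile int ready = 0;    // 0 = slot empty, 1 = slot full
__device__ volatile int mailbox;      // aligned 32-bit value being handed over

__global__ void spin_demo(int iterations)
{
    for (int i = 0; i < iterations; i++)
    {
        if (threadIdx.x == 0)         // one thread in warp 0 produces
        {
            while (ready == 1) { }    // spin until the previous value was consumed
            mailbox = i;              // produce a value
            ready = 1;                // publish it
        }
        else if (threadIdx.x == 32)   // one thread in warp 1 consumes
        {
            while (ready == 0) { }    // spin until a value is available
            int r = mailbox;          // consume it
            ready = 0;                // mark the slot empty
            (void)r;                  // silence unused-variable warning
        }
    }
}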


Thank you very much for your answers (@Skybuck: WOW, don't you get tired?).

In recent months I have started some topics because I have a project on CUDA. The problem for my project is an algorithm for stochastic simulation of chemical reactions. In this algorithm I must write out pairs of (reaction, min time).

each thread:

  1. calculate propensity a
  2. if (a > 0)
     2.1 r = curand
     2.2 t = 1/a * ln(1/r)
  all threads:
  3. find the reaction with min t (discussed in another forum thread)
  4. if (threadIdx.x == blockDim.x - 1) write (min_t, reaction) to the buffer
  5. execute the reaction.
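Roughly, one step per thread looks like this in CUDA (a sketch only, not my real code; the array names and the curand state handling are illustrative, and the min reduction is omitted):

#include <curand_kernel.h>

__global__ void ssa_step(curandState *states, float *a, float *t, int nReactions)
{
    int i = threadIdx.x;                        // one thread per reaction (e.g. 316 threads)

    if (i < nReactions)
    {
        // step 1: propensity a[i] is assumed to be computed already
        if (a[i] > 0.0f)                        // step 2
        {
            float r = curand_uniform(&states[i]);  // step 2.1
            t[i] = (1.0f / a[i]) * logf(1.0f / r); // step 2.2
        }
        else
        {
            t[i] = 1.0e30f;                     // "infinite" time: reaction cannot fire
        }
    }
    __syncthreads();

    // steps 3-5: block-wide min reduction over t[], one thread writes
    // (min_t, reaction) to the buffer, then the winning reaction is executed
    // (omitted here)
}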

At my first test I called the kernel many times, until time > time_final, and in the kernel I ran only one step of the algorithm. Now I have a buffer, which is a struct array of 500000 positions, and the kernel runs the loop itself; after the kernel ends, the host writes the buffer to a file (I use streams for this) and the kernel starts again, unless the simulation time has ended. The loop is helpful because of the use of shared memory for the species and reaction arrays. The chemical models aren't a power of 2; for example, I use a model with 316 reactions and 1 thread/reaction. If I used one block for both the random numbers and the algorithm I would need more than 316*2 threads. For that reason I thought to use 2 blocks: one for the algorithm, and the other to generate random numbers, so that when the first block wants a random number it is already in the array and doesn't have to be calculated, because at each step that takes more time depending on how many threads have a > 0.

Sorry for my English. Ask me if there is something you don't understand.


Ok, I see, so you want code for blocks and not for warps… my bad… when I read "block" I thought you meant a warp :) I guess tera's advice, stating to use warps, also had something to do with it.
So I took that advice and used it for a first version :)

But I guess my technique can be adapted to blocks as well… probably best to use two kernels and have them execute at the same time/concurrently.

Perhaps later I'll look closer into this; then I could also test these ideas myself, as I am getting close to being able to compile and run my own code/kernels the way I like it.
(It will probably still take a few days before I can run them.)


The first version is about 2-3 times slower than the version with the buffer, maybe because of the use of shared memory. The problem with the other code block is that the code never ends.

My thought was: while I find the min, half of the threads at every step are idle anyway, so use them to scan the array for -1 and generate the random numbers for the next turn; if this turn uses them, that hides the cost of curand.

Maybe two kernels are better for that, as I said at the start, but my card is 1.1.


Yeah, so when do you want to end the kernels?

Perhaps when the random number is consumed?

Anyway, making the kernels end probably requires jumping/breaking out of the nested while loops.

So the nested while loops could check for another condition, for example:

while ( (RandomNumberGenerated[RandomNumberIndex] == 1) && (Running == 1) )

And for the second code block:

while ( (RandomNumberGenerated[RandomNumberIndex] == 0) && (Running == 1) )

So by setting Running to 0 anywhere inside the kernel, the kernel should end…

But I would prefer a "goto + label" to jump out of the while loop directly to the end of the kernel, so as not to upset any other calculations.
(I don't yet know if goto is possible in CUDA.)

Otherwise additional checks/branches are needed.

So either

solution 1:

while (...)
{
    // spinning
    if (Running == 0) goto KernelExit;
}

solution 2:

while (...)
{
    // spinning
}
// if still running
if (Running == 1)
{
    // do stuff
}
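For what it's worth, goto and labels do work in CUDA device code (it is C++ at heart), so solution 1 could look like this sketch (untested; the function and variable names are illustrative, not from the code above):

__global__ void consume_until_stopped(volatile int *generatedFlags,
                                      const int *values,
                                      volatile int *running,
                                      int *out)
{
    int i = threadIdx.x;

    while (generatedFlags[i] == 0)
    {
        // spinning, waiting for a number to be generated
        if (*running == 0) goto KernelExit;   // host/other thread asked us to stop
    }

    out[i] = values[i];                       // consume the random number
    generatedFlags[i] = 0;                    // mark it consumed

KernelExit:
    ;                                         // a label must be followed by a statement
}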


Ok, entire code updated for the "stop running" requirement and also the "leave calculations intact" requirement.

This is how it should look:

CodeBlock = (threadIdx.x / 32) % 2;
RandomNumberIndex = (threadIdx.x % 32) + ((threadIdx.x / 64) * 32);

Running = 1;

while (Running == 1)
{
    if (CodeBlock == 0)
    {
        while ((RandomNumberGenerated[RandomNumberIndex] == 1) && (Running == 1))
        {
            // spin and do nothing, wait for random number to be consumed
        }

        if (Running == 1)
        {
            // generate number
            RandomNumberValue[RandomNumberIndex] = curand();

            // set flag to generated
            RandomNumberGenerated[RandomNumberIndex] = 1;
        }
    }
    else
    {
        while ((RandomNumberGenerated[RandomNumberIndex] == 0) && (Running == 1))
        {
            // spin and do nothing, wait for random number to be generated
        }

        if (Running == 1)
        {
            // consume random number
            r = RandomNumberValue[RandomNumberIndex];

            // set flag to consumed
            RandomNumberGenerated[RandomNumberIndex] = 0;

            // stop running
            Running = 0;
        }
    }
}


Give it a try and let me know how it goes =D

(I haven't run any of this code yet, so I have absolutely no idea how it actually runs for real!)


I'll try this later, but is this code for one block or two? Also, the kernel runs the loop of the algorithm and the other block just updates the rand array. When the loop ends, it must change the flag so that the other block ends too.


Currently, because of the threadIdx.x usage, it is limited to the X dimension of the block.

So it will handle however many threads you can fit into the x dimension… but it should be a multiple of 64.

So the minimum width (minimum dimension x) should be 64.

So for example dimBlock(64, 1, 1) (the unused dimensions must be 1, not 0).

Anyway, try 64, 128, 192, 256, 256+64, 256+128, 256+128+64, 512, 512+64 and so forth, until your GPU cannot handle any more in the x direction.

So the maximum is probably 512 or 1024 threads per block.

There can be as many blocks as you like, but then the code probably needs to be changed a little.

The indexes need to be compensated if more than one block is to be used.

I'll try to put the formulas here:

CodeBlock = (((blockIdx.x * 64) + threadIdx.x) / 32) % 2;

RandomNumberIndex = (((blockIdx.x * 64) + threadIdx.x) % 32) + ((((blockIdx.x * 64) + threadIdx.x) / 64) * 32);

^ The 64 could be replaced with blockDim.x, but see below for general formulas; they should be much better for maximum scalability!

This would allow lots of blocks in the x direction.

So the maximum number of blocks is probably 65536 or so; check your GPU specs :)

This allows a maximum of ThreadsPerBlock * MaximumBlocksInDirectionX random numbers.

This could be increased further by using blockIdx.y, but then the formulas need further adjustment.

(The CodeBlock adjustment is probably not even necessary, but it is there just in case.)

The RandomNumberIndex is the data index, so this is what must be adjusted if even more is needed.

Of course all of this code is untested… but give it a try and see what happens.

Actually this code can probably be written as follows:

// general indexing formulas:

ThreadWidth  = blockDim.x;
ThreadHeight = blockDim.y;
ThreadDepth  = blockDim.z;

ThreadArea   = ThreadWidth * ThreadHeight;
ThreadVolume = ThreadDepth * ThreadArea;

ThreadIndex = (threadIdx.z * ThreadArea) + (threadIdx.y * ThreadWidth) + threadIdx.x;

BlockWidth  = gridDim.x;
BlockHeight = gridDim.y;
BlockDepth  = gridDim.z;

BlockArea   = BlockWidth * BlockHeight;
BlockVolume = BlockDepth * BlockArea;

BlockIndex = (blockIdx.z * BlockArea) + (blockIdx.y * BlockWidth) + blockIdx.x;

FinalIndex = (BlockIndex * ThreadVolume) + ThreadIndex;

CodeBlock = (FinalIndex / 32) % 2;

RandomNumberIndex = (FinalIndex % 32) + ((FinalIndex / 64) * 32);

This should give maximum flexibility and power.

I like these formulas much better, because the naming of the built-in indexes and such keeps tripping me up.

Perhaps my general formula code might require too many registers… but give it a try and see what happens :)

At least it scales well!

Also, to be able to determine the maximum memory allocation for the memory arrays, this is a handy formula:

FinalVolume = BlockVolume * ThreadVolume;

Usage example:

malloc/allocate/etc( …Pointer… , sizeof(ElementType) * FinalVolume );
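For instance, a host-side sketch (untested; the kernel, element types, and dimensions are placeholders, not from the code above) that sizes the arrays from the launch configuration:

#include <cuda_runtime.h>

__global__ void mykernel(int *generated, float *values) { /* placeholder body */ }

int main()
{
    // placeholder launch configuration: 2 blocks of 128 threads in x
    dim3 dimBlock(128, 1, 1);
    dim3 dimGrid(2, 1, 1);

    int threadVolume = dimBlock.x * dimBlock.y * dimBlock.z;
    int blockVolume  = dimGrid.x  * dimGrid.y  * dimGrid.z;
    int finalVolume  = blockVolume * threadVolume;

    int   *d_generated = 0;  // one "generated" flag per random number
    float *d_values    = 0;  // one value per random number

    cudaMalloc((void**)&d_generated, sizeof(int)   * finalVolume);
    cudaMalloc((void**)&d_values,    sizeof(float) * finalVolume);
    cudaMemset(d_generated, 0, sizeof(int) * finalVolume);  // all slots start "not generated"

    mykernel<<<dimGrid, dimBlock>>>(d_generated, d_values);
    cudaDeviceSynchronize();

    cudaFree(d_generated);
    cudaFree(d_values);
    return 0;
}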

For even more info and functions about general indexing formulas, see this thread at the bottom:
