General formulas for maximum scaling and flexibility (grids, blocks, threads)

Hello,

I just thought about how to achieve maximum scaling and flexibility, while still offering a simple linear index so that kernels can easily use one final index to address data elements in a unique way in one large 1D array:

// general indexing formulas:

int ThreadWidth = blockDim.x;
int ThreadHeight = blockDim.y;
int ThreadDepth = blockDim.z;

int ThreadArea = ThreadWidth * ThreadHeight;
int ThreadVolume = ThreadDepth * ThreadArea;

int ThreadIndex = (threadIdx.z * ThreadArea) + (threadIdx.y * ThreadWidth) + threadIdx.x;

int BlockWidth = gridDim.x;
int BlockHeight = gridDim.y;
int BlockDepth = gridDim.z;

int BlockArea = BlockWidth * BlockHeight;
int BlockVolume = BlockDepth * BlockArea;

int BlockIndex = (blockIdx.z * BlockArea) + (blockIdx.y * BlockWidth) + blockIdx.x;

int FinalIndex = (BlockIndex * ThreadVolume) + ThreadIndex;
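Put together, here is a minimal kernel sketch (the kernel name FillWithIndex and the Element parameter are hypothetical, just for illustration): each thread computes its unique FinalIndex and writes to its own element.

__global__ void FillWithIndex( int * Element )
{
int ThreadArea = blockDim.x * blockDim.y;
int ThreadVolume = blockDim.z * ThreadArea;
int ThreadIndex = (threadIdx.z * ThreadArea) + (threadIdx.y * blockDim.x) + threadIdx.x;

int BlockArea = gridDim.x * gridDim.y;
int BlockIndex = (blockIdx.z * BlockArea) + (blockIdx.y * gridDim.x) + blockIdx.x;

int FinalIndex = (BlockIndex * ThreadVolume) + ThreadIndex;

// every thread touches exactly one element, so no two threads collide
Element[ FinalIndex ] = FinalIndex;
}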

So a usage example for the call/launch is:

dim3 GridDim( BlocksInXDirection, BlocksInYDirection, BlocksInZDirection ); // x*y*z = total blocks
dim3 BlockDim( ThreadsInXDirection, ThreadsInYDirection, ThreadsInZDirection ); // x*y*z = threads per block, must be less than or equal to the threads-per-block limit (see GPU specs)
KernelLaunch<<< GridDim, BlockDim >>>( … );

// ^ total threads = total blocks * threads per block (the same as FinalVolume, below)

Inside kernel:

MemoryCell[ FinalIndex ] = …;

Element[ FinalIndex ] = …;

These are just very big arrays/memories which can be accessed in a 6-dimensional way, thanks to the general indexing formulas! :)

So now your problems/solutions should be able to scale up to 6 dimensions, as far as the hardware allows, while still offering easy 1D array programming! :)

Let me know how it works out for you!

Also, to determine the maximum memory allocation for the arrays, this is a handy formula:

FinalVolume = (BlockVolume * ThreadVolume);

Usage example:

malloc/cudaMalloc/GetMem/etc( …Pointer… , sizeof(ElementType) * FinalVolume );
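A minimal host-side sketch (the kernel name FillWithIndex and the int element type are carried over from the sketch above; error checking omitted):

dim3 GridDim( 4, 4, 4 ); // 64 blocks
dim3 BlockDim( 8, 8, 4 ); // 8 * 8 * 4 = 256 threads per block

int BlockVolume = GridDim.x * GridDim.y * GridDim.z;
int ThreadVolume = BlockDim.x * BlockDim.y * BlockDim.z;
int FinalVolume = BlockVolume * ThreadVolume;

int * DeviceElements = 0;
cudaMalloc( (void **) &DeviceElements, sizeof(int) * FinalVolume ); // one element per thread
FillWithIndex<<< GridDim, BlockDim >>>( DeviceElements );
cudaDeviceSynchronize();
cudaFree( DeviceElements );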

Additional note:

Each “InDirection” value in GridDim and BlockDim should always be at least 1; otherwise the products above will be zeroed out.
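(Note: CUDA’s dim3 constructor already defaults unspecified components to 1, so this requirement is met automatically when you construct the dimensions, for example:)

dim3 GridDim( 256 ); // y and z default to 1, so BlockVolume = 256 * 1 * 1
dim3 BlockDim( 16, 16 ); // z defaults to 1, so ThreadVolume = 16 * 16 * 1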

To use these formulas in an arbitrary way to manipulate and calculate new linear indexes, the following function could be used:

// could also be called 6Dto1D

int LinearIndexFrom6D
(
int ThreadX, int ThreadY, int ThreadZ, int BlockX, int BlockY, int BlockZ,
int ThreadWidth, int ThreadHeight, int ThreadDepth, int BlockWidth, int BlockHeight, int BlockDepth
)
{
int ThreadArea = ThreadWidth * ThreadHeight;
int ThreadVolume = ThreadDepth * ThreadArea;

int ThreadIndex = (ThreadZ * ThreadArea) + (ThreadY * ThreadWidth) + ThreadX;

int BlockArea = BlockWidth * BlockHeight;
int BlockVolume = BlockDepth * BlockArea; // not needed for the index itself, but handy for sizing allocations

int BlockIndex = (BlockZ * BlockArea) + (BlockY * BlockWidth) + BlockX;

int FinalIndex = (BlockIndex * ThreadVolume) + ThreadIndex;

return FinalIndex;
}

Once the 6D-to-1D linear address is calculated, it can be converted back to any other multi-dimensional scheme.

For example, 1D back to 3D for volume rendering:

// could also be called 1Dto3D

void Index3DFromLinear( int LinearIndex, int &X, int &Y, int &Z, int Width, int Height, int Depth )
{
int Area = Width * Height;

Z = LinearIndex / Area;
Y = (LinearIndex - (Z * Area)) / Width;
X = (LinearIndex - (Z * Area)) - (Y * Width);

// Depth is not needed for the conversion itself, only Width and Height.
}
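As a quick sanity check (a host-side sketch using only the function above), converting a linear index to 3D and back should reproduce the original index for every element:

int Width = 4, Height = 4, Depth = 4;
int X, Y, Z;

for( int LinearIndex = 0; LinearIndex < Width * Height * Depth; LinearIndex++ )
{
Index3DFromLinear( LinearIndex, X, Y, Z, Width, Height, Depth );
int Reconstructed = (Z * Width * Height) + (Y * Width) + X;
// Reconstructed == LinearIndex for every element, so the mapping is a bijection
}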

Example usage:

FinalIndex = LinearIndexFrom6D( threadIdx.x, threadIdx.y, threadIdx.z, blockIdx.x, blockIdx.y, blockIdx.z, blockDim.x, blockDim.y, blockDim.z, gridDim.x, gridDim.y, gridDim.z );

Index3DFromLinear( FinalIndex, VolumeX, VolumeY, VolumeZ, VolumeWidth, VolumeHeight, VolumeDepth ); // the volume’s dimensions must be passed as well

Volume3D[ VolumeZ ][ VolumeY ][ VolumeX ] = …; // special 3D structure: a pointer array to pointer arrays to element arrays
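For completeness, one way to build such a 3D pointer structure on the host (a sketch; the int element type is an assumption):

int *** Volume3D = new int ** [ Depth ];
for( int Z = 0; Z < Depth; Z++ )
{
Volume3D[ Z ] = new int * [ Height ]; // one pointer array per slice
for( int Y = 0; Y < Height; Y++ )
{
Volume3D[ Z ][ Y ] = new int [ Width ]; // one element array per row
}
}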

(All pseudocode is untested, but in theory it should work.)


As a general question, do you know how fast the gridDim and blockDim functions are? If I know that I have operations that require different numbers of threads, does it take a lot of overhead to use these functions 2 or 3 times? Also, if I use these to get, say, 256 threads and then again for 128, do the remaining 128 threads get freed?

Thanks for your post, it is extremely useful!

Again you are posting this kind of extremely useful information! Keep up the good work! You have really taken GPU computing to new levels!

The gridDim and blockDim functions usually have a huge latency, but thankfully the threadIdx.x function has been optimized by Nvidia to be really fast (since it is used so frequently).

Glad you appreciate it! :) (I am hoping that NVIDIA might notice it too and perhaps implement some special hardware logic to perform these kinds of calculations really fast… I am kind of surprised that these simple calculations/functions are not yet present… at least it would allow some basic distribution techniques… on the other hand, every solution might need different indexing tricks.)

Anyway, I haven’t tried these formulas/functions out yet… I am just starting with CUDA C (4.0)… though at least one or two other people have tried the formulas out… and they work! :)

I don’t know yet about the functions, but they should work too.

If the functions and/or formulas require too many registers or are too slow, then one could try to replace some of the variables with threadIdx.x, threadIdx.y, and such… the formulas will become longer and will do some extra multiplications, but might require fewer registers… perhaps in the future, when I need these formulas myself, I might try that and update this thread.

For now I am busy converting my code from Delphi to C++. :)

I will do this for just one or two projects, I guess… maybe I will then code something new in C++, or maybe I will try to get a Pascal-to-CUDA/PTX compiler working… which would be a lot of work and would probably be doomed to fail, since I have never written a compiler before and have no experience with the algorithms… but it would be nice to have… and I can try anyway (with, for example, the Free Pascal open-source compiler).

It’s a lot of work converting Delphi to C++… so far I am taking it easy/relaxed about it… it’s also a little bit of fun, since I haven’t programmed in C++ for a long time, so I can now freshen up my C++ skills. So far it’s going well… I do fear the lack of the “with” statement, but nothing a copy & paste operation can’t fix. :) Also, I wish CUDA C/C++ had properties like Delphi, but OK… so far Get/Set will do the job.

I also have some nice code available for automatically calculating the block and thread dimensions for a given problem size.

However, it does require twice the code above, and it’s pretty smart. :)
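(The core of any such calculation, in its simplest 1D form, is just round-up division; this is a trivial sketch of the general idea, not the full code being offered here, and ProblemSize is a hypothetical example value:)

int ProblemSize = 1000000; // hypothetical example value
int ThreadCount = 256; // = mMaxThreadsPerBlock
int BlockCount = (ProblemSize + ThreadCount - 1) / ThreadCount; // round up so the grid covers the whole problem

dim3 GridDim( BlockCount );
dim3 BlockDim( ThreadCount );
// BlockCount * ThreadCount >= ProblemSize, so the padded threads must be guarded (see the technique further below).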

And I am not Santa Claus! LOL! =D

So if anybody is interested I will make you the following deal:

  1. You purchase my UDP File Transfer or UDP Speed Test application activation keys, and I will give you the pseudo code and real code for free. An added benefit for you:
    you get two nice working products too, which will be updated in the future to work even better; activation codes will always remain working, because I don’t believe in paying for updates! =D

To give you a sense of the usefulness of the code (which executes on the host!):

You simply specify:

// Problem distribution code (by Skybuck Flying, yours truly!)
mProblemSize := … as big as you want …;
mPartialProblemSolvingEnabled := … false / true …;

// Thread dimension/distribution code (by Skybuck Flying, yours truly!)
mMaxThreadsPerBlock := … must be set to the device’s max-threads-per-block limit …;

// problem dimension spread/distribution code
mMaxThreadWidth := … whatever you want…;
mMaxThreadHeight := … whatever you want…;
mMaxThreadDepth := … whatever you want…;

// Block dimension/distribution code (by Skybuck Flying, yours truly!)
mMaxBlockWidth := … should be set to device max block width or less…;
mMaxBlockHeight := … should be set to device max block height or less …;
mMaxBlockDepth := … should be set to device max block depth or less…;

The code automatically calculates:

// adjusted if necessary.
mMaxThreadWidth
mMaxThreadHeight
mMaxThreadDepth

mMaxThreadArea
mMaxThreadVolume

mPartialProblemSolvingEnabled
mProblemSize
mProblemVolume
mRemainingProblemSize

mBlockCount

mBlockWidth
mBlockHeight
mBlockDepth

mBlockArea
mBlockVolume

mBlocksRemaining (already included in solution dimension, just extra info)

mThreadCount

mThreadWidth
mThreadHeight
mThreadDepth

mThreadArea
mThreadVolume

mThreadsRemaining (already included in solution dimension, just extra info)

Plus a technique for making sure padded kernels/threads are not executed (see the sketch below).
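For reference, the usual guard looks like this (a minimal sketch of one common approach, not necessarily the exact technique in the code above; the kernel name and the ProblemSize parameter are hypothetical): any thread whose FinalIndex falls into the padding simply returns without doing work.

__global__ void GuardedKernel( int * Element, int ProblemSize )
{
int ThreadArea = blockDim.x * blockDim.y;
int ThreadVolume = blockDim.z * ThreadArea;
int ThreadIndex = (threadIdx.z * ThreadArea) + (threadIdx.y * blockDim.x) + threadIdx.x;

int BlockArea = gridDim.x * gridDim.y;
int BlockIndex = (blockIdx.z * BlockArea) + (blockIdx.y * gridDim.x) + blockIdx.x;

int FinalIndex = (BlockIndex * ThreadVolume) + ThreadIndex;

if (FinalIndex >= ProblemSize) return; // padded threads do nothing

Element[ FinalIndex ] = FinalIndex;
}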

Products can be viewed here:

http://members.home.nl/hbthouppermans/Skybuck/

Activation keys can be bought directly here:

OK, it’s working again! =D

https://order.kagi.com/cgi-bin/store.cgi?storeID=P4X&&


I will automatically see your e-mail and will send the code to that e-mail address.

I’m stunned. How could I possibly have missed this business model?

Pure genius, I will order right away! Keep a lookout in your inbox!