Size limitation for 1D Arrays in CUDA?

blade613x · October 16, 2013, 2:53am

If the maximum number of blocks in any dimension is 65535, and the maximum number of threads per block is 1024, does that the maximum number of array elements I should have is

1D : 65535 * 1024
2D : 65535 * 65535 * 1024

I ask this because in a calculation I am doing for a 1D array, if the number of elements goes above (65535 * 1024), is it impossible to calculate beyond that many using a single execution of 1D kernel?

Sorry if this a stupid and obvious question given that I cannot set more than <<<65535,1024>>>, but I just want to be sure before I move onto modifying my kernel that it turns into a 2D kernel for something like <<<dim3(dimX,dimY),1024>>>.

Thanks.

CudaaduC · October 16, 2013, 3:31am

I believe (at least for compute capability>=3.0) the max grid dimensions are 2147483647x65535x65535,
so for a that should cover any 1-D array of sizes up to 2147483647x1024 = 2.1990233e+12.

I have used arrays of 1 billion elements with no problem,so if that is not enough then you need an additional GPU.

1,000,000,000/1024=976562.5, and round up to 976563 blocks. Just make sure that if threadIdx.x+blockIdx.x*blockDim.x>= number of elements you return from kernel without processing.

<<<(976563,1,1),(1024,1,1)>>>

But remember that 2147483647 times anything more than 2 will not fit inside an unsigned 32 bit int,and you will have to use unsigned long long type.

Also keep that in mind when allocating memory, especially with 64 bit types.

Also CUDA-Z will give you this specific information about all your connected GPUs.

sBc-Random · October 16, 2013, 6:04am

Hi blade,
It seems to me you maybe haven’t quite got the logic of cuda programming just yet (don’t worry, until very recently I hadn’t either! :P)

Very rarely will only 1 element be calculated per thread - in fact this mostly leads to quite slow code. In fact, each thread should work on 10+ elements - but can calculate as many elements as you wish. The only thing to worry about is launch overhead (each block carries a certain fixed cost of initialization time) versus occupancy (the amount of blocks you need to launch to “max out” the processors. In a titan this would be ~= 14*2048 =28672 threads, in other words, 56 blocks of 512 threads, or even 112 blocks of 256 threads. I would consider launching any more than 50,000 threads wasteful and under optimized. So to answer your question, there is no size limitation.

If you are calculating

void main()
{
out[i]=in_a[i]*in_b[i] ;
}

for example, try this instead

void main()
{
for (int i=0; i<10; i++) out[i]=in_a[i]*in_b[i];
}

pasoleatis · October 16, 2013, 8:32am

Hello,

This limitation for the number of blocks is for cc 2.0. For cc 3.x you can millions. One of the optimizations is to sometime you 1 thread wit more than 1 element in order to optimize the memory transfers.

blade613x · October 16, 2013, 3:32pm

When I go over 65535 blocks, I get “CUDA unknown error” and the calculation aborts. For example, in this code.

numBlocks = ceil(N/1024) = 62500; OK
numBlocks = ceil(N/512) = 125000; Gives error and aborts

I have CUDA 5.5 installed, and doing all of this through Visual Studio Pro 2012. GTX Titan Nvida 327.23 drivers Windows 7 Pro 64-bit, 32gb RAM, if that helps you or anyone else out.

sBc-Random:

Hi blade,
It seems to me you maybe haven’t quite got the logic of cuda programming just yet (don’t worry, until very recently I hadn’t either! :P)

Very rarely will only 1 element be calculated per thread - in fact this mostly leads to quite slow code. In fact, each thread should work on 10+ elements - but can calculate as many elements as you wish. The only thing to worry about is launch overhead (each block carries a certain fixed cost of initialization time) versus occupancy (the amount of blocks you need to launch to “max out” the processors. In a titan this would be ~= 14*2048 =28672 threads, in other words, 56 blocks of 512 threads, or even 112 blocks of 256 threads. I would consider launching any more than 50,000 threads wasteful and under optimized. So to answer your question, there is no size limitation.
}

I just want to point out, a lot of my confusion is that it seems the terms “blocks”, “grids”, and “threads” are being used interchangeably around the internet. I honestly don’t even know where the term “grids” fits into any of this, or if it even does for a 1D array. I see as : I have an array N elements long I divide up into n blocks. Each of these blocks contain q elements. So assign numBlocks = n, and numThreads = q, and try to split up the array N such that I get 1024/512/256/etc elements per block. But anyway, I see sample codes with kernel executions like <<<numBlocks,numThreads>>> or <<<gridSize,blockSize>>> and it just confuses me.

I think I get what you are suggesting, but I thought queuing up blocks was the same thing as having each thread end up calculating more than one element. Each thread should actually do roughly N/28,672 calculations after all blocks have been exhausted, or is that incorrect? My belief is that, by queuing up blocks like I do, I may waste some blocks at the end, but when I’m going through tens of thousands of blocks, it’s not going to effect compute time much? Unless the queuing of blocks themselves is adding to compute time?

For me to have each thread calculate more than one element in the kernel in a “single go” is not something I have really thought about. It will be a challenge. The reason is that my kernel is setup to do the following right now…

index = threadIdx.x+blockIdx.x*blockDim.x
use device function to obtain i, j, k coordinates from index
use device function and i, j, k, coordinates to obtain 1D array indices for (i-1),(i+1),(j-1),(j+1),(k-1),(k+1) and assign them each to their own variable
if(index < NxNyxNz) {Anew[index] = … Aold(index,(i-1),(i+1),(j-1),(j+1),(k-1),(k+1))}

Now I think it’s definitely possible, but I’ll have to give it some thought. But compute time is not something that is bothering me. I’ll give it a shot anyway though, as it would prevent me from ever reaching such a block limit that I seem to be having. Now I’m more worried as to why I’m having such a limit as everyone so far has suggested I shouldn’t be.

CudaaduC · October 16, 2013, 5:22pm

blade613x:

CudaaduC:

I believe (at least for compute capability>=3.0) the max grid dimensions are 2147483647x65535x65535,
so for a that should cover any 1-D array of sizes up to 2147483647x1024 = 2.1990233e+12.

I have used arrays of 1 billion elements with no problem,so if that is not enough then you need an additional GPU.

1,000,000,000/1024=976562.5, and round up to 976563 blocks. Just make sure that if threadIdx.x+blockIdx.x*blockDim.x>= number of elements you return from kernel without processing.

<<<(976563,1,1),(1024,1,1)>>>

When I go over 65535 blocks, I get “CUDA unknown error” and the calculation aborts. For example, in this code.

numBlocks = ceil(N/1024) = 62500; OK
numBlocks = ceil(N/512) = 125000; Gives error and aborts

I have CUDA 5.5 installed, and doing all of this through Visual Studio Pro 2012. GTX Titan Nvida 327.23 drivers Windows 7 Pro 64-bit, 32gb RAM, if that helps you or anyone else out.

sBc-Random:

Hi blade,
It seems to me you maybe haven’t quite got the logic of cuda programming just yet (don’t worry, until very recently I hadn’t either! :P)

Very rarely will only 1 element be calculated per thread - in fact this mostly leads to quite slow code. In fact, each thread should work on 10+ elements - but can calculate as many elements as you wish. The only thing to worry about is launch overhead (each block carries a certain fixed cost of initialization time) versus occupancy (the amount of blocks you need to launch to “max out” the processors. In a titan this would be ~= 14*2048 =28672 threads, in other words, 56 blocks of 512 threads, or even 112 blocks of 256 threads. I would consider launching any more than 50,000 threads wasteful and under optimized. So to answer your question, there is no size limitation.
}

I just want to point out, a lot of my confusion is that it seems the terms “blocks”, “grids”, and “threads” are being used interchangeably around the internet. I honestly don’t even know where the term “grids” fits into any of this, or if it even does for a 1D array. I see as : I have an array N elements long I divide up into n blocks. Each of these blocks contain q elements. So assign numBlocks = n, and numThreads = q, and try to split up the array N such that I get 1024/512/256/etc elements per block. But anyway, I see sample codes with kernel executions like <<<numBlocks,numThreads>>> or <<<gridSize,blockSize>>> and it just confuses me.

I think I get what you are suggesting, but I thought queuing up blocks was the same thing as having each thread end up calculating more than one element. Each thread should actually do roughly N/28,672 calculations after all blocks have been exhausted, or is that incorrect? My belief is that, by queuing up blocks like I do, I may waste some blocks at the end, but when I’m going through tens of thousands of blocks, it’s not going to effect compute time much? Unless the queuing of blocks themselves is adding to compute time?

For me to have each thread calculate more than one element in the kernel in a “single go” is not something I have really thought about. It will be a challenge. The reason is that my kernel is setup to do the following right now…

index = threadIdx.x+blockIdx.x*blockDim.x

use device function to obtain i, j, k coordinates from index

use device function and i, j, k, coordinates to obtain 1D array indices for (i-1),(i+1),(j-1),(j+1),(k-1),(k+1) and assign them each to their own variable

if(index < NxNyxNz) {Anew[index] = … Aold(index,(i-1),(i+1),(j-1),(j+1),(k-1),(k+1))}

Now I think it’s definitely possible, but I’ll have to give it some thought. But compute time is not something that is bothering me. I’ll give it a shot anyway though, as it would prevent me from ever reaching such a block limit that I seem to be having. Now I’m more worried as to why I’m having such a limit as everyone so far has suggested I shouldn’t be.

I think you have some other errors on your code, as a Titan should easily handle large arrays.

If you are trying to implement some type of graph or dynamic programming algorithm make sure you avoid race conditions.

I work with these types of problems all the time, and here is an example of how I handle them:

https://github.com/OlegKonings/CUDA_Dynamic_Programming_Example_6/blob/master/EXP3/EXP3/EXP3.cu

blade613x · October 16, 2013, 6:18pm

Well, I don’t believe I am running into any sort of physical limitation with Titan. Last night I made a 2D kernel modified version of my code.

So my call was modified:

numThreads = 1024;
numBlocks = ceil(N/numThreads);
a = numBlocks;
b = ceil(sqrt(numBlocks));

original call : <<<a,numThreads>>>
new call: <<<dim3(b,b),numThreads>>>

Then I utilized
0) ix = threadIdx.x+blockIdx.xblockDim.x
iy = threadIdx.y+blockIdx.yblockDim.y

index = some function of ix and iy I can’t remember off the top of my head
use device function to obtain i, j, k coordinates from index
use device function and i, j, k, coordinates to obtain 1D array indices for (i-1),(i+1),(j-1),(j+1),(k-1),(k+1) and assign them each to their own variable
if(index < NxNyxNz) {Anew[index] = … Aold(index,(i-1),(i+1),(j-1),(j+1),(k-1),(k+1))}

So the same exact thing as before, with just an extra step to utilize a second dimension in the kernel. It seemed to work from quick debug purpose calculated output values in the command window, but I haven’t confirmed with full numerical output. But it at least ran a calculation with 128M elements, something I could not do before (in the first case, limited to 65535*1024 elements). VRAM usage was around 3.7gb with the 128M element calculation.

pasoleatis · October 16, 2013, 9:47pm

It is not enough to use a Titan card you also have to use the -arch=sm 30 or sm_35 to be able ot use more than 6500 blocks.

CudaaduC · October 16, 2013, 9:54pm

I usually use this method for 2D indexing:

m=1024;
n=3333;
const int THREADS=256;

dim3 dimGrid((m+THREADS-1)/THREADS,n,1);

<<<dimGrid,THREADS>>>

then indexing like this;

const int i=threadIdx.x+blockIdx.x*blockDim.x;
const int j=blockIdx.y;

THREADS stays 1-D, dimGrid is 2D. This way you get ‘i’ ranging from 0 to m-1 (returning if not divisible by THREADS and >= m), and ‘j’ ranges from 0 to n-1 with no extra(due to block indexing).

In this example 1024*3333=3,412,992 elements in a 2D array.

Also make sure you cast your number to 32 bit ints when doing division.

sBc-Random · October 17, 2013, 12:45am

Blade:

Grids Blocks and Threads are hierarchical

On a gpu you have a grid of blocks of threads. The thread is the smallest unit (like a city), the block is the next up (a state), and the grid is the overarching structure (a country)

There is only one “grid” per kernel - that is its entire structure. It is a 3 dimensional array of blocks. Your grid is (65535,1,1)

The blocks are basically a set of threads, with the restriction that a block will not execute until its entirety will fit on the gpu at once. The reason for blocks are essentially the hardware/architecture setup. Blocks can also be three dimensional, but most commonly are one dimensional. For example (512,1,1) or (256,1,1).

Threads are the actual processing power. Each thread works through its code top-to-bottom and then returns.

Topic		Replies	Views
Grid dimensions CUDA Programming and Performance	6	5415	September 18, 2009
Max blocks per grid CUDA Programming and Performance	3	14706	August 3, 2009
Probably a simple answer Simple CUDA code - unexpected result CUDA Programming and Performance	7	4852	October 27, 2010
Max Dimension of GridSize and BlockSize CUDA Programming and Performance	8	10106	June 19, 2011
Urgent help with threads please! CUDA Programming and Performance	21	10784	March 6, 2008
Newbie help on thread blocks CUDA Programming and Performance	22	10595	December 24, 2008
How many can use Blocks to effcient parallel prog CUDA Programming and Performance	8	5789	December 12, 2009
Maximum number of threads on thread block CUDA Programming and Performance	12	72126	September 21, 2023
help with some cuda programming CUDA Programming and Performance	9	1817	August 31, 2009
Thread Number Limitation CUDA Programming and Performance	3	3887	December 22, 2008

Size limitation for 1D Arrays in CUDA?

Related topics