Size limitation for 1D Arrays in CUDA?

If the maximum number of blocks in any dimension is 65535, and the maximum number of threads per block is 1024, does that mean the maximum number of array elements I can have is

1D : 65535 * 1024
2D : 65535 * 65535 * 1024

I ask this because, in a calculation I am doing on a 1D array, if the number of elements goes above 65535 * 1024, is it impossible to go beyond that many in a single launch of a 1D kernel?

Sorry if this is a stupid and obvious question given that I cannot set more than <<<65535,1024>>>, but I just want to be sure before I move on to modifying my kernel into a 2D kernel with something like <<<dim3(dimX,dimY),1024>>>.


I believe (at least for compute capability >= 3.0) the max grid dimensions are 2147483647 x 65535 x 65535,
so that should cover any 1-D array of up to 2147483647 * 1024 ≈ 2.199e+12 elements.

I have used arrays of 1 billion elements with no problem, so if that is not enough then you need an additional GPU.

1,000,000,000 / 1024 = 976562.5, which rounds up to 976563 blocks. Just make sure that if threadIdx.x + blockIdx.x * blockDim.x >= the number of elements, you return from the kernel without processing.


But remember that 2147483647 times anything more than 2 will not fit inside an unsigned 32-bit int, so you will have to use the unsigned long long type for such index math.

Also keep that in mind when allocating memory, especially with 64-bit types.

Also, CUDA-Z will give you this specific information about all your connected GPUs.
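If you would rather query those limits programmatically than through CUDA-Z, the runtime API exposes them (a minimal sketch for device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("max grid: %d x %d x %d, max threads/block: %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
           prop.maxThreadsPerBlock);
    return 0;
}
```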

Hi blade,
It seems to me you maybe haven't quite got the logic of CUDA programming just yet (don't worry, until very recently I hadn't either! :P)

Very rarely will only one element be calculated per thread - in fact this mostly leads to quite slow code. Each thread should work on 10+ elements, but can calculate as many elements as you wish. The only things to balance are launch overhead (each block carries a certain fixed cost of initialization time) versus occupancy (the number of blocks you need to launch to "max out" the processors). On a Titan that would be ~14 * 2048 = 28672 threads, in other words 56 blocks of 512 threads, or even 112 blocks of 256 threads. I would consider launching much more than ~50,000 threads wasteful and under-optimized. So to answer your question, there is no size limitation.

If you are calculating

__global__ void mul(float *out, const float *in_a, const float *in_b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = in_a[i] * in_b[i];
}

for example, try this instead (each thread handles 10 elements)

__global__ void mul(float *out, const float *in_a, const float *in_b)
{
    int base = 10 * (threadIdx.x + blockIdx.x * blockDim.x);
    for (int k = 0; k < 10; k++)
        out[base + k] = in_a[base + k] * in_b[base + k];
}


This limitation on the number of blocks is for cc 2.0. For cc 3.x you can launch millions of blocks. One common optimization is to have one thread process more than one element, in order to optimize the memory transfers.
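One common way to do that is the well-known grid-stride loop, which lets a fixed-size launch cover an array of any length (a sketch, with made-up names):

```cuda
__global__ void mul_strided(float *out, const float *a, const float *b, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;   // total threads in the grid
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x; i < n; i += stride)
        out[i] = a[i] * b[i];   // each thread walks the array in grid-sized steps
}
```

With this pattern the grid size no longer has to match the array size, so the 65535-block limit stops mattering.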

When I go over 65535 blocks, I get "CUDA unknown error" and the calculation aborts. For example:

numBlocks = ceil(N/1024) = 62500; OK
numBlocks = ceil(N/512) = 125000; Gives error and aborts

I have CUDA 5.5 installed, and am doing all of this through Visual Studio Pro 2012, on a GTX Titan with Nvidia 327.23 drivers, Windows 7 Pro 64-bit, 32 GB RAM, if that helps you or anyone else out.
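For tracking this down: wrapping the launch with explicit error checks usually turns "unknown error" into a specific message (a minimal sketch; `kernel` and its arguments are placeholders):

```cuda
kernel<<<numBlocks, numThreads>>>(d_data, N);     // placeholder launch
cudaError_t err = cudaGetLastError();             // catches bad launch configurations
if (err != cudaSuccess)
    printf("launch: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                    // catches errors during execution
if (err != cudaSuccess)
    printf("run: %s\n", cudaGetErrorString(err));
```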

I just want to point out that a lot of my confusion is that the terms "blocks", "grids", and "threads" seem to be used interchangeably around the internet. I honestly don't even know where the term "grids" fits into any of this, or if it even does for a 1D array. I see it as: I have an array N elements long that I divide up into n blocks. Each of these blocks contains q elements. So I assign numBlocks = n and numThreads = q, and try to split up the array N such that I get 1024/512/256/etc. elements per block. But then I see sample codes with kernel executions like <<<numBlocks,numThreads>>> or <<<gridSize,blockSize>>> and it just confuses me.

I think I get what you are suggesting, but I thought queuing up blocks was the same thing as having each thread end up calculating more than one element. Each thread should actually do roughly N/28,672 calculations after all blocks have been exhausted, or is that incorrect? My belief is that, by queuing up blocks like I do, I may waste some threads in the last block, but when I'm going through tens of thousands of blocks, it's not going to affect compute time much? Unless the queuing of the blocks itself adds to compute time?

For me to have each thread calculate more than one element in the kernel in a “single go” is not something I have really thought about. It will be a challenge. The reason is that my kernel is setup to do the following right now…

  1. index = threadIdx.x + blockIdx.x * blockDim.x
  2. use a device function to obtain i, j, k coordinates from index
  3. use a device function and the i, j, k coordinates to obtain 1D array indices for (i-1),(i+1),(j-1),(j+1),(k-1),(k+1) and assign them each to their own variable
  4. if (index < Nx*Ny*Nz) { Anew[index] = … Aold(index,(i-1),(i+1),(j-1),(j+1),(k-1),(k+1)) }
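For what it's worth, those steps might look roughly like the following. This is only a sketch: the actual device functions for the i, j, k mapping aren't shown in the post, so the row-major decoding and the 6-point averaging rule here are assumptions.

```cuda
__global__ void update(float *Anew, const float *Aold, int Nx, int Ny, int Nz)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= Nx * Ny * Nz) return;        // the bounds guard, moved up front
    int i = index % Nx;                       // recover i, j, k (assumes row-major:
    int j = (index / Nx) % Ny;                // index = i + j*Nx + k*Nx*Ny)
    int k = index / (Nx * Ny);
    // neighbor indices along each axis
    int xm = index - 1,       xp = index + 1;
    int ym = index - Nx,      yp = index + Nx;
    int zm = index - Nx * Ny, zp = index + Nx * Ny;
    // hypothetical 6-point stencil update, interior points only for brevity
    if (i > 0 && i < Nx - 1 && j > 0 && j < Ny - 1 && k > 0 && k < Nz - 1)
        Anew[index] = (Aold[xm] + Aold[xp] + Aold[ym] + Aold[yp]
                     + Aold[zm] + Aold[zp]) / 6.0f;
}
```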

Now I think it’s definitely possible, but I’ll have to give it some thought. But compute time is not something that is bothering me. I’ll give it a shot anyway though, as it would prevent me from ever reaching such a block limit that I seem to be having. Now I’m more worried as to why I’m having such a limit as everyone so far has suggested I shouldn’t be.

I think you have some other errors in your code, as a Titan should easily handle large arrays.

If you are trying to implement some type of graph or dynamic programming algorithm make sure you avoid race conditions.

I work with these types of problems all the time, and here is an example of how I handle them:
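In a stencil-style update, the simplest way to avoid a race is double buffering: read from one array, write to the other, and swap on the host between steps (a minimal sketch, with a made-up update rule):

```cuda
// Each step reads only src and writes only dst, so no thread can race
// with another thread's write.
__global__ void step(float *dst, const float *src, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i <= 0 || i >= n - 1) return;            // skip boundary and excess threads
    dst[i] = 0.5f * (src[i - 1] + src[i + 1]);   // hypothetical update rule
}
```

On the host: launch `step`, swap the source/destination device pointers, and repeat for as many iterations as needed.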

Well, I don’t believe I am running into any sort of physical limitation with Titan. Last night I made a 2D kernel modified version of my code.

So my call was modified:

numThreads = 1024;
numBlocks = ceil(N/numThreads);
a = numBlocks;
b = ceil(sqrt(numBlocks));

original call : <<<a,numThreads>>>
new call: <<<dim3(b,b),numThreads>>>

Then I utilized:

  1. ix = threadIdx.x + blockIdx.x * blockDim.x
  2. iy = threadIdx.y + blockIdx.y * blockDim.y
  3. index = some function of ix and iy I can't remember off the top of my head
  4. use a device function to obtain i, j, k coordinates from index
  5. use a device function and the i, j, k coordinates to obtain 1D array indices for (i-1),(i+1),(j-1),(j+1),(k-1),(k+1) and assign them each to their own variable
  6. if (index < Nx*Ny*Nz) { Anew[index] = … Aold(index,(i-1),(i+1),(j-1),(j+1),(k-1),(k+1)) }

So it is the same exact thing as before, with just an extra step to utilize a second dimension in the kernel. It seemed to work based on quick debug output values in the command window, but I haven't confirmed with full numerical output. But it at least ran a calculation with 128M elements, something I could not do before (in the first case I was limited to 65535*1024 elements). VRAM usage was around 3.7 GB for the 128M element calculation.
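One common choice for the "some function of ix and iy" (an assumption here, not necessarily what was actually used) is a row-major flattening over the grid's x extent:

```cuda
__global__ void kernel2d(float *Anew, const float *Aold, size_t n)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    // row-major over the grid's full x width of gridDim.x * blockDim.x threads
    size_t index = ix + (size_t)iy * gridDim.x * blockDim.x;
    if (index < n)
        Anew[index] = Aold[index];   // placeholder for the real update
}
```

The cast to size_t matters: for a large grid the product can exceed 32 bits.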

It is not enough to use a Titan card; you also have to compile with -arch=sm_30 or -arch=sm_35 to be able to use more than 65535 blocks.

I usually use this method for 2D indexing:

const int THREADS=256;

dim3 dimGrid((m+THREADS-1)/THREADS,n,1);


then indexing like this:

const int i=threadIdx.x+blockIdx.x*blockDim.x;
const int j=blockIdx.y;

THREADS stays 1-D, dimGrid is 2-D. This way you get 'i' ranging from 0 to m-1 (returning early if m is not divisible by THREADS and i >= m), and 'j' ranging from 0 to n-1 with no extras (due to the block indexing).

In this example 1024*3333=3,412,992 elements in a 2D array.

Also make sure you cast your numbers correctly when doing the division.


Grids, Blocks and Threads are hierarchical

On a GPU you have a grid of blocks of threads. The thread is the smallest unit (like a city), the block is the next level up (a state), and the grid is the overarching structure (a country).

There is only one "grid" per kernel launch - that is the launch's entire structure, a 3-dimensional array of blocks. Your grid is (65535,1,1).

Blocks are basically sets of threads, with the restriction that a block will not begin executing until it fits on the GPU in its entirety. Blocks exist essentially because of the hardware/architecture setup. They can also be three-dimensional, but are most commonly one-dimensional, for example (512,1,1) or (256,1,1).

Threads are the actual processing power. Each thread works through its code top-to-bottom and then returns.
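In launch-configuration terms, that hierarchy is exactly what the triple-chevron syntax describes (a sketch; `kernel` and its argument are placeholders):

```cuda
dim3 block(512, 1, 1);             // threads per block
dim3 grid(65535, 1, 1);            // blocks per grid
kernel<<<grid, block>>>(d_data);   // one grid per kernel launch
```

So <<<numBlocks,numThreads>>> and <<<gridSize,blockSize>>> are the same thing: the first argument sizes the grid in blocks, the second sizes each block in threads.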