CUDA program crashing with a fairly large stream

I have been experimenting with how large I can make my streams while my kernels still execute correctly. I have run into a few problems:

  1. If I run the code from a shell in X, X pegs a core and my computer essentially freezes. I am running the code on the same GPU that X is running on.

  2. Running in runlevel 3 with no GUI, the system is much happier, but for some reason I can only create streams that are about 1/3 the size of global memory before my kernels start to fail and stop doing the computations. It seems like I should be able to create a stream that is just under the size of global memory without problems, but that isn’t the case here. Why not?

Anyway, this is the output:

global memory (B) : 804585472

run 1:

stream size (elements), stream size (B), % of global memory, computed pi, time (s), time/block (s)

16770560, 268328960, 33.349963, 3.141593, 0.427246, 0.000007

16836096, 269377536, 33.480288, 0.000000, 0.343174, 0.000005

run 2:

16770560, 268328960, 33.349963, 3.141593, 0.424878, 0.000006

16836096, 269377536, 33.480288, 0.000000, 0.340525, 0.000005

run 3:

16770560, 268328960, 33.349963, 3.141593, 0.427977, 0.000007

16836096, 269377536, 33.480288, 0.000000, 0.344303, 0.000005

I think you are using a 1D grid with 512 threads per block.
The maximum size for the grid is 32768, and if you are using a simple 1:1 mapping, your approach will work for streams up to 32768*512=16777216.

You need to change the mapping so that each thread processes more than one element of the stream; look at the Black-Scholes example in the SDK.

I am using a 1D grid with 256 threads per block.

#define BLOCK_SIZE 256


numBlocks = numElements / BLOCK_SIZE;

if(numBlocks * BLOCK_SIZE < numElements) numBlocks++;

calcPi<<<numBlocks, BLOCK_SIZE>>>(devMem);
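
Plugging the two stream sizes from the output into this calculation shows where the cutoff sits. The sketch below is plain host-side C, not code from the program in question: the helper names `blocks_needed` and `fits_1d_grid` are made up for illustration, and the limit of 65535 blocks per grid dimension is an assumption based on the figure the Programming Guide lists for compute capability 1.x devices. The `maxGridSize` field filled in by `cudaGetDeviceProperties` should give the real value for a particular card.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 256
/* Assumption: per-dimension grid limit for compute capability 1.x devices. */
#define ASSUMED_MAX_GRID_DIM 65535

/* Number of blocks needed for a 1:1 thread-to-element mapping (rounds up). */
static size_t blocks_needed(size_t numElements, size_t blockSize)
{
    assert(blockSize > 0);
    return (numElements + blockSize - 1) / blockSize;
}

/* Returns 1 if a 1D grid of BLOCK_SIZE-thread blocks stays within the assumed limit. */
static int fits_1d_grid(size_t numElements)
{
    return blocks_needed(numElements, BLOCK_SIZE) <= ASSUMED_MAX_GRID_DIM;
}
```

With these definitions, 16770560 elements need 65510 blocks and 16836096 elements need 65766, so the last working run and the first failing run straddle 65535. That fits a limit of 65535 per dimension better than 32768, and it suggests the failing launch is rejected outright rather than computing garbage; calling `cudaGetLastError()` right after the launch should confirm whether that is the case.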

I think I see what you mean. Instead of doing a 1:1 mapping I could do something like 1:4, so that the first thread would work on the first four elements of the stream instead of just the first element. Then I would be able to handle streams four times bigger.
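
A rough host-side sketch of the 1:4 idea, under some assumptions: `blocks_for` and `covers_once` are hypothetical helper names, and the mapping simulated here is a strided one (thread `g` visits elements `g`, `g + T`, `g + 2T`, ... where `T` is the total number of threads launched) rather than four consecutive elements per thread. The strided form is a swapped-in variant chosen because neighbouring threads then touch neighbouring elements on each pass; whether the real kernel should do the same depends on code not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Blocks needed when each thread handles elemsPerThread elements (rounds up). */
static size_t blocks_for(size_t numElements, size_t blockSize, size_t elemsPerThread)
{
    assert(blockSize > 0 && elemsPerThread > 0);
    size_t perBlock = blockSize * elemsPerThread;
    return (numElements + perBlock - 1) / perBlock;
}

/* Simulates the strided mapping on the host and returns 1 if every
   element is visited exactly once, 0 otherwise. */
static int covers_once(size_t numElements, size_t numBlocks, size_t blockSize)
{
    unsigned char *hits = calloc(numElements ? numElements : 1, 1);
    if (!hits) return 0;

    size_t totalThreads = numBlocks * blockSize;
    for (size_t g = 0; g < totalThreads; ++g)              /* g plays the role of the global thread id */
        for (size_t i = g; i < numElements; i += totalThreads)
            hits[i]++;                                     /* the loop bound doubles as the range check */

    int ok = 1;
    for (size_t i = 0; i < numElements; ++i)
        if (hits[i] != 1) ok = 0;

    free(hits);
    return ok;
}
```

With 256 threads per block and four elements per thread, the 16836096-element stream that failed above needs 16442 blocks instead of 65766. In the kernel itself the inner loop would start at `blockIdx.x * blockDim.x + threadIdx.x` and step by `gridDim.x * blockDim.x`.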

Question: so is 32768 the max grid size, period, even if I use a 2D grid for example?

It is in the Programming Guide, and from memory it is 32768 in both the x and y directions, so that would get you quite a way.
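
One way to spread a block count over two grid dimensions can be sketched as follows. This is illustrative host-side C only: `Grid2D`, `make_grid` and `element_index` are hypothetical names, and the per-dimension limit is passed in as a parameter because the right value should come from the device properties rather than from memory.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned x, y; } Grid2D;

/* Picks the smallest number of rows that keeps each dimension within maxDim,
   then rounds the row length up so that x * y >= numBlocks. */
static Grid2D make_grid(size_t numBlocks, unsigned maxDim)
{
    Grid2D g;
    assert(maxDim > 0);
    assert(numBlocks <= (size_t)maxDim * maxDim);
    g.y = (unsigned)((numBlocks + maxDim - 1) / maxDim);
    if (g.y == 0) g.y = 1;
    g.x = (unsigned)((numBlocks + g.y - 1) / g.y);
    return g;
}

/* Flattens a 2D block index plus a thread index into a stream element index,
   mirroring what a kernel would compute from blockIdx, gridDim and threadIdx. */
static size_t element_index(unsigned bx, unsigned by, unsigned gridDimX,
                            unsigned blockSize, unsigned tx)
{
    size_t block = (size_t)by * gridDimX + bx;
    return block * blockSize + tx;
}
```

Because the row length is rounded up, the grid can contain a few more blocks than needed, so the kernel has to compare the flattened index against the element count and return early for the surplus threads.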

I changed the grid from 1D to 2D, with some improvement. It looks like I can at best double the size of the stream that way, to about 2/3 of global memory. I set Dg.y = 2 to get that result, but after that, making Dg.y bigger doesn’t let me make my stream bigger. I am basically stuck at 33554432 stream elements. Any ideas?

                /* integer division: rounds down when numBlocks is not a multiple of 16 */
                Dg.x = numBlocks / 16;

                Dg.y = 16;

                result = (float4*)calloc(numElements, sizeof(float4));

                if(cudaMalloc((void**)&devMem, numElements * sizeof(float4)) != cudaSuccess){

                        printf("error: could not cudaMalloc\n");

                }

                init<<<Dg, BLOCK_SIZE>>>(devMem);

                calcPi<<<Dg, BLOCK_SIZE>>>(devMem);

33547776, 536764416, 66.713163, 3.141593, 0.848888, 0.000006

33613312, 537812992, 66.843488, 0.000000, 0.847316, 0.000006
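
One detail in the 2D launch is worth checking with the numbers above: `Dg.x = numBlocks / 16` rounds down. The sketch below is host-side C with hypothetical helper names, and it assumes the kernels use a 1:1 thread-to-element mapping, which the thread does not show.

```c
#include <assert.h>
#include <stddef.h>

/* Threads actually launched when the grid width is numBlocks / gridY, rounded down. */
static size_t threads_launched_truncating(size_t numBlocks, size_t gridY, size_t blockSize)
{
    assert(gridY > 0);
    return (numBlocks / gridY) * gridY * blockSize;
}

/* Grid width rounded up instead, so that width * gridY >= numBlocks. */
static size_t grid_x_rounded_up(size_t numBlocks, size_t gridY)
{
    assert(gridY > 0);
    return (numBlocks + gridY - 1) / gridY;
}
```

For the 33547776-element run (131046 blocks of 256 threads) the truncating form launches 8190 x 16 blocks, which is 33546240 threads, so under a 1:1 mapping the last 1536 elements are never touched. Rounding up gives a width of 8191, at the price of a bounds check in the kernel for the surplus threads. The failing 33613312-element run loses the same 1536 elements, so this is probably not what produces the zero result, but it is worth ruling out.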