LARGE 2D arrays

hi!

quick question: why is my kernel running correctly on a 32x1000 matrix, but won't run on a 32x1000000 matrix?

does it have something to do with the hardware limitations?

It might be; how big is the matrix in total? And what are your kernel invocation dimensions - i.e., are you scaling your thread and block sizes to the matrix dimensions? And when you say it doesn't run, does that mean it produces incorrect results, or crashes outright?

total size is 32x1000000
I'm allocating it with cudaMallocPitch and copying it to the device with cudaMemcpy2D (if that info helps)
one of the kernels runs smoothly, producing the needed result, with kernel invocation dimensions <<<height, width>>> (10^6 and 32 respectively)

the second kernel is really buggy for some reason - I'm calling it with the same dimensions as my first kernel, but it won't crash or produce any result. In fact, when I ran it through the debugger with a breakpoint at my second kernel, cuda-gdb wouldn't even break. So I'm assuming the result is garbage - since nothing is being produced, it simply skips it.

one more detail - when I run the second kernel on a height of 10^4 it produces roughly the needed result, but really slowly compared to the first kernel.

any help regarding this issue is appreciated! If you want to check out my kernel code, I can post it, but I doubt the problem lies within the kernel code.

I believe what alrikai meant by the first question is: how many bytes of data are you allocating?

-Jeremiah

Yeah, I was more wondering about the # of bytes, but your problem is in how you're invoking your kernel. You say you're using "kernel invocation dimensions <<<height, width>>> (10^6 and 32 respectively)", but the values between the <<<…>>> don't represent height and width; rather, they are the grid and block dimensions to use for the kernel. Specifically, they are the x, y, and z sizes of the grid and block. In your case, you're putting 10^6x1x1 as the grid's dimension (meaning 10^6 blocks in the x-dimension), and 32x1x1 as the block's dimension (meaning 32 threads per block in the x-dimension).
I have a GTX 580, and from running the SDK "Device Query" program, my card's max grid dimensions are 65535 x 65535 x 1. Odds are, your card isn't too happy about being asked to have 10^6 blocks in the grid's x-dimension.
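If it helps, here's a minimal host-side sketch (plain C++, no CUDA calls; grid_for is a hypothetical helper name, not part of the CUDA API) of one common workaround for the 65535-blocks-per-dimension cap: fold the required block count into the grid's x and y dimensions, then recover a linear block id inside the kernel.

```cpp
// Fold a large block count into a 2D grid so neither dimension
// exceeds the per-dimension limit (65535 on this class of card).
struct Dim2 { unsigned x, y; };

Dim2 grid_for(unsigned blocksNeeded, unsigned maxDim = 65535) {
    Dim2 g;
    g.x = blocksNeeded < maxDim ? blocksNeeded : maxDim;
    g.y = (blocksNeeded + g.x - 1) / g.x;   // ceiling division
    return g;
}

// Inside the kernel you would then recompute a linear block id as
//   unsigned block = blockIdx.y * gridDim.x + blockIdx.x;
// and return early when block >= blocksNeeded.
```

For 10^6 row-blocks this yields a 65535 x 16 grid, which covers all 10^6 blocks while staying within the per-dimension limit.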

I guess that was the issue, although the compute capability supported by my GPU is 1.2, and as far as I know, judging by the info provided by deviceProp, I am not exceeding the limitations.

I'm allocating the matrix with W*H*sizeof(int) [if that helps]

Do you have error checking in your code? Did you check for errors right after the second kernel launch? One reason why cuda-gdb did not break on the kernel could be that the second kernel did not even launch.

Oh no, I didn't mean you were running out of device memory, but that your kernel invocation's thread and block dimensions are invalid. You mentioned earlier that your kernel was called as <<<height, width>>>, where height and width were those of your matrix; that will give you trouble once those values exceed 65535 and 1024 respectively (this can vary depending on your card). Perhaps it would be more illustrative if you posted the relevant code.

OK, here's my code:

(I've reworked it, and now it has even lower bounds in order to produce the correct result.)

#define W 8
#define H 512

/* Kernel base taken from the NVIDIA website;
   the only thing added is the ability to pull
   rows out of the 2D array and treat them as 1D arrays.
   Two problems arise here for me, besides the fact
   that it won't work on large 2D arrays:
   1) its limitation on the width of the array
      (i.e. aside from the powers of two, it works
      correctly only with the width set to 8);
   2) even if I keep the width at 8, the height is
      an issue too - i.e. trial and error showed the
      limit to be 775. Why? I have no idea. */

__global__ void action ( int *input, int *output, unsigned int n )
{
 for (int x = 0; x < H; x++)
 {
  int *row = (int*)((char**)input + x * n/2); // pull out a row from the 2D array
  __shared__ int s_data[W];                   // put it in shared mem

  unsigned int tid = threadIdx.x;
  unsigned int i = (blockIdx.x * blockDim.x) + tid;
  s_data[tid] = (i < n) ? row[i] : 0;
  __syncthreads();

  for (int s = blockDim.x/2; s > 0; s >>= 1)
  {
   if (tid < s)
   {
    s_data[tid] += s_data[tid + s];
    __syncthreads();
   }
   __syncthreads();
  }

  if (tid == 0)
  {
   output[x] = s_data[0];
  }
 }
}

int main ()
{
 int *dev_a, *dev_b;
 int input[H][W], output[H];

 /* populate the input with random numbers */

 dim3 dimBlock(W);
 dim3 dimGrid(H / dimBlock.x);

 /* allocate the memory on the GPU using cudaMalloc */
 /* copy the input 2D array to dev_a using a regular cudaMemcpy */

 action<<<dimGrid, dimBlock>>>(dev_a, dev_b, W);

 /* copy from dev_b and store it in the output array */
 /* free dev_a and dev_b */

 return 0;
}

I had to rework the previous version, which was behaving unexpectedly,

and now I'm struggling with this reworked and revised version.

Please tell me: what am I doing wrong?

Aside from this issue, I had other programs that work perfectly fine on toy arrays of small sizes.

As soon as I try to experiment with larger arrays, modifying their width (up to 512 - the max threads per block supported by my hardware) or height (up to and above 10^6), they stop behaving properly and produce all-'0' results.

P.S. Once more, sorry for my newbieness - I was taught to think sequentially, and for that reason I'm having trouble understanding the idea behind parallel programming; perhaps that's the root of most of the problems in my code.

Looking at your code, I notice:

#define W 8
#define H 512

dim3 dimBlock(W);
dim3 dimGrid(H / dimBlock.x);

This will create dimBlock and dimGrid with the following values:

dimBlock.x = 8; dimBlock.y = 1; dimBlock.z = 1; // 8 threads per block
dimGrid.x = 64; dimGrid.y = 1; dimGrid.z = 1;   // 64 blocks in the grid

NB: dimBlock is passed to the kernel, so these values will be in blockDim; similarly for dimGrid.

(Ideally the number of threads per block should be a multiple of 32, but it will still run with 8.)

As for the largest grid size, dim3 dimGrid(65535, 65535, 1); would work, i.e. 65535*65535 blocks. But don't try it just yet.

You have a problem here:

for (int s = blockDim.x/2; s > 0; s >>= 1)  // this becomes for (int s = 4; s > 0; s >>= 1)
{
 if (tid < s)
 {
  s_data[tid] += s_data[tid + s];
  __syncthreads();
 }
 __syncthreads();
}

Check what the manual says about __syncthreads(): it says something along the lines that it shouldn't be used where some threads will execute it and others won't.

However, the for loop becomes for (int s = 4; s > 0; s >>= 1),

so the if test becomes if (tid < 4) on the first iteration, and only half the threads will execute the first __syncthreads(); according to the manual, that will cause problems. Remove that __syncthreads(), as the very next instruction is a __syncthreads() anyway.
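To sanity-check the loop itself, here's a tiny host-side simulation (plain C++; reduce8 is just an illustrative name) of the corrected reduction for one block of W = 8 threads - each pass of the inner loop stands in for what the active threads do between barriers:

```cpp
// Host-side simulation of the tree reduction for one block of 8 threads,
// with the divergent __syncthreads() removed. On the GPU, all threads with
// tid < s run the addition in parallel, then every thread hits one barrier.
int reduce8(int s_data[8]) {
    for (int s = 8 / 2; s > 0; s >>= 1)       // s = 4, 2, 1
        for (int tid = 0; tid < s; tid++)     // only threads with tid < s are active
            s_data[tid] += s_data[tid + s];   // pairwise partial sums
    return s_data[0];                         // thread 0 holds the block's total
}
```

With s_data = {1,2,3,4,5,6,7,8} this returns 36, the sum of the row, which is what the kernel should write into output[x].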

The reason it was working correctly when you reduced the number of threads per block to 8 is that your GPU has 8 arithmetic units per SM, so 8 threads actually execute the same instruction in the same clock cycle (same instruction, different data).

See Figure 1-2 in the NVIDIA CUDA C Programming Guide Version 4, which compares a 4-core CPU with an 8-SM GPU. (NB: this diagram is for a later GPU than yours, as it has 16 ALUs per SM.)

The way you are pulling the row out is not good for a GPU. The code has every thread in the block making its own copy of the row pointer.

On a GPU you want each thread to copy 1 value from the input to the shared array, so that all N threads in the block, working together, copy N values from the input to the shared array. You've got to think of the threads as working as a team. You don't need row; just copy directly from input to s_data.
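A host-side sketch of that team-copy pattern (plain C++; block_load is an illustrative name, and the loop over tid stands in for the threads of one block running in parallel):

```cpp
// Each of the blockDim threads in block blockIdx copies exactly one element
// from the input row into shared memory; out-of-range threads store 0.
void block_load(const int* row, int* s_data,
                int blockIdx, int blockDim, unsigned n) {
    for (int tid = 0; tid < blockDim; tid++) {   // these run in parallel on the GPU
        unsigned i = blockIdx * blockDim + tid;  // one global index per thread
        s_data[tid] = (i < n) ? row[i] : 0;      // one load per thread
    }
}
```

Note that this is the same indexing the posted kernel already uses for its s_data load; the point is one element per thread, with no per-thread row pointer - the thread can index input directly.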

Hope that helps,

Cheers,

kbam

guys, thank you for your prompt replies!
@kbam: that kinda cleared some stuff up for me! Indeed, the first __syncthreads() is not needed;
but unfortunately that's not the issue - my test program still produces a 2D array of 0's when I expand the size (height or width). In fact, imho the "pulling out of the rows" might be producing the error…
Oh well - I'll put that example away for a while, since I don't need it in my program at the moment; but I'm looking forward to solving it.