Let's say I am using two kernels: the first one generates a matrix M, and the second one consumes the generated matrix. My code looks roughly like the following.

```
cudaMalloc(M);
// Generate matrix M
Generate_Matrix<<<...>>>(M);
// Use matrix M to calculate the result
Consume_Matrix<<<...>>>(M);
// Copy results from device to host and print
// Free device memory
```

The problem is inside the Consume_Matrix kernel, which just runs a simple for loop to read the matrix rows and sum them up.

```
for (int k = 0; k < COL; k++)
{
    result += M[i*COL + k];
}
```
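To make the pattern concrete, here is a minimal self-contained sketch of this kind of row sum. It is not my actual code: it assumes `float` elements and one thread per row, and the names `consume_matrix`, `row_sum_demo`, the row count of 64, and the block size of 128 are placeholders chosen for illustration. The matrix is filled with 1.0f on the host instead of by a first kernel, so each row sum should equal the column count.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread i sums the `cols` entries of row i.
__global__ void consume_matrix(const float* M, float* result, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;                      // guard threads past the last row

    float sum = 0.0f;
    for (int k = 0; k < cols; k++)
        sum += M[(size_t)i * cols + k];         // row-major indexing, as in the loop above
    result[i] = sum;
}

// Fills a rows x cols matrix with 1.0f, runs the kernel, and returns the sum
// of row 0. Returns -1.0f if any CUDA call reports an error.
float row_sum_demo(int rows, int cols)
{
    size_t n = (size_t)rows * cols;
    std::vector<float> h_M(n, 1.0f);
    std::vector<float> h_result(rows, 0.0f);

    float* d_M = nullptr;
    float* d_result = nullptr;
    if (cudaMalloc(&d_M, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_result, rows * sizeof(float)) != cudaSuccess) {
        cudaFree(d_M);
        return -1.0f;
    }
    cudaMemcpy(d_M, h_M.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;                          // assumed block size
    int blocks = (rows + threads - 1) / threads;
    consume_matrix<<<blocks, threads>>>(d_M, d_result, rows, cols);

    // The blocking copy also surfaces errors raised while the kernel ran.
    cudaError_t err = cudaMemcpy(h_result.data(), d_result,
                                 rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_result);
    return err == cudaSuccess ? h_result[0] : -1.0f;
}

int main()
{
    printf("COL = 5000: %f\n", row_sum_demo(64, 5000));
    printf("COL = 9000: %f\n", row_sum_demo(64, 9000));
    return 0;
}
```

Running the same sketch with both column counts makes it easy to compare the two cases side by side on the same device.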

Here is the issue: I get the expected result for COL = 5000, but result = 0 for COL = 9000.

I made sure that matrix M is populated correctly after the first kernel, for any COL value. The problem is in the second kernel call, which is unable to perform a simple addition.

I checked for errors, but there were none. It is also not a kernel synchronization problem, as I am using a GTX 280.
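For clarity, this is a sketch of the kind of check I mean, under the assumption that both the launch and the kernel's execution are checked. The helper `kernel_ok` and the placeholder kernel `touch` are hypothetical names for illustration. `cudaDeviceSynchronize` is the current runtime call; older toolkits used `cudaThreadSynchronize` for the same purpose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports errors from the most recent kernel launch.
static bool kernel_ok(const char* label)
{
    cudaError_t err = cudaGetLastError();       // launch-time errors (e.g. bad configuration)
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();          // errors raised while the kernel was running
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for Generate_Matrix / Consume_Matrix.
__global__ void touch(int* p)
{
    *p = 1;
}

int main()
{
    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return 1;

    touch<<<1, 1>>>(d);
    bool ok = kernel_ok("touch");               // check right after each launch

    cudaFree(d);
    return ok ? 0 : 1;
}
```

Kernel launches are asynchronous, so a check that only looks at the launch itself can miss a failure that happens later during execution; that is why the sketch synchronizes before reporting success.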

Can someone please help me here? I am going mad. What did I miss? It feels like something is terribly wrong with my understanding.