Hi. I have been developing a computational library for Windows (it uses CUDA) under VS2010, with the CUDA runtime API.

I have run into an unexpected problem when processing large arrays. Here is the code of a simple long-running test that I use to illustrate the problem.

```
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <math.h>
#include <stdio.h>

const int blocksize = 16;

__global__ void performOperation(float* A, int colsA, int rowsA,
                                 float4* B, int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= colsA || j >= rowsA)
        return;
    int inx = i + j * colsA;
    for (int k = 0; k < lenB; k++) {
        A[inx] += B[k].x + B[k].y + B[k].z;
    }
}

__global__ void fillA(float* A, int colsA, int rowsA) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= colsA || j >= rowsA)
        return;
    A[i + j * colsA] = 0.0f;
}

__global__ void fillB(float4* B, int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= lenB)
        return;
    B[i].x = 1.0f;
    B[i].y = 1.0f;
    B[i].z = 1.0f;
}

int main(void) {
    int colsA = 1000;
    int rowsA = 1000;
    size_t sizeA = colsA * rowsA * sizeof(float);
    float* dvA;
    cudaMalloc((void**) &dvA, sizeA);

    int lenB = 20000;
    size_t sizeB = lenB * sizeof(float4);
    float4* dvB;
    cudaMalloc((void**) &dvB, sizeB);

    dim3 dimBlock1(blocksize, blocksize);
    dim3 dimGrid1(ceil(float(colsA) / float(dimBlock1.x)),
                  ceil(float(rowsA) / float(dimBlock1.y)));
    fillA<<<dimGrid1, dimBlock1>>>(dvA, colsA, rowsA);
    cudaThreadSynchronize();

    dim3 dimBlock2(blocksize * blocksize);
    dim3 dimGrid2(ceil(float(lenB) / float(dimBlock2.x)));
    fillB<<<dimGrid2, dimBlock2>>>(dvB, lenB);
    cudaThreadSynchronize();

    performOperation<<<dimGrid1, dimBlock1>>>(dvA, colsA, rowsA, dvB, lenB);
    cudaThreadSynchronize();

    float* a = new float[colsA * rowsA];
    cudaMemcpy(a, dvA, sizeA, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 100; i++) {
        printf("%i %f \n", i, a[i]);
    }

    delete[] a;
    cudaFree(dvA);
    cudaFree(dvB);
    return 0;
}
```

The task is to process a two-dimensional float array (cols × rows) in dependence on a linear float4 array (len). The length of the float4 array must be between 20000 and 100000 elements; this is an essential condition. The size of the float array must be approximately 1500 × 1500.

The maximum size of the float4 array is 100000 × 16 bytes ≈ 1.5 MB.

The maximum size of the float array is 1500 × 1500 × 4 bytes ≈ 8.6 MB.

In order to estimate CUDA's capabilities, I wrote the program shown in the code above.

colsA & rowsA - the dimensions of the two-dimensional float array

dvA - device representation of the two-dimensional float array

lenB - the length of the float4 array

dvB - device representation of the float4 array

The critical section of the code is in the __global__ function performOperation:

```
__global__ void performOperation(float* A, int colsA, int rowsA,
                                 float4* B, int lenB) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= colsA || j >= rowsA)
        return;
    int inx = i + j * colsA;
    for (int k = 0; k < lenB; k++) {
        A[inx] += B[k].x + B[k].y + B[k].z;
    }
}
```

The idea is that I have to use all the values of array B in order to compute the correct value of each element of array A.

My system crashes when I try to run the program with the indicated parameters (1500 × 1500 × 20000). I tested this code on a GeForce 9800 GT 1 GB with driver v306.94, and also under Ubuntu 12.04 Linux on the same computer. The program stops crashing when I significantly reduce the array sizes! I can't understand what is wrong.

I'll be very thankful to anyone who can point me in the right direction; I have no idea what to do. I can only add that I also tried an alternative version of the code that uses 4 linear float arrays instead of one float4 array; of course, the result is the same.