2336 * 1752 * 16 * 4 = 261,931,008. On a compute capability 2.x NVIDIA GPU, the maximum number of threads in a block is 1024; on a 1.x GPU, it's 512. Even with 1024-thread blocks you would need 255,792 blocks, which is more than the 65,535 allowed per grid dimension on those devices, so a one-dimensional grid cannot represent this space. (In C++ AMP it would be possible, and the whole code, including the reduction, is much easier to understand.)

So, I would represent this with a 3D grid, mapping (i, j) to (x, y) of the grid and (l, m) to z. The code to spawn the kernel over this space would look like the code shown below. (It just illustrates how to launch the kernel and how to recover the indices from the kernel's index space.)

```
#include <iostream>
#include <vector>
const int max_i = 2336;
const int max_j = 1752;
const int max_m = 16;
const int max_l = 4;
__global__ void kernel(bool * found, int total)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    // blockDim.z is 1, so the z coordinate of the grid enumerates
    // the max_m * max_l combinations of (l, m).
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    int l = z % max_l;
    int m = z / max_l;
    // Check out of bounds.
    if (i >= max_i)
        return;
    if (j >= max_j)
        return;
    if (l >= max_l)
        return;
    if (m >= max_m)
        return;
    int p = i * max_j * max_m * max_l
        + j * max_m * max_l
        + m * max_l
        + l;
    if (p < total)
        found[p] = true;
}
int main()
{
    int total = max_i * max_j * max_m * max_l;
    bool * found = (bool *) malloc(sizeof(bool) * total);
    std::cout << "total = " << total << "\n";
    bool * d_found = 0;
    cudaError_t e2 = cudaMalloc(&d_found, sizeof(bool) * total);
    if (d_found == 0)
        return 1;
    if (e2 != cudaSuccess)
        return 1;
    dim3 threads(16, 16);
    // Round the grid size up to a multiple of the block size.
    int dim_x = (max_i >> 4) + ((max_i & 15) ? 1 : 0);
    int dim_y = (max_j >> 4) + ((max_j & 15) ? 1 : 0);
    int dim_z = max_m * max_l;
    dim3 grid(dim_x, dim_y, dim_z);
    kernel<<<grid, threads>>>(d_found, total);
    cudaDeviceSynchronize();
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess)
        std::cout << cudaGetErrorString(e) << "\n";
    // Spot-check the first 256 entries.
    cudaMemcpy(found, d_found, 256, cudaMemcpyDeviceToHost);
    for (int d = 0; d < 256; ++d)
        std::cout << "d = " << d << " " << found[d] << "\n";
    std::cout << "\n";
    cudaFree(d_found);
    free(found);
    return 0;
}
```

Note the bounds checking in the kernel. The grid has to be a multiple of the block size (that is what all the ">> 4" code does: it rounds up). Consequently, you have to include bounds checks to make sure only the proper threads access the data. There will be some unused threads, but that shouldn't be a problem. BTW, the kernel has early returns. That isn't exactly OK, especially in C++ AMP and probably in OpenCL; there they should be rewritten without the "return;". Something to get used to.

Now for the m_coeff array… Since each float is 32 bits (4 bytes), you would have to allocate 4 * 261,931,008 bytes, roughly 1 GB, on the GPU, so a cudaMalloc large enough to represent m_coeff over all (i, j, l, m) would probably fail. Consequently, you may have to do "strip mining", i.e., do more work per thread and represent partial sums with a smaller array. In that case, "l" could be mapped to z, and the kernel would do a for-loop over "m". If that is still too big a space, you could map "m" to z and do a for-loop over "l". As I said before, there are many ways to do this. (BTW, always check return codes from CUDA runtime calls.)
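A sketch of the strip-mined variant, with "l" mapped to z and the for-loop over "m" inside the kernel. The coeff() device function is a stand-in for however you actually compute m_coeff(i, j, l, m); the point is that the full array never has to exist in GPU memory, and the partial-sum array is a factor of max_m smaller (about 65 MB of floats here instead of 1 GB):

```
// Hypothetical device function computing m_coeff(i, j, l, m) on the fly.
__device__ float coeff(int i, int j, int l, int m);

// Each thread handles one (i, j, l) and loops over all m,
// writing one entry of the smaller partial-sum array.
__global__ void kernel_strip(float * partial)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int l = threadIdx.z + blockIdx.z * blockDim.z;
    if (i >= max_i || j >= max_j || l >= max_l)
        return;
    float sum = 0.0f;
    for (int m = 0; m < max_m; ++m)
        sum += coeff(i, j, l, m);
    partial[i * max_j * max_l + j * max_l + l] = sum;
}
```

You would launch this with a grid whose z dimension is only max_l (4), e.g. dim3 threads(16, 16, 1) and dim3 grid(dim_x, dim_y, max_l), with dim_x and dim_y rounded up as before.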

Ken