I have this device configuration:
There is 1 device supporting CUDA
Device 0: "GeForce 8600M GS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268435456 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.00 GHz
Concurrent copy and execution: No
Test PASSED
After running this program:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cutil.h>

__global__ void F(int n, float *a)
{
    int index = gridDim.x * blockIdx.x + threadIdx.x;
    if (index < n) a[index] = 7;
}

int main()
{
    float *a_h, *a_d;
    int n = (1 << 25);
    printf("Running F over %d elements\n", n);
    int size = n * sizeof(float);
    printf("Allocating host memory\n");
    a_h = (float*)malloc(size);
    printf("Success\n");
    printf("Allocating device memory\n");
    cudaMalloc((void**)&a_d, size);
    printf("Success\n");
    printf("Filling host memory\n");
    for (int i = 0; i < n; i++) a_h[i] = 1;
    printf("Success\n");
    printf("Copying data to device memory\n");
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    printf("Success\n");
    F<<<46702, 512>>>(n, a_d);
    printf("Success\n");
    printf("Copying data to host memory\n");
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("Success\n");
    for (int i = 0; i < 18; i++) printf("%d %f\n", i, a_h[i]);
    printf("\n\n");
    for (int i = n - 18; i < n; i++) printf("%d %f\n", i, a_h[i]);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
I get this result:
Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 7.000000
1 7.000000
2 7.000000
3 7.000000
4 7.000000
5 7.000000
6 7.000000
7 7.000000
8 7.000000
9 7.000000
10 7.000000
11 7.000000
12 7.000000
13 7.000000
14 7.000000
15 7.000000
16 7.000000
17 7.000000
33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000
But when I change the kernel configuration from "F<<<46702, 512>>>(n, a_d)" to "F<<<46703, 512>>>(n, a_d)", I get this result:
Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000
33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000
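A quick host-side sanity check of the launch sizes may help here. This is only a sketch: it assumes one thread per array element and the conventional 1D flattening blockDim.x * blockIdx.x + threadIdx.x (the kernel above multiplies by gridDim.x instead), and the helper names blocks_needed and threads_launched are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: number of blocks needed so that
 * blocks * threads_per_block >= n (ceiling division). */
static uint64_t blocks_needed(uint64_t n, uint64_t threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Hypothetical helper: total number of threads created by a 1D launch
 * of `blocks` blocks with `threads_per_block` threads each. */
static uint64_t threads_launched(uint64_t blocks, uint64_t threads_per_block)
{
    return blocks * threads_per_block;
}
```

With n = 1 << 25 and 512 threads per block, blocks_needed returns 65536, one more than the 65535 limit reported for a grid dimension. threads_launched(46702, 512) returns 23911424, which is fewer than n, so under these assumptions a 1D launch of that size cannot reach the tail of the array.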
It looks like the device refuses to work with this kernel configuration and returns the argument array untouched. But judging by the device properties, it should be able to handle as many as 65535 blocks. What might be causing this?
By the way, I've tried changing the kernel function to:
__global__ void F(int n, float *a)
{
    int index = (gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;
    if (index < n) a[index] = 7;
}
And kernel configuration to:
dim3 grid_configuration(512,512,1);
dim3 block_configuration(512,1,1);
F<<<grid_configuration, block_configuration>>>(n, a_d);
And it works well and fills all the elements of the array. It seems this call should be able to handle 512 * 512 * 512 = 134,217,728 array elements, but when I set "n = 1 << 26" it gives the following result:
Running F over 67108864 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000
67108846 1.000000
67108847 1.000000
67108848 1.000000
67108849 1.000000
67108850 1.000000
67108851 1.000000
67108852 1.000000
67108853 1.000000
67108854 1.000000
67108855 1.000000
67108856 1.000000
67108857 1.000000
67108858 1.000000
67108859 1.000000
67108860 1.000000
67108861 1.000000
67108862 1.000000
67108863 1.000000
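The flattened index computed by the two-dimensional kernel above can be reproduced on the host to see which range of elements a given launch covers. This is a rough sketch, not the device code itself: flat_index is a hypothetical helper, the built-in variables are passed in as plain integers, and the arithmetic is done in 64 bits, whereas the kernel stores the result in an int.

```c
#include <stdint.h>

/* Hypothetical host-side mirror of the kernel's index expression:
 * (gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x * blockDim.y
 *     + blockDim.x * threadIdx.y + threadIdx.x */
static uint64_t flat_index(uint64_t grid_dim_x,
                           uint64_t block_dim_x, uint64_t block_dim_y,
                           uint64_t block_idx_x, uint64_t block_idx_y,
                           uint64_t thread_idx_x, uint64_t thread_idx_y)
{
    uint64_t block_id = grid_dim_x * block_idx_y + block_idx_x;   /* linear block number */
    uint64_t threads_per_block = block_dim_x * block_dim_y;       /* threads in one block */
    uint64_t thread_id = block_dim_x * thread_idx_y + thread_idx_x; /* linear thread number */
    return block_id * threads_per_block + thread_id;
}
```

For a 512 x 512 grid of 512 x 1 blocks, the last thread of the last block (blockIdx = (511, 511), threadIdx = (511, 0)) maps to index 134217727, so the index range itself is large enough for n = 1 << 26.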
And when I try to find the maximum array size the device can work with and set "n = 55650000", the call to "cudaMalloc((void**)&a_d, size)" throws an exception!
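For reference, the allocation sizes involved can be compared with the 268435456 bytes of global memory listed in the device properties. The sketch below assumes a 4-byte float, and float_array_bytes is a made-up helper name.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed for an array of n floats,
 * computed in 64 bits so the product cannot wrap around. */
static uint64_t float_array_bytes(uint64_t n)
{
    return n * (uint64_t)sizeof(float);
}
```

float_array_bytes(1 << 26) is 268435456, exactly the total global memory reported for the device, and float_array_bytes(55650000) is 222600000. Presumably not all of the reported memory is free for a single allocation (memory taken by the display and the driver is my assumption here), which could explain why cudaMalloc fails at these sizes.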