Kernel configuration and maximum array size problem.

I have this device configuration

There is 1 device supporting CUDA

Device 0: "GeForce 8600M GS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268435456 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.00 GHz
Concurrent copy and execution: No

Test PASSED

After running this program:

#include <stdio.h>

#include <cuda.h>
#include <cutil.h>

__global__ void F(int n, float *a)
{
    int index = gridDim.x * blockIdx.x + threadIdx.x;
    if (index < n) a[index] = 7;
}

int main()
{
    float *a_h, *a_d;
    int n = (1 << 25);
    printf("Running F over %d elements\n", n);
    int size = n * sizeof(float);
    printf("Allocating host memory\n");
    a_h = (float *)malloc(size);
    printf("Success\n");
    printf("Allocating device memory\n");
    cudaMalloc((void **)&a_d, size);
    printf("Success\n");
    printf("Filling host memory\n");
    for (int i = 0; i < n; i++) a_h[i] = 1;
    printf("Success\n");
    printf("Copying data to device memory\n");
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    printf("Success\n");
    F<<<46702, 512>>>(n, a_d);
    printf("Success\n");
    printf("Copying data to host memory\n");
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("Success\n");
    for (int i = 0; i < 18; i++) printf("%d %f\n", i, a_h[i]);
    printf("\n\n");
    for (int i = n - 18; i < n; i++) printf("%d %f\n", i, a_h[i]);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
I get this result:

Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 7.000000
1 7.000000
2 7.000000
3 7.000000
4 7.000000
5 7.000000
6 7.000000
7 7.000000
8 7.000000
9 7.000000
10 7.000000
11 7.000000
12 7.000000
13 7.000000
14 7.000000
15 7.000000
16 7.000000
17 7.000000

33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000

but when I change the kernel configuration from "F<<<46702, 512>>>(n, a_d)" to "F<<<46703, 512>>>(n, a_d)" I get this result:

Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000

33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000

It looks like the device refuses to work with such a kernel configuration and returns the argument array untouched. But looking at the device properties, it seems it should be able to handle at least 65535 blocks. What might be causing these problems?

By the way, I've tried changing the kernel function to:

__global__ void F(int n, float *a)
{
    int index = (gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;
    if (index < n) a[index] = 7;
}

And kernel configuration to:

dim3 grid_configuration(512, 512, 1);
dim3 block_configuration(512, 1, 1);
F<<<grid_configuration, block_configuration>>>(n, a_d);

And it works well and fills all the elements of the array. It seems this call can handle up to 512*512*512 = 134217728 array elements, but when I set n = 1 << 26 it gives the following result:

Running F over 67108864 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000

67108846 1.000000
67108847 1.000000
67108848 1.000000
67108849 1.000000
67108850 1.000000
67108851 1.000000
67108852 1.000000
67108853 1.000000
67108854 1.000000
67108855 1.000000
67108856 1.000000
67108857 1.000000
67108858 1.000000
67108859 1.000000
67108860 1.000000
67108861 1.000000
67108862 1.000000
67108863 1.000000

And when I try to find the maximal array size the device can work with, setting n = 55650000, the call to cudaMalloc((void **)&a_d, size) throws an exception!

Can’t see anything obviously wrong. You don’t seem to be doing any error checking - you could try doing that.
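For what it's worth, one lightweight way to do that with the CUDA 2.x runtime is to query cudaGetLastError() after a synchronization point. A minimal sketch (the CUDA_CHECK macro name is made up here; cudaThreadSynchronize() is the CUDA 2.x name, later renamed cudaDeviceSynchronize()):

```cuda
// Hedged sketch of post-launch error checking for the CUDA 2.x runtime.
#define CUDA_CHECK(msg)                                              \
    do {                                                             \
        cudaError_t err = cudaGetLastError();                        \
        if (err != cudaSuccess) {                                    \
            printf("%s failed: %s\n", msg, cudaGetErrorString(err)); \
            exit(1);                                                 \
        }                                                            \
    } while (0)

    F<<<46703, 512>>>(n, a_d);
    cudaThreadSynchronize();  // kernel launches are async; sync before querying
    CUDA_CHECK("kernel F");
```

With something like this in place, a failed launch prints a message instead of silently leaving the array untouched behind unconditional "Success" printfs.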

EDIT: What CUDA version are you on?

I'm using version 2.0. Were you trying to launch the code above on your machine?

This is the function I use for calculating the index, compare with yours:

__device__ inline int getutid()
{
    int threadsPerBlock = blockDim.x * blockDim.y;
    int tidWithinBlock = threadIdx.x + threadIdx.y * blockDim.x;
    int gid = blockIdx.x + blockIdx.y * gridDim.x;
    return gid * threadsPerBlock + tidWithinBlock;
}

I’m looking into your code…

If it throws an exception and each element is 4 bytes, that's your memory limit, excluding the framebuffer etc. What's wrong with that?

55650000 * 4 = 222600000

I agree… That might be the issue…

But why can't I launch

__global__ void F(int n, float *a)
{
    int index = gridDim.x * blockIdx.x + threadIdx.x;
    if (index < n) a[index] = 7;
}

with kernel configuration F<<<46703, 512>>>(n, a_d)?

In my case the array size is n = 1 << 25 (n == 33554432, so 33554432 * 4 = 134217728 bytes).

Memory allocation is successful, and a call to F<<<46702, 512>>>(n, a_d) [which is enough only for 46702*512 = 23911424 elements] also succeeds, but when I set the block count above 46702, for example F<<<46703, 512>>>(n, a_d), it stops working, as if it had never been launched.

My system is:

--- CUDA : GeForce 8800 GTX
ComputeCapability = 1.0
totalGlobalMem = 805044224
totalConstMem = 65536
sharedMemPerBlock = 16384
regsPerBlock = 8192
SIMDWidth = 32
maxThreadsPerBlock = 512
maxThreadsDim = ( 512 , 512 , 64 )
maxGridSize = ( 65535 , 65535 , 1 )
clockRate = 1404000
memPitch = 262144
textureAlignment = 256
---------------------------------------------
--- CUDA : GeForce 8800 GT
ComputeCapability = 1.1
totalGlobalMem = 536150016
totalConstMem = 65536
sharedMemPerBlock = 16384
regsPerBlock = 8192
SIMDWidth = 32
maxThreadsPerBlock = 512
maxThreadsDim = ( 512 , 512 , 64 )
maxGridSize = ( 65535 , 65535 , 1 )
clockRate = 1512000
memPitch = 262144
textureAlignment = 256
---------------------------------------------

It uses the first card.

I tried your code with the 46703 configuration, and it flashes the screen and prints all ones. The kernel did not actually do anything because there was an error somewhere. Typing dmesg shows:

[ 1186.595189] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1191.563998] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1192.884388] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1195.156270] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1220.796331] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1224.506989] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100

Maybe submit a bug report. I am running 180.22. Can you try with your driver, maybe on a CUDA-only card (no X display)?

Read what NVIDIA said about the NVRM Xid messages:

http://www.nvnews.net/vbulletin/showthread.php?t=91256

Compile with -deviceemu, run it through valgrind, and look for out-of-bounds errors.

Here's the problem, Kex: index overflows. The calculation for index is wrong:

9          int index = gridDim.x * blockIdx.x + threadIdx.x;
(gdb) p gridDim.x
$1 = 46702
(gdb) p blockIdx.x
$2 = 45983
(gdb) p blockIdx.y
$3 = 0
10         if (index < n) a[index] = 7;
(gdb) p index
$1 = -2147469230

This should almost certainly be:

int index = blockDim.x * blockIdx.x + threadIdx.x;