Kernel configuration and maximum array size problem.

I have this device configuration

There is 1 device supporting CUDA

Device 0: "GeForce 8600M GS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268435456 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.00 GHz
Concurrent copy and execution: No

Test PASSED

After running this program:

#include <stdio.h>

#include <cuda.h>
#include <cutil.h>

__global__ void F(int n, float *a)
{
    int index = gridDim.x * blockIdx.x + threadIdx.x;
    if (index < n) a[index] = 7;
}

int main()
{
    float *a_h, *a_d;
    int n = (1 << 25);
    printf("Running F over %d elements\n", n);
    int size = n * sizeof(float);
    printf("Allocating host memory\n");
    a_h = (float *)malloc(size);
    printf("Success\n");
    printf("Allocating device memory\n");
    cudaMalloc((void **)&a_d, size);
    printf("Success\n");
    printf("Filling host memory\n");
    for (int i = 0; i < n; i++) a_h[i] = 1;
    printf("Success\n");
    printf("Copying data to device memory\n");
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    printf("Success\n");
    F<<<46702, 512>>>(n, a_d);
    printf("Success\n");
    printf("Copying data to host memory\n");
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("Success\n");
    for (int i = 0; i < 18; i++) printf("%d %f\n", i, a_h[i]);
    printf("\n\n");
    for (int i = n - 18; i < n; i++) printf("%d %f\n", i, a_h[i]);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
I get this result:

Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 7.000000
1 7.000000
2 7.000000
3 7.000000
4 7.000000
5 7.000000
6 7.000000
7 7.000000
8 7.000000
9 7.000000
10 7.000000
11 7.000000
12 7.000000
13 7.000000
14 7.000000
15 7.000000
16 7.000000
17 7.000000

33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000

but when I change the kernel configuration from "F<<<46702, 512>>>(n, a_d)" to "F<<<46703, 512>>>(n, a_d)" I get this result:

Running F over 33554432 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000

33554414 1.000000
33554415 1.000000
33554416 1.000000
33554417 1.000000
33554418 1.000000
33554419 1.000000
33554420 1.000000
33554421 1.000000
33554422 1.000000
33554423 1.000000
33554424 1.000000
33554425 1.000000
33554426 1.000000
33554427 1.000000
33554428 1.000000
33554429 1.000000
33554430 1.000000
33554431 1.000000

It looks like the device refuses to work with such a kernel configuration and returns the argument array untouched. But looking at the device properties, it seems it should be able to handle at least 65535 blocks. What might be causing these problems?

By the way, I've tried changing the kernel function to:

__global__ void F(int n, float *a)
{
    int index = (gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;
    if (index < n) a[index] = 7;
}

And kernel configuration to:

dim3 grid_configuration(512, 512, 1);
dim3 block_configuration(512, 1, 1);
F<<<grid_configuration, block_configuration>>>(n, a_d);

And it works well and fills all the elements of the array. It seems this call can handle up to 512*512*512 = 134217728 array elements, but when I set n = 1 << 26 it gives the following result:

Running F over 67108864 elements
Allocating host memory
Success
Allocating device memory
Success
Filling host memory
Success
Copying data to device memory
Success
Success
Copying data to host memory
Success
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000

67108846 1.000000
67108847 1.000000
67108848 1.000000
67108849 1.000000
67108850 1.000000
67108851 1.000000
67108852 1.000000
67108853 1.000000
67108854 1.000000
67108855 1.000000
67108856 1.000000
67108857 1.000000
67108858 1.000000
67108859 1.000000
67108860 1.000000
67108861 1.000000
67108862 1.000000
67108863 1.000000

And when I try to find the maximal array size the device can work with, setting n = 55650000, the call to cudaMalloc((void **)&a_d, size) throws an exception!

Can’t see anything obviously wrong. You don’t seem to be doing any error checking - you could try doing that.
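For what it's worth, one lightweight way to do that with the CUDA 2.x runtime is to query cudaGetLastError() after a synchronization point. A minimal sketch (the CUDA_CHECK macro name is made up here; cudaThreadSynchronize() is the CUDA 2.x name, later renamed cudaDeviceSynchronize()):

```cuda
// Hedged sketch of post-launch error checking for the CUDA 2.x runtime.
#define CUDA_CHECK(msg)                                              \
    do {                                                             \
        cudaError_t err = cudaGetLastError();                        \
        if (err != cudaSuccess) {                                    \
            printf("%s failed: %s\n", msg, cudaGetErrorString(err)); \
            exit(1);                                                 \
        }                                                            \
    } while (0)

    F<<<46703, 512>>>(n, a_d);
    cudaThreadSynchronize();  // kernel launches are async; sync before querying
    CUDA_CHECK("kernel F");
```

With something like this in place, a failed launch prints a message instead of silently leaving the array untouched behind unconditional "Success" printfs.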

EDIT: What CUDA version are you on?

I'm using version 2.0. Were you trying to launch the code above on your machine?

This is the function I use for calculating the index, compare with yours:

__device__ inline int getutid()
{
    int threadsPerBlock = blockDim.x * blockDim.y;
    int tidWithinBlock = threadIdx.x + threadIdx.y * blockDim.x;
    int gid = blockIdx.x + blockIdx.y * gridDim.x;
    return gid * threadsPerBlock + tidWithinBlock;
}

I’m looking into your code…

If it throws an exception and each element is 4 bytes, that's your memory limit, excluding the framebuffer etc. What's wrong with that?

55650000 * 4 = 222600000

I agree… That might be the issue…

But why can't I launch

__global__ void F(int n, float *a)
{
    int index = gridDim.x * blockIdx.x + threadIdx.x;
    if (index < n) a[index] = 7;
}

with kernel configuration F<<<46703, 512>>>(n, a_d)?

In my case the array size is n = 1 << 25 (n == 33554432, so 33554432 * 4 = 134217728 bytes).

Memory allocation is successful, and a call to F<<<46702, 512>>>(n, a_d) [which is enough only for 46702*512 = 23911424 elements] also succeeds, but when I set the block count above 46702, for example F<<<46703, 512>>>(n, a_d), it stops working, as if it had never been launched.

My system is:

--- CUDA : GeForce 8800 GTX
ComputeCapability = 1.0
totalGlobalMem = 805044224
totalConstMem = 65536
sharedMemPerBlock = 16384
regsPerBlock = 8192
SIMDWidth = 32
maxThreadsPerBlock = 512
maxThreadsDim = ( 512 , 512 , 64 )
maxGridSize = ( 65535 , 65535 , 1 )
clockRate = 1404000
memPitch = 262144
textureAlignment = 256
---------------------------------------------
--- CUDA : GeForce 8800 GT
ComputeCapability = 1.1
totalGlobalMem = 536150016
totalConstMem = 65536
sharedMemPerBlock = 16384
regsPerBlock = 8192
SIMDWidth = 32
maxThreadsPerBlock = 512
maxThreadsDim = ( 512 , 512 , 64 )
maxGridSize = ( 65535 , 65535 , 1 )
clockRate = 1512000
memPitch = 262144
textureAlignment = 256
---------------------------------------------

It uses the first card.

I tried your code with the 46703 configuration, and it flashes the screen and prints all ones. The kernel did not actually do anything because there was an error somewhere. Typing dmesg shows:

[ 1186.595189] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1191.563998] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1192.884388] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1195.156270] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1220.796331] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100
[ 1224.506989] NVRM: Xid (0002:00): 13, 0005 00000000 000050c0 00000368 00000000 00000100

Maybe submit a bug report. I am running 180.22. Can you try with your driver, maybe on a CUDA-only card (no X display)?

Read what NVIDIA said about the NVRM Xid messages:

http://www.nvnews.net/vbulletin/showthread.php?t=91256

Compile with -deviceemu, run it through valgrind, and look for out-of-bounds errors.

Here's the problem, Kex: index overflows. The calculation for index is wrong:

9          int index = gridDim.x * blockIdx.x + threadIdx.x;
(gdb) p gridDim.x
$1 = 46702
(gdb) p blockIdx.x
$2 = 45983
(gdb) p blockIdx.y
$3 = 0
10         if (index < n) a[index] = 7;
(gdb) p index
$1 = -2147469230

This should almost certainly be:

int index = blockDim.x * blockIdx.x + threadIdx.x;