Indexing Errors with a large array

I seem to be having a problem getting CUDA to work with a large array. What I want to do is allocate a 1D float array of about 125 MB on the GPU. I understand that the card limits the block and grid dimensions, so a single block can't index that many elements; you have to spread the work across threads and blocks.
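The usual pattern for covering N elements with one thread each is to round the block count up from the element count. A minimal sketch with illustrative names (N, threadsPerBlock, mykernel, and d_array are placeholders, not from my code; the actual code below uses a 2D grid to the same effect):

int threadsPerBlock = 512;                                 // hardware limit per block on this card
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division so every element gets a thread
mykernel<<< blocks, threadsPerBlock >>>(d_array, N);       // kernel checks if (index < N) before writing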

Here are the details of my card, and some code that illustrates the problem (it simply adds one to each element of an array):

There is 1 device supporting CUDA

Device 0: "GeForce 9800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.51 GHz
Concurrent copy and execution: Yes

Test PASSED

Press ENTER to exit…

#include <cstdio>
#include <cuda_runtime.h>

#define gridsize1 100
#define gridsize2 100
#define threadsize 512

// Kernel that executes on the CUDA device
__global__ void addone(float *a, int N){
    // get the x,y,z coordinates of this thread within the whole grid
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    int z = threadIdx.z + blockIdx.z*blockDim.z;
    int xsize = threadsize*gridsize1; // grid extent in x, in threads
    int ysize = 1*gridsize2;          // grid extent in y
    int zsize = 1;                    // grid extent in z (unused here)

    // convert to a 1D index
    int index = x + y*xsize + z*xsize*ysize;

    if (index < N) a[index] += 1.0f;
}

void cudaaddone(float* input, float* output, int nvoxels){
    int lengthvector = gridsize1*gridsize2*threadsize;
    unsigned int memSize = lengthvector*sizeof(float);
    dim3 threads(threadsize,1,1);     // threads per block
    dim3 grid(gridsize1,gridsize2,1); // blocks in the grid
    float *d_vector;                  // vector on device

    cudaMalloc((void**)&d_vector, memSize);                        // allocate memory on device
    cudaMemcpy(d_vector, input, memSize, cudaMemcpyHostToDevice);  // copy input to device
    addone<<< grid, threads >>>(d_vector, lengthvector);           // launch kernel
    cudaMemcpy(output, d_vector, memSize, cudaMemcpyDeviceToHost); // copy result back
    cudaFree(d_vector);                                            // free memory on device
}

// main routine that executes on the host
int main(int argc, char **argv){
    int nvoxels = gridsize1*gridsize2*threadsize;
    float *data = new float[nvoxels];
    float *result = new float[nvoxels];

    for(int i = 0; i < nvoxels; i++) data[i] = (float)i;
    cudaaddone(data, result, nvoxels);
    for(int i = 0; i < nvoxels; i += nvoxels/10) printf("%.1f, %.1f \n", data[i], result[i]);

    delete[] data;
    delete[] result;
    return 0;
}

When gridsize1=100, gridsize2=100, threadsize=512

I get the correct result:

0.0, 1.0
512000.0, 512001.0
1024000.0, 1024001.0
1536000.0, 1536001.0
2048000.0, 2048001.0
2560000.0, 2560001.0
3072000.0, 3072001.0
3584000.0, 3584001.0
4096000.0, 4096001.0
4608000.0, 4608001.0

but when I try

gridsize1=100, gridsize2=512, threadsize=512

it fails with:
0.0, 1.0
2621440.0, 2621441.0
5242880.0, 5242881.0
7864320.0, 7864321.0
10485760.0, 10485761.0
13107200.0, 13107201.0
15728640.0, 15728641.0
18350080.0, 18350080.0
20971520.0, 20971520.0
23592960.0, 23592960.0

It seems to copy the memory to and from video memory just fine; it's just that the kernel doesn't execute correctly when the data gets too large. Any ideas on why this might be? Just as a note, the video card reports no errors from cudaGetLastError().
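One detail worth knowing about that check: kernel launches are asynchronous, so cudaGetLastError() called right after the launch only reports launch-configuration failures, and execution errors only surface after a synchronize. A minimal sketch of the fuller checking pattern (not part of the code above):

addone<<< grid, threads >>>(d_vector, lengthvector);
cudaError_t err = cudaGetLastError();   // catches bad launch configurations
if (err == cudaSuccess)
    err = cudaThreadSynchronize();      // blocks until the kernel finishes; returns execution errors
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));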

Thanks!

I love puzzles. This one had me stumped for a while.

I think the code is executing, but the problem is that single-precision floats don't have enough precision to record the +1.0 when the original number is so large. They have only 23 bits dedicated to the mantissa (24 if you count the implicit leading "1."). Your data stops showing the +1 right when the numbers exceed 2^24 = 16777216, which is the largest value below which single precision can represent every integer exactly.

Try adding 100.0 instead of 1.0 and see if it makes a difference.
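You can see the effect without the GPU at all; a minimal standalone sketch (plain C, no CUDA required):

#include <stdio.h>

int main(void){
    float big = 16777216.0f;        /* 2^24: above this, floats can only hold even integers */
    printf("%.1f\n", big + 1.0f);   /* prints 16777216.0: the +1 is rounded away */
    printf("%.1f\n", big + 100.0f); /* prints 16777316.0: large enough to survive rounding */
    return 0;
}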

Jamie, you're awesome. That was it. Thanks so much!