Indexing Errors with a large array

I seem to be having a problem getting cuda to work with a large array. What I want to do is allocate a 1D float array on the GPU of about 125 Mbytes. I understand that there are limitations on the card so you can’t directly allocate that amount of memory in one dimension, so you have to use threads and blocks to do it.

Here is the details of my card, and some code that illustrates the problem (it simply adds one to each element of an array):

There is 1 device supporting CUDA

Device 0: “GeForce 9800 GT”
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.51 GHz
Concurrent copy and execution: Yes


Press ENTER to exit…

#define gridsize1 100
#define gridsize2 100
#define threadsize 512

// Kernel that executes on the CUDA device
global void addone(float a, int N){
//get x,y,z
int x = threadIdx.x + blockIdx.x
int y = threadIdx.y + blockIdx.yblockDim.y;
int z = threadIdx.z + blockIdx.z
int xsize = threadsizegridsize1;
int ysize = 1
int zsize = 1;

//convert to 1D index
int index = x + yxsize + zxsize*ysize;

if (index<N) a[index] += 1.0;

void cudaaddone(float* input,float* output,int nvoxels){
int lengthvector = gridsize1gridsize2threadsize;
unsigned int memSize = lengthvector*sizeof(float);
dim3 threads(threadsize,1,1); // threads per block
dim3 grid(gridsize1,gridsize2,1);
float *d_vector; // vector on device

cudaMalloc((void**)&d_vector, memSize ); // allocate memory on device
cudaMemcpy(d_vector, input, memSize, cudaMemcpyHostToDevice);
addone<<< grid, threads >>> (d_vector,lengthvector);
cudaMemcpy(output, d_vector, memSize, cudaMemcpyDeviceToHost);
cudaFree(d_vector); // free memory on device

// main routine that executes on the host
int main(int argc, char ** argv){
int nvoxels = gridsize1gridsize2threadsize;
float *data = new float[nvoxels];
float *result = new float[nvoxels];

for(int i=0;i<nvoxels;i++) data[i]=(float)i;
for(int i=0;i<nvoxels;i+=nvoxels/10) printf("%.1f, %.1f \n", data[i],result[i]);

delete data;
delete result;

When gridsize1=100, gridsize2=100, threadsize=512

I get the correct result:

0.0, 1.0
512000.0, 512001.0
1024000.0, 1024001.0
1536000.0, 1536001.0
2048000.0, 2048001.0
2560000.0, 2560001.0
3072000.0, 3072001.0
3584000.0, 3584001.0
4096000.0, 4096001.0
4608000.0, 4608001.0

but when I try

gridsize1=100, gridsize2=512, threadsize=512

it fails with:
0.0, 1.0
2621440.0, 2621441.0
5242880.0, 5242881.0
7864320.0, 7864321.0
10485760.0, 10485761.0
13107200.0, 13107201.0
15728640.0, 15728641.0
18350080.0, 18350080.0
20971520.0, 20971520.0
23592960.0, 23592960.0

It seems to copy the memory to and from the Video memory just fine, its just that the kernel doesn’t execute correctly when the size of data is too much. Any ideas on why this might be?? Just as a note, The video card reports no errors with cudaGetLastError().


I love puzzles. This one had me stumped for a while.

I think the code is executing but the problem is that floating point numbers don’t have enough precision to record the +1.0 when the original number is so large. They only have 23 bits dedicated to the mantissa (or 24 if you include the initial “1.”). Your data happens to quit including the +1 right when the numbers exceed 2^24, which is also the largest integer that floating point can represent exactly.

Try adding 100.0 instead of 1.0 and see if it makes a difference.

Jamie your awesome. That was it. Thanks so much