Probably a simple answer: Simple CUDA code - unexpected result

I have written (actually modified) a simple CUDA program, shown below. It is simple enough:

  • initialize a one-dimensional array
  • send it to the device
  • have the device square every element
  • send the resulting one-dimensional array back

My problem comes with block and thread sizes.

I am using a one-dimensional grid, but if I choose more than 512 blocks in that grid, then all the answers from block 513 onward are nonsense (usually zero).

So in the code below, if I set N greater than 262143, the array elements beyond index 262143 come back as zero.

I thought the maximum grid size was 65535 x 65535, yet here I seem to be limited to 512 x 1 (see the quick capacity check below).
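
For scale, this is the back-of-the-envelope check I had in mind, assuming one element per thread as in the kernel below (the limits are the ones deviceQuery reports further down):

#include <stdio.h>

/* rough 1D-grid capacity check: one element per thread */
#define MAX_GRID_X  65535   /* maximum grid.x from deviceQuery */
#define MAX_BLOCK_X 512     /* maximum threads per block from deviceQuery */

int main(void)
{
    long max_elements = (long)MAX_GRID_X * MAX_BLOCK_X;
    printf("max elements in a 1D launch: %ld\n", max_elements);  /* 33553920 */
    return 0;
}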

deviceQuery gives this:

Device 0: "Quadro FX 3700M"
CUDA Driver Version: 3.10
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.38 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…

Code is shown below.

============================================================

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N 262143

#define BLOCK_SIZE 512
extern "C"
__global__ void simpleKernel(float *out, float *in)
{
int index;
index = blockIdx.x*blockDim.x+threadIdx.x;
out[index]=in[index]*in[index];
}

int main(int argc, char ** argv)

{

// Declare variables on host

int pad, i;
size_t size;
pad = BLOCK_SIZE - (N % BLOCK_SIZE);
size = (N+pad)*sizeof(float);
float *s_host = (float *)malloc(size);
float *r_host = (float *)malloc(size);

// Declare variables on device
float *r_device;
float *s_device;
cudaMalloc(&r_device, size);
cudaMalloc(&s_device, size);

// Declare CUDA objects
dim3 threads(BLOCK_SIZE);
dim3 grid( (N+pad)/BLOCK_SIZE );

// Initialize data in r_host
for (i=0;i<N;i++)
{
r_host[i]=i;
}

cudaMemcpy(r_device, r_host, size, cudaMemcpyHostToDevice);

simpleKernel <<< threads, grid>>> (s_device, r_device);

cudaMemcpy(s_host, s_device, size, cudaMemcpyDeviceToHost);

for (i=0;i<N;i++)
{
printf("%3f %3f\n",r_host[i],s_host[i]);
}

printf(" pad = %d",pad);
printf(" threads.x = %d",threads.x);
printf(" grid.x = %d",grid.x);

free(s_host);
free(r_host);

cudaFree(r_device);
cudaFree(s_device);

}

This:
simpleKernel <<< threads, grid>>> (s_device, r_device);

should be:
simpleKernel <<< grid, threads >>> (s_device, r_device);
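
With the arguments swapped, the launch still happens to cover all N+pad elements as long as (N+pad)/BLOCK_SIZE is at most 512; beyond that the block dimension exceeds the 512-threads-per-block limit, the launch fails outright, and the output array is never written (it typically reads back as zeros).

For completeness, a minimal sketch of the corrected configuration, with a bounds check added in case N is not an exact multiple of BLOCK_SIZE. The names mirror your code above; the extra n parameter and the guard are additions, not part of your original code:

__global__ void simpleKernel(float *out, const float *in, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index < n)                       /* skip the padded tail */
        out[index] = in[index]*in[index];
}

/* host side: grid first, then block */
dim3 threads(BLOCK_SIZE);
dim3 grid( (N + BLOCK_SIZE - 1)/BLOCK_SIZE );   /* enough blocks to cover N */
simpleKernel<<< grid, threads >>>(s_device, r_device, N);

Checking cudaGetLastError() right after the launch would also have flagged the invalid configuration instead of silently leaving zeros:

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));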

OK - that's embarrassing!!

Thanks much.
