Probably a simple answer: Simple CUDA code - unexpected result

I have written (actually modified) a simple CUDA program, shown below. It is simple enough:

  • initialize a one-dimensional array
  • send it to the device
  • have the device square every element
  • send the resulting one-dimensional array back

My problem comes with block and thread sizes.

I am using a one-dimensional grid, but if I choose more than 512 blocks in that grid, then all the answers from block 513 onward are nonsense (usually zero).

So in the code below, if I set N greater than 262143, the array elements beyond index 262143 come back as zero.

I thought the maximum grid size was 65535 x 65535, yet here I seem to be limited to 512 x 1 (see the quick capacity check below).
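
For scale, this is the back-of-the-envelope check I had in mind, assuming one element per thread as in the kernel below (the limits are the ones deviceQuery reports further down):

#include <stdio.h>

/* rough 1D-grid capacity check: one element per thread */
#define MAX_GRID_X  65535   /* maximum grid.x from deviceQuery */
#define MAX_BLOCK_X 512     /* maximum threads per block from deviceQuery */

int main(void)
{
    long max_elements = (long)MAX_GRID_X * MAX_BLOCK_X;
    printf("max elements in a 1D launch: %ld\n", max_elements);  /* 33553920 */
    return 0;
}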

deviceQuery gives this:

Device 0: "Quadro FX 3700M"
CUDA Driver Version: 3.10
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.38 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…

Code is shown below.

============================================================

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N 262143

#define BLOCK_SIZE 512
extern "C"
__global__ void simpleKernel(float *out, float *in)
{
int index;
index = blockIdx.x*blockDim.x+threadIdx.x;
out[index]=in[index]*in[index];
}

int main(int argc, char ** argv)

{

// Declare variables on host

int pad, i;
size_t size;
pad = BLOCK_SIZE - (N % BLOCK_SIZE);
size = (N+pad)*sizeof(float);
float *s_host = (float *)malloc(size);
float *r_host = (float *)malloc(size);

// Declare variables on device
float *r_device;
float *s_device;
cudaMalloc(&r_device, size);
cudaMalloc(&s_device, size);

// Declare CUDA objects
dim3 threads(BLOCK_SIZE);
dim3 grid( (N+pad)/BLOCK_SIZE );

// Initialize data in r_host
for (i=0;i<N;i++)
{
r_host[i]=i;
}

cudaMemcpy(r_device, r_host, size, cudaMemcpyHostToDevice);

simpleKernel <<< threads, grid>>> (s_device, r_device);

cudaMemcpy(s_host, s_device, size, cudaMemcpyDeviceToHost);

for (i=0;i<N;i++)
{
printf("%3f %3f\n",r_host[i],s_host[i]);
}

printf(" pad = %d",pad);
printf(" threads.x = %d",threads.x);
printf(" grid.x = %d",grid.x);

free(s_host);
free(r_host);

cudaFree(r_device);
cudaFree(s_device);

}

This:
simpleKernel <<< threads, grid>>> (s_device, r_device);

should be:
simpleKernel <<< grid, threads >>> (s_device, r_device);
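
With the arguments swapped, the launch still happens to cover all N+pad elements as long as (N+pad)/BLOCK_SIZE is at most 512; beyond that the block dimension exceeds the 512-threads-per-block limit, the launch fails outright, and the output array is never written (it typically reads back as zeros).

For completeness, a minimal sketch of the corrected configuration, with a bounds check added in case N is not an exact multiple of BLOCK_SIZE. The names mirror your code above; the extra n parameter and the guard are additions, not part of your original code:

__global__ void simpleKernel(float *out, const float *in, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index < n)                       /* skip the padded tail */
        out[index] = in[index]*in[index];
}

/* host side: grid first, then block */
dim3 threads(BLOCK_SIZE);
dim3 grid( (N + BLOCK_SIZE - 1)/BLOCK_SIZE );   /* enough blocks to cover N */
simpleKernel<<< grid, threads >>>(s_device, r_device, N);

Checking cudaGetLastError() right after the launch would also have flagged the invalid configuration instead of silently leaving zeros:

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));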

OK - that's embarrassing!!

Thanks much.
