Calculating pi with CUDA

Here is the NVIDIA device query output:

/opt/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: “Tesla S2050”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817982464 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No

Devices 1, 2, and 3 are also “Tesla S2050” and report identical specifications to Device 0.

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla S2050, Device = Tesla S2050

PASSED

Press <Enter> to Quit…

And the code is:

#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define NUM_THREAD 1024

__global__ void cal_pi(float *sum, long long nbin, float step, long long nthreads, long long nblocks) {
	long long i;
	float x;
	long long idx = blockIdx.x*blockDim.x + threadIdx.x;

	for (i = idx; i < nbin; i += nthreads*nblocks) {
		x = (i+0.5)*step;
		sum[idx] = sum[idx] + 4.0/(1.+x*x);
	}
}

int main(void) {

long long tid;
float pi = 0;
long long num_steps = 10000000;

dim3 numBlocks(NUM_THREAD*NUM_THREAD*(int)sqrt(NUM_THREAD),1,1);	
dim3 threadsPerBlock(NUM_THREAD,1,1);

float *sumHost, *sumDev;
float step = 1./(float)num_steps;

long long size = NUM_THREAD*NUM_THREAD*NUM_THREAD*(int)sqrt(NUM_THREAD)*sizeof(float);
// clock_t before, after;

sumHost = (float *)malloc(size);
cudaMalloc((void **)&sumDev, size);

// Initialize array in device to 0
cudaMemset(sumDev, 0, size);

// before = clock();
// Do calculation on device
printf("Before Compute \n\n");
cal_pi <<<numBlocks, threadsPerBlock>>> (sumDev, (int)num_steps, step, NUM_THREAD, NUM_THREAD*NUM_THREAD*(int)sqrt(NUM_THREAD) ); // call CUDA kernel
printf("After Compute \n\n");
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);
printf("After Copy \n\n");
for(tid=0; tid<NUM_THREAD*NUM_THREAD*NUM_THREAD*(int)sqrt(NUM_THREAD); tid++){
	pi = pi+sumHost[tid];
	printf("The value of PI is %d\n",tid);
}
pi = pi*step;
//after = clock();
printf("The value of PI is %15.12f\n",pi);
//printf("The time to calculate PI was %f seconds\n",((double)(after - before)/1000.0));
free(sumHost); 
cudaFree(sumDev);

return 0;

}

My problem is that pi’s value comes out as 0.000000. Why not 3.1415…?

Please tell me why, or edit my code.

Thank you.

The kernel never gets launched, because you’re trying to launch it with a grid size of NUM_THREAD*NUM_THREAD*sqrt(NUM_THREAD) = 33,554,432 blocks, which is way above the allowed maximum grid dimension (which is, per the output of your device query, 65535).
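One way to fix it, sketched below under the assumption that the kernel itself stays unchanged (`NUM_BLOCK` is a constant I am introducing, not a name from your code): launch a grid that stays within the limit and let the kernel’s existing stride loop cover all `nbin` bins.

```cuda
// Sketch of a corrected launch configuration.
// NUM_BLOCK is a hypothetical constant chosen to stay well under the
// 65535 maximum grid dimension reported by deviceQuery.
#define NUM_THREAD 1024
#define NUM_BLOCK  1024

dim3 numBlocks(NUM_BLOCK, 1, 1);         // 1024 blocks <= 65535, so the launch is valid
dim3 threadsPerBlock(NUM_THREAD, 1, 1);

// One partial sum per thread, so the sum array shrinks accordingly.
long long size = (long long)NUM_BLOCK * NUM_THREAD * sizeof(float);

// The kernel's loop `for (i = idx; i < nbin; i += nthreads*nblocks)`
// already strides over all bins, so fewer blocks still cover num_steps.
cal_pi<<<numBlocks, threadsPerBlock>>>(sumDev, num_steps, step, NUM_THREAD, NUM_BLOCK);
```

The host-side reduction loop then sums `NUM_BLOCK*NUM_THREAD` entries instead of the original (much larger) count.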

That’s why it helps to put error-checking after each CUDA call.
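A common pattern for this (a sketch; the macro name `gpuErrchk` is my own choice, but the runtime calls are standard CUDA API) is to wrap every runtime call in a checking macro and to query the error state explicitly after each kernel launch:

```cuda
// Error-checking helper: wraps a CUDA runtime call and aborts with a
// readable message if it did not return cudaSuccess.
#include <stdio.h>
#include <stdlib.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
static void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Usage in your main():
// gpuErrchk(cudaMalloc((void **)&sumDev, size));
// cal_pi<<<numBlocks, threadsPerBlock>>>(...);
// gpuErrchk(cudaGetLastError());         // catches an invalid launch configuration
// gpuErrchk(cudaThreadSynchronize());    // catches errors during kernel execution
//                                        // (cudaDeviceSynchronize in newer toolkits)
```

With this in place, your oversized grid would have reported an invalid-configuration error at the launch site instead of silently leaving `sumDev` all zeros.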