Hi there,
I am new to CUDA. I tried a simple CUDA example posted on the
web. The code is supposed to square each element of the array. It
compiles successfully, but it does not compute the squares.
Instead, it just prints each element of the array unchanged:
0 0.000
1 1.000
…
9 9.000
I have checked the return values of cudaMalloc and cudaMemcpy, and
every call returns cudaSuccess. The code is copied below. Could anyone
help me figure out what the problem is? Many thanks.
System information:
Quadro NVS 140M (has passed the bandwidthTest and deviceQuery tests)
Windows XP
MS Visual C++ Express
// example1.cpp : Defines the entry point for the console application.
//
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;                 // Pointers to host & device arrays
    const int N = 10;                 // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);      // Allocate array on host
    // Allocate array on device
    if (cudaMalloc((void **) &a_d, size) != cudaSuccess) printf("ad wrong");
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    if (cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice) != cudaSuccess)
        printf("Error in memcpy");
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    if (cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost) != cudaSuccess)
        printf("Error in memcpy\n");
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);