Hi
I wanted to try CUDA programming, so I found a very simple program with a very simple kernel to start with.
Here is the code, which is supposed to square an array:
#include "cuda/cuda_runtime.h"
#include "cuda/cuda.h"
#include <iostream>
#include <algorithm>
#include <iterator>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h> // malloc, free
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;  // Pointers to host & device arrays
    const int N = 10;  // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &a_d, size);  // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

    // Cleanup
    free(a_h);
    cudaFree(a_d);
}
The problem is that the kernel is not launched: the output array is not modified, and cudaGetLastError confirms that the launch fails, but I don’t know why.
I’m running on CentOS 6.1.
Here are the specs of my NVIDIA card, which should be sufficient:
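For reference, a minimal sketch of how a launch can be checked with cudaGetLastError. The helper name `check` is hypothetical (it is not part of the CUDA API or of the code above), and cudaDeviceSynchronize assumes CUDA 4.0 or later; older toolkits would use cudaThreadSynchronize instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: reports a CUDA error and returns false if one occurred.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main(void)
{
    const int n = 10;
    float *a_d = NULL;
    if (!check(cudaMalloc((void **)&a_d, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!check(cudaMemset(a_d, 0, n * sizeof(float)), "cudaMemset")) return 1;

    square_array<<<3, 4>>>(a_d, n);

    // Errors detected at launch time (bad configuration, no binary for this GPU)
    // are reported by cudaGetLastError right after the launch.
    bool ok = check(cudaGetLastError(), "kernel launch");

    // Errors raised while the kernel runs are reported once the host synchronizes.
    ok = ok && check(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(a_d);
    return ok ? 0 : 1;
}
```

Checking both points separates a launch that never started from a kernel that started and then failed.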
./mon_exec
CUDA Device Query...
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 1
Minor revision number: 1
Name: Quadro NVS 295
Total global memory: 267714560
Total shared memory per block: 16384
Total registers per block: 8192
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 0 of block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 1
Clock rate: 1300000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: No
Number of multiprocessors: 1
Kernel execution timeout: Yes
Press any key to exit...
If I use cudaCheckError(), I get the following message; the only problem is that I don’t have any device…
cudaCheckError() failed at main.cu:101 : invalid device function .
I have looked at several forums but didn’t find an answer. I can copy data to the GPU memory and retrieve it without problems; the only thing that fails is the kernel launch.
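One common cause of "invalid device function" is that the executable contains no kernel binary or PTX usable by the GPU, for example when the code is built for a newer architecture than a compute capability 1.1 card. Whether that is what happens here is an assumption, not something the output above proves. A minimal sketch that could be used to check it, comparing the device's compute capability with what the kernel was compiled for:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

int main(void)
{
    // Compute capability of device 0 (1.1 for the card listed above).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    // Ask the runtime which build of the kernel it would use on this device.
    // This call fails if no compatible binary or PTX is embedded.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, square_array);
    if (err != cudaSuccess) {
        fprintf(stderr, "No usable binary for square_array: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Both versions are encoded as major * 10 + minor (for example 11 for 1.1).
    printf("binary version %d, PTX version %d\n",
           attr.binaryVersion, attr.ptxVersion);
    return 0;
}
```

If the query fails, rebuilding with an architecture flag that matches the card, such as `nvcc -arch=sm_11 main.cu`, may help, assuming the installed toolkit still supports sm_11 (newer toolkits have dropped it).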
Thank you in advance for your help, feel free to ask me if you need more information.
Thomas