Hi All,
I have a very simple CUDA program that I grabbed off the web to play with ("My first CUDA program!" on Parallel Panorama). It squares the integers from 0 to 9 (N = 10 elements). It works fine when run on my GeForce 9800 GT card, but it does not work in emulation mode: it crashes in the first call to cudaMalloc, with this in the debugger output:
First-chance exception at 0x75c39617 in CudaHelloWorld.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0014f7dc…
First-chance exception at 0x75c39617 in CudaHelloWorld.exe: Microsoft C++ exception: cudaError at memory location 0x0014f854…
After modifying the code to check the return value of cudaMalloc (a good habit to get into), I find that it returns 11, which cudaGetErrorString(11) reports as “invalid argument”. I then added a call rv = cudaGetDeviceCount(&device_count). It returns rv=0 and device_count=1, so there is one device, at least in theory. So I would expect cudaMalloc() to work.
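For anyone who wants to reproduce the check, here is a minimal sketch of that kind of error checking. It is not necessarily the exact code used: CHECK_CUDA is a hypothetical macro name, not part of the CUDA API, while cudaGetDeviceCount, cudaMalloc, cudaFree and cudaGetErrorString are the runtime calls already mentioned above.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): runs a runtime call and,
// on failure, prints the numeric code plus cudaGetErrorString() and exits.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed at %s:%d: error %d (%s)\n",     \
                    #call, __FILE__, __LINE__, (int)err_,              \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main(void)
{
    // Ask the runtime how many devices it can see.
    int device_count = 0;
    CHECK_CUDA(cudaGetDeviceCount(&device_count));
    printf("device_count = %d\n", device_count);

    // Same allocation as in the program: 10 floats on the device.
    float *a_d = NULL;
    const size_t size = 10 * sizeof(float);
    CHECK_CUDA(cudaMalloc((void **)&a_d, size));

    CHECK_CUDA(cudaFree(a_d));
    return 0;
}
```

Wrapping every runtime call this way makes the first failing call report its own file and line, instead of the failure surfacing later as a first-chance exception.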
Here is the program:
// example1.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;                   // Pointers to host & device arrays
    const int N = 10;                   // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &a_d, size);   // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}
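To rule out the launch configuration as the culprit, the block-count arithmetic and the idx < N guard from the program can be replayed on the host without any CUDA device. This is only a sketch: ceil_div and count_active_threads are hypothetical helper names, not CUDA functions, and the nested loops merely stand in for blockIdx.x and threadIdx.x.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements,
// using the same rounding-up expression as the program.
int ceil_div(int n, int block_size)
{
    assert(block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

// Hypothetical helper: replays idx = blockIdx.x * blockDim.x + threadIdx.x
// for every launched thread and counts those that pass the idx < n guard.
int count_active_threads(int n, int block_size)
{
    int active = 0;
    const int n_blocks = ceil_div(n, block_size);
    for (int block = 0; block < n_blocks; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            const int idx = block * block_size + thread;
            if (idx < n) ++active;
        }
    }
    return active;
}
```

With N = 10 and block_size = 4, ceil_div(10, 4) gives 3 blocks, so 12 threads are launched and count_active_threads(10, 4) reports that 10 of them pass the guard. The kernel therefore never indexes past the end of the array.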
I am working on: Windows 7, GeForce 9800 GT, VS 2008, CUDA 3.0, GPU Computing SDK 3.0, CUDA build rule 3.0.14, on a quad-core machine with 4 GB of memory. In the CUDA build rule, I set emulation mode to “Yes”, which then sets the appropriate options, and I am linking to cudart.lib. (See the attached .zip project.) I’ve seen similar questions around the web, but no one seems to have figured out why cudaMalloc crashes. Has anyone seen something similar and figured out the problem?
CudaHelloWorld.zip (265 KB)