Cannot get emulation working.

Hi All,

I have a very simple CUDA program that I grabbed off the web to play with (http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/). This program computes the squares of the integers from 0 to 9. It works fine when run on my GeForce 9800 GT card, but it does not work in emulation mode. It crashes in the first call to cudaMalloc:

First-chance exception at 0x75c39617 in CudaHelloWorld.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0014f7dc…
First-chance exception at 0x75c39617 in CudaHelloWorld.exe: Microsoft C++ exception: cudaError at memory location 0x0014f854…

After modifying the code to check the return value of cudaMalloc (a good habit to get into), I find that it returns 11, which according to cudaGetErrorString(11) is “invalid argument”. I then added a call rv = cudaGetDeviceCount(&device_count). It returns rv=0 and device_count=1, so there is one device, at least in theory. So I would think cudaMalloc() should work.

Here is the program:

// example1.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;                  // Pointers to host & device arrays
    const int N = 10;                  // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &a_d, size);  // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);  // round up
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}
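
For anyone who wants to reproduce the return-value checks described above, here is a minimal sketch of how they might look. The CHECK_CUDA macro is a hypothetical helper of my own naming, not part of the original program or of the CUDA toolkit; it only wraps the standard cudaGetErrorString call.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original program): if a CUDA runtime call
// does not return cudaSuccess, print the numeric code and its string, then exit.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed at %s:%d: %d (%s)\n", #call,    \
                    __FILE__, __LINE__, (int)err_,                     \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main(void)
{
    // Ask the runtime how many devices it can see.
    int device_count = 0;
    CHECK_CUDA(cudaGetDeviceCount(&device_count));
    printf("device_count = %d\n", device_count);

    // Same allocation as in the program above, but with the result checked.
    const int N = 10;
    float *a_d = NULL;
    CHECK_CUDA(cudaMalloc((void **)&a_d, N * sizeof(float)));
    CHECK_CUDA(cudaFree(a_d));
    return 0;
}
```

With a check like this, the failure shows up as a readable message at the cudaMalloc line instead of a first-chance exception in the debugger output.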

I am working on: Windows 7, GeForce 9800 GT, VS 2008, CUDA 3.0, GPU Computing SDK 3.0, and CUDA build rule 3.0.14, on a quad-core machine with 4 GB of memory. In the CUDA build rule, I set emulation mode to “Yes”, which sets the appropriate options, and I link against cudart.lib. (See the attached .zip project.) I’ve seen similar questions around the web, but no one seems to have figured out why cudaMalloc crashes. Has anyone seen something similar and figured out the problem?
CudaHelloWorld.zip (265 KB)

I figured it out. Apparently, you must now (as of version 3.0+) link with cudartemu.lib if you are building the application for emulation mode. According to the release notes (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/cudatoolkit_release_notes_windows.txt):

“Separate Library for Runtime Device Emulation- Cudart has now been split up into two libraries. Cudartemu should be linked with for device emulation, similar to the way in which Cublasemu/Cufftemu were previously used.”

Incredibly, a Google search for cudartemu.lib returns only 10 results, and that includes the release notes mentioned above. Where is this in the NVIDIA CUDA™ Programming Guide Version 3.0 (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf)? That document says:

“Linking code compiled for device emulation with code compiled for device execution causes the following runtime error to be returned upon initialization: cudaErrorMixedDeviceExecution.” But, later on, it says: “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.” Sure enough, there is no way to explicitly call initialization and get “cudaErrorMixedDeviceExecution”.
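
Since the runtime initializes lazily, one commonly used idiom is to make a harmless runtime call first and inspect its error code before doing any real work. The sketch below uses cudaFree(0) for that purpose. Whether a mislinked emulation build would actually report cudaErrorMixedDeviceExecution at this point is an assumption I have not verified; in my case the first failing call was cudaMalloc, and it returned 11.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    // cudaFree(0) frees nothing, but as the first runtime call outside the
    // device/version management functions it triggers runtime initialization.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "runtime initialization failed: %d (%s)\n",
                (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("runtime initialized\n");
    return 0;
}
```

Putting this at the top of main() at least separates "the runtime could not start" from "this particular allocation was wrong".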

Who is the Einstein who decided to make this change? I have to say, this is a bad API change, and it should be documented more clearly.