Hiya! I’m completely new to CUDA.
I have a GeForce 6800 and WinXP SP2 with VS2005. I am trying to initialize CUDA in “emulation mode” (I’m waiting until March to get a GeForce 8300 because I don’t have the $$$ for an 8800 yet :P).
Can I debug and use CUDA with that configuration?
I also have another problem… I installed the 97.73 drivers, CUDA and the SDK, but when I call cuInit() it returns an error (CUDA_ERROR_NO_DEVICE, and yep, I’m using the DEVICE_EMULATION directive).
How can I program and debug in emulation mode until I get an 8300?
I also have a question… Imagine I want to write a program that performs 1 million dot products. I write the CPU app using VS2005, then I init CUDA and write the .cu that performs the dot products on the GPU. Do I then load the compiled .cu module using cuModuleLoad, execute it, sync the threads and read the data back from the GPU to the CPU? Or does my WHOLE program need to be compiled with nvcc.exe?
Another question… I know CUDA defines float4, float3, etc. vector types like HLSL/GLSL. However, I can’t find the built-in intrinsic functions in the docs… Can I use dot(), cross() and normalize()? Do I need BLAS for this? I got some errors with the -, + and += operators…
Also an observation… the SDK is a little confusing atm. The examples and .h headers really need many more comments. It would be good to add simpler examples, for instance one that performs 1 million sequential dot products and reads the result back to the CPU (the matrix_drv example is good but a little complicated).
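On the float3/float4 question above: as far as I know, the toolkit headers declare these vector types as plain structs without arithmetic operators or dot()/cross()/normalize() helpers, which would explain the operator errors. BLAS is not needed for per-vector math like this; a few small helper functions are enough. Below is a minimal sketch, not an official API. Vec3, make_vec3 and HOST_DEVICE are hypothetical names made up for this example; the same overloads could presumably be written against the built-in float3 when compiling with nvcc.

```cpp
#include <math.h>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++
// compiler, so the helpers can be used on both the host and the device.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for CUDA's float3 (assumption: not a toolkit type).
struct Vec3 { float x, y, z; };

HOST_DEVICE inline Vec3 make_vec3(float x, float y, float z) {
    Vec3 v; v.x = x; v.y = y; v.z = z;
    return v;
}

// Component-wise arithmetic.
HOST_DEVICE inline Vec3 operator+(Vec3 a, Vec3 b) {
    return make_vec3(a.x + b.x, a.y + b.y, a.z + b.z);
}
HOST_DEVICE inline Vec3 operator-(Vec3 a, Vec3 b) {
    return make_vec3(a.x - b.x, a.y - b.y, a.z - b.z);
}
HOST_DEVICE inline Vec3& operator+=(Vec3& a, Vec3 b) {
    a.x += b.x; a.y += b.y; a.z += b.z;
    return a;
}
// Scale by a scalar.
HOST_DEVICE inline Vec3 operator*(Vec3 a, float s) {
    return make_vec3(a.x * s, a.y * s, a.z * s);
}

HOST_DEVICE inline float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

HOST_DEVICE inline Vec3 cross(Vec3 a, Vec3 b) {
    return make_vec3(a.y * b.z - a.z * b.y,
                     a.z * b.x - a.x * b.z,
                     a.x * b.y - a.y * b.x);
}

// Returns v scaled to unit length. The caller must not pass a zero vector,
// because that would divide by zero.
HOST_DEVICE inline Vec3 normalize(Vec3 v) {
    float inv_len = 1.0f / sqrtf(dot(v, v));
    return v * inv_len;
}
```

With these in a shared header, an expression such as `dot(a, cross(b, c))` or `sum += a - b` should read the same in host code and inside a kernel.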
cuInit() is the initialization function for the driver API, which only supports actual hardware at the moment.
I’d suggest you try compiling one of the samples in device emulation mode. Most of them are written against the CUDA runtime (CUDART), not the driver API - the driver API ones end in _drv, e.g. matrixmul_drv.
Yep yep! That’s what I want, to run it using the emulation mode.
Oh, I understand now… There are two options (correct me if I’m wrong please):
Use the CUDA driver API ( starts with cuXXXX and uses the HW).
Use the CUDA runtime API ( starts with cudaXXXXX and uses cudart.dll to interact with the driver and is easier to use ).
Emulation mode is only available through cudart.dll.
CUDA C code can basically run in two places:
Device: on the GPU hardware (GeForce 8).
Host: on the main CPU (Pentium, etc.).
But here is something I don’t understand… NVCC.exe can compile BOTH the host and the GPU (device) parts. The GPU part is like an HLSL/GLSL shader, but with C features such as pointers, etc…
The question is… Do I really need the HOST part? That’s the thing I don’t understand. GCC/VS2005 have C++, CLR/CLI, templates and other things that NVCC could not easily handle… Looking at the examples, I see you use NVCC to compile an OBJ using “custom build steps”… Do you do that to provide the emulation mode when a GF8 is not present?
Wouldn’t it be easier to provide a cuda_softwareReferenceDevice.dll with the SDK in case the developer has no GF8 installed? I really don’t like compiling my “host” program with NVCC… I just want to use NVCC to compile the GPU part, then call cuLaunch() and read the results back to system memory…
Another thing that I find confusing is the proposed synchronization (__shared__, __syncthreads(), etc.) and parallelization (threadIdx.x, blockIdx.x, banks). Have you considered using a standard subset of OpenMP (#pragma omp for/block) to hide all that complexity?
NVCC is a compiler driver. It can invoke the CUDA compiler or gcc or the MS compiler & linker depending on the options and the source files passed to it. It works much like gcc does in this respect – both are compiler drivers. Please see the NVCC manual for more information.
Yes, you need a host portion of your application in order to load data onto the GPU and to invoke CUDA (GPU) kernels.
We use custom build steps to invoke NVCC to compile the CUDA code for both emulation and device (aka non-emu) configurations.
You have to use the CUDA runtime API to get device emulation. The driver API doesn’t support it. That said, you only have to compile the portion of your application that calls cuda* functions and invokes kernels (<<<>>>) using nvcc. You can separate other portions into other files / compilation units and compile those using the MS compiler or another compiler and then link them with the objects generated by nvcc. For an example of this, see the cppIntegration sample in the SDK.
Unfortunately, going with a less-explicit method of exposing parallelism and synchronization would make it harder to access all the performance of the GPU. Not all problems can be parallelized efficiently with OpenMP pragmas. We chose to develop extensions to C that map very closely to the way the hardware works. This has a cost in terms of the learning curve, but we feel the benefits in performance and flexibility are well worth it.