CUDA host code compiling

Hi,

I originally posted this on the CUDA for Windows XP board, but thought it may be of more interest here…

I have been looking into using CUDA and OpenCL for a mathematical simulation model. I originally used CUDA, which gave me a speed-up of up to 6x for certain parts of my code. However, when I converted the CUDA code to OpenCL, the speed-up was much smaller (and in some cases the code was actually slower).

I tracked the difference down to the host code: iterating over an STL vector was much slower in the C++ host code of the OpenCL project than in the CUDA host code of the CUDA project. So I tried comparing plain C++ host code with CUDA host code directly, to see whether the compiler makes a difference… and it does.

The simple example is given below:

TestVector.cpp

[codebox]
#include <stdio.h>   // printf (the forum appears to have eaten the header name)
#include <time.h>    // clock, CLOCKS_PER_SEC
#include <vector>

void TestCUDA();

void TestCPP()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
}

int main( int argc, char** argv)
{
    clock_t beginTime = clock();
    TestCPP();
    clock_t endTime = clock();
    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("C++ calculation done in %.2f milliseconds.\n", differenceMilliSeconds);

    beginTime = clock();
    TestCUDA();
    endTime = clock();
    differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("CUDA calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}
[/codebox]

and the CUDA code is:

TestVector.cu

[codebox]
#include <vector>   // the forum appears to have eaten the header name

__host__ void TestCUDA()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
}
[/codebox]

When I run the application I get the following results:

C++ calculation done in 3812.00 milliseconds.

CUDA calculation done in 281.00 milliseconds.

Why is it so much faster in the CUDA host code than in the C++ host code? I thought the host code in a .cu file was compiled with the same compiler as the plain C++ code. Is this not correct?

Any help would be appreciated!

Probably just optimization settings. nvcc turns on a lot of host-compiler optimizations by default, but if the host build for your OpenCL version doesn’t do the same, it makes sense that the resulting code would not be as fast.
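One way to check is to compare what nvcc passes to the host compiler with what your own build uses. A rough sketch, assuming nvcc and MSVC’s cl.exe are on your PATH (the exact flags nvcc forwards vary by toolkit version, so treat the commands below as illustrative):

```shell
# Print the sub-commands nvcc runs, including the cl.exe invocation
# and the optimization flags it passes for host code:
nvcc -v -c TestVector.cu

# Forward a specific optimization level to the host compiler explicitly:
nvcc -Xcompiler "/O2" -c TestVector.cu

# Then make sure the plain C++ / OpenCL build gets the same level:
cl /O2 /EHsc TestVector.cpp
```

If the two builds use different optimization levels, the STL iterator loops (which are heavily inlined at /O2 and above) can easily differ by an order of magnitude.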

That seemed to work, thanks. I changed my optimization setting from /O2 to /Ox and now I get approximately the same speeds. Does anybody know what effect this will have on the output (if any)? MSDN says that “In general, /O2 should be preferred over /Ox and /O1 over /Oxs”, but doesn’t say why. Anyone have any ideas?