__host__ code compiled using the C++ compiler?

hello,

I have been looking into using CUDA and OpenCL for a mathematical simulation model. I originally used CUDA, which gave me a speed up of up to 6 times for certain aspects of my code. However, when I converted the CUDA code to OpenCL, the speed up was a lot less (and actually slower in some cases).

I tracked down the differences to being in the host code and found that when I was iterating over a STL vector it was much slower in the C++ host code for the OpenCL project than in the CUDA code for the CUDA project. Therefore, I tried just comparing C++ host code with CUDA host code to see if it makes a difference…and it does.

The simple example is given below:

C++ code:

[codebox]
#include <vector>
#include <stdio.h>
#include <time.h>

void TestCUDA();

void TestCPP()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    clock_t beginTime = clock();

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        // Walk the whole vector on every pass of the outer loop.
        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }

    clock_t endTime = clock();
    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("C++ calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}

int main( int argc, char** argv)
{
    TestCPP();
    TestCUDA();
}
[/codebox]

CUDA code:

[codebox]
#include <vector>
#include <stdio.h>
#include <time.h>

__host__ void TestCUDA()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    clock_t beginTime = clock();

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        // Same walk over the whole vector as in the C++ version.
        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }

    clock_t endTime = clock();
    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("CUDA calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}
[/codebox]

When I run the application I get the following result:

C++ calculation done in 546.00 milliseconds.

CUDA calculation done in 266.00 milliseconds.

As you can see, the code executed in the CUDA __host__ function is identical to the C++ function. I thought that any __host__ code was compiled by the host C++ compiler in exactly the same way as ordinary C++ host code. Why do I get different speeds then?
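One thing I'm going to try next is printing the compiler/STL configuration macros from both translation units, to check whether the .cpp file and the .cu file really end up compiled with the same settings. Something like this (just a sketch; the macro names assume MSVC and might need adjusting for another toolchain):

[codebox]
#include <vector>
#include <stdio.h>

// Call this from both TestCPP() and TestCUDA() and compare the output,
// to see which settings each translation unit was actually built with.
void ReportBuildSettings(const char* tu)
{
#ifdef _DEBUG
    printf("%s: _DEBUG defined\n", tu);
#endif
#ifdef NDEBUG
    printf("%s: NDEBUG defined\n", tu);
#endif
#ifdef _SECURE_SCL
    printf("%s: _SECURE_SCL = %d\n", tu, (int)_SECURE_SCL);
#endif
#ifdef _HAS_ITERATOR_DEBUGGING
    printf("%s: _HAS_ITERATOR_DEBUGGING = %d\n", tu, (int)_HAS_ITERATOR_DEBUGGING);
#endif
#ifdef __CUDACC__
    printf("%s: compiled through nvcc (__CUDACC__ defined)\n", tu);
#endif
}
[/codebox]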

Thanks.

Just thought I'd add that it doesn't matter how long the simulation runs; the code executed in the __host__ function is consistently about twice as fast.
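To be clear about what I mean by changing the simulation length: I just vary simulationTime, roughly like this (a sketch, not the exact code I ran):

[codebox]
#include <vector>
#include <stdio.h>
#include <time.h>

// Sketch: same loop as above, but with the length passed in,
// so the two builds can be compared at several sizes.
void RunTest(unsigned int simulationTime)
{
    std::vector<int> intList;
    for (unsigned int i = 0; i < simulationTime; i++)
    {
        intList.push_back( (int)i );
        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
            ++iter;
    }
}

int main()
{
    unsigned int sizes[] = { 2500, 5000, 10000, 20000 };
    for (int s = 0; s < 4; s++)
    {
        clock_t begin = clock();
        RunTest(sizes[s]);
        clock_t end = clock();
        printf("simulationTime %u: %.2f ms\n", sizes[s],
               float(end - begin) / CLOCKS_PER_SEC * 1000.0f);
    }
    return 0;
}
[/codebox]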

I've just moved the timing code out of the host functions, in case that was working differently in the CUDA build, but now I get an even more marked difference in speed. The C++ code is now:

[codebox]
#include <vector>
#include <stdio.h>
#include <time.h>

void TestCUDA();

void TestCPP()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
}

int main( int argc, char** argv)
{
    clock_t beginTime = clock();
    TestCPP();
    clock_t endTime = clock();
    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("C++ calculation done in %.2f milliseconds.\n", differenceMilliSeconds);

    beginTime = clock();
    TestCUDA();
    endTime = clock();
    differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("CUDA calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}
[/codebox]

and the CUDA code is:

[codebox]
#include <vector>

__host__ void TestCUDA()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    for (unsigned int i = 0; i < simulationTime; i++)
    {
        int intToAdd = i;
        intList.push_back( intToAdd );

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
}
[/codebox]

When I run the application I get the following results!

C++ calculation done in 3812.00 milliseconds.

CUDA calculation done in 281.00 milliseconds.

Why is it so much faster in the CUDA host code than in the C++ host code?