Hello,
I have been looking into using CUDA and OpenCL for a mathematical simulation model. I originally used CUDA, which gave me a speedup of up to 6 times for certain parts of my code. However, when I converted the CUDA code to OpenCL, the speedup was much smaller, and in some cases the code actually ran slower.
I tracked the difference down to the host code: iterating over an STL vector was much slower in the C++ host code of the OpenCL project than in the host code of the CUDA project. I therefore compared plain C++ host code with CUDA host code to see whether it makes a difference, and it does.
The simple example is given below:
C++ code:
[codebox]
#include <stdio.h>   // printf
#include <time.h>    // clock, CLOCKS_PER_SEC
#include <vector>

void TestCUDA();  // defined in the CUDA source file

void TestCPP()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    clock_t beginTime = clock();
    for (unsigned int i = 0; i < simulationTime; i++)
    {
        // Append one element, then walk the whole vector with an iterator.
        int intToAdd = i;
        intList.push_back(intToAdd);

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
    clock_t endTime = clock();

    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("C++ calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}

int main(int argc, char** argv)
{
    TestCPP();
    TestCUDA();
    return 0;
}[/codebox]
CUDA code:
[codebox]
#include <stdio.h>   // printf
#include <time.h>    // clock, CLOCKS_PER_SEC
#include <vector>

__host__ void TestCUDA()
{
    std::vector<int> intList;
    unsigned int simulationTime = 10000;

    clock_t beginTime = clock();
    for (unsigned int i = 0; i < simulationTime; i++)
    {
        // Same work as TestCPP: append one element, then walk the whole vector.
        int intToAdd = i;
        intList.push_back(intToAdd);

        std::vector<int>::iterator iter = intList.begin();
        while (iter != intList.end())
        {
            ++iter;
        }
    }
    clock_t endTime = clock();

    float differenceMilliSeconds = float(endTime - beginTime) / CLOCKS_PER_SEC * 1000.0f;
    printf("CUDA calculation done in %.2f milliseconds.\n", differenceMilliSeconds);
}[/codebox]
When I run the application I get the following result:
C++ calculation done in 546.00 milliseconds.
CUDA calculation done in 266.00 milliseconds.
As you can see, the code executed in the CUDA __host__ function is identical to the C++ function. I thought that any __host__ code was compiled by the C++ compiler in the same way as any other host C++ code. Why do I get different speeds, then?
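In case it helps anyone reproduce this, here is a minimal sketch of the same loop written so it can be pasted unchanged into both the C++ file and the CUDA file. It is only a sketch under a few assumptions: the names walkVector and timeWalkMs are made up for illustration and are not part of the projects above, and it needs a C++11 compiler for std::chrono. It uses steady_clock because clock() can have a coarse resolution on some platforms, and it returns the number of iterator steps so the result can be checked in both builds.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical helper: repeats the push_back-then-walk loop from the post
// and returns the total number of iterator steps taken.
unsigned long long walkVector(unsigned int simulationTime)
{
    std::vector<int> intList;
    unsigned long long steps = 0;
    for (unsigned int i = 0; i < simulationTime; ++i)
    {
        intList.push_back(static_cast<int>(i));
        for (std::vector<int>::iterator iter = intList.begin();
             iter != intList.end(); ++iter)
        {
            ++steps;  // one step per element visited
        }
    }
    return steps;
}

// Hypothetical helper: times walkVector with a monotonic clock, prints the
// result, and returns the elapsed time in milliseconds.
double timeWalkMs(unsigned int simulationTime)
{
    const std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    const unsigned long long steps = walkVector(simulationTime);
    const std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(end - begin).count();
    std::printf("%llu iterator steps in %.2f milliseconds.\n", steps, ms);
    return ms;
}
```

Calling timeWalkMs(10000) from TestCPP() and from TestCUDA() should report the same step count in both builds, so any remaining gap in the timings would have to come from how each file is compiled rather than from the code itself.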
Thanks.