Hello,
I need to determine the number of concurrent threads on a GPU.
I did some calculations and some experiments, and there are discrepancies between the calculated and the measured results.
The purpose of this is to determine the largest problem size that can be processed in a single (effectively instantaneous) kernel launch, without any block/thread replacement.
If I understand it correctly, the number of concurrent threads on my GPU (GeForce 930M, compute capability 5.0) is
SMCount × warpsPerSM × threadsPerWarp = 3 × 64 × 32 = 6144
I suppose the kernel should take the same time for any thread count at or below 6144.
Having put together an extremely simple benchmarking code,
#include "cuda_runtime.h"
#include <fstream>
#include <string>

__global__ void dummyCall()
{
    // volatile so the compiler cannot optimize the busy loop away
    volatile int i = 0;
    while (i < 1000)
    {
        i++;
    }
}

void getTimes()
{
    const unsigned replicationsCount = 20;
    std::ofstream dataFile;
    dataFile.open("time_measures_threads.csv");
    for (int i = 1; i < 7000; i++) // i = threadCount
    {
        float timesSum = 0;
        for (unsigned replications = 0; replications < replicationsCount; replications++)
        {
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            // 1024 threads per block according to the CUDA occupancy calculator;
            // the grid size is the thread count rounded up to whole blocks
            dummyCall<<<(i + 1023) / 1024, 1024>>>();
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0;
            cudaEventElapsedTime(&ms, start, stop);
            timesSum += ms;
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
        dataFile << i << ';' << timesSum / replicationsCount << std::endl;
    }
    dataFile.close();
}

int main()
{
    getTimes();
    return 0;
}
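For completeness, rather than hard-coding the limits, the same number can be queried at run time. A sketch using cudaGetDeviceProperties (note that maxThreadsPerMultiProcessor already folds warpsPerSM × threadsPerWarp together):

```cpp
#include "cuda_runtime.h"
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0

    // maxThreadsPerMultiProcessor = resident warps per SM * 32, so this
    // product is the theoretical number of concurrently resident threads
    int concurrent = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max threads/SM: %d, concurrent threads: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, concurrent);
    return 0;
}
```

On my 930M this should print 3 SMs and 2048 threads per SM, matching the 6144 calculated above.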
The results are not what I expected.
I am getting the same execution time for thread counts <= 3072; from then on, the execution time doubles.
Is there anything I am missing, or could someone explain this phenomenon to me? I would be very thankful.
Thanks, and have a nice day.