I ran Thrust's inclusive scan and measured its time using both CUDA timing events and computeprof. For 23 million elements both report about 92 ms, but when I measure with the Linux time command, it gives
At first I thought the extra time was due to library linking and loading, but as I increase the input size, it grows in the same manner too. If anybody has an idea why this is happening, please share.
The very first actual CUDA call in your code initialises the context on the card, which can take a while. To reduce this time, you can enable persistence mode on your card by running “nvidia-smi -pm 1” as root (on Linux; I don’t know about other OSs). And to get a better view of the actual time taken by your algorithm, you can exclude this initialisation time (which you pay only once in your code anyway) by calling a function that triggers the context initialisation before doing anything else. One possibility is a call to cudaMalloc that allocates 0 bytes of data, like “cudaMalloc(&a, 0);”.
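
As a rough illustration of the suggestion above, here is a minimal sketch that triggers context initialisation first and then times only the scan with CUDA events. It is not the original poster's code: the element count, the fill value and the variable names are assumptions made for the example, and it uses cudaFree(0), another commonly used way to force initialisation, in place of the zero-byte cudaMalloc.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main() {
    const size_t n = 23000000;  // assumed size, roughly the one in the question

    // Warm-up: force context creation and time it separately on the host.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    double init_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Input data lives on the device; filled with ones for illustration.
    thrust::device_vector<int> d(n, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the scan itself.
    cudaEventRecord(start);
    thrust::inclusive_scan(d.begin(), d.end(), d.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float scan_ms = 0.0f;
    cudaEventElapsedTime(&scan_ms, start, stop);

    std::printf("context initialisation: %.1f ms\n", init_ms);
    std::printf("inclusive_scan:         %.3f ms\n", scan_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Built with nvcc and run under the time command, the wall-clock total should still include the initialisation figure printed on the first line, along with process start-up and the device allocation, none of which the event timing covers.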