Thrust reduce_by_key not as fast as expected

I have the following code:

thrust::device_vector<int> unique_idxs(N);
thrust::device_vector<int> sizes(N);
// Count how many times each index occurs: every element contributes a value of 1.
thrust::pair<thrust::device_vector<int>::iterator,
             thrust::device_vector<int>::iterator> new_end =
    thrust::reduce_by_key(idxs.begin(), idxs.end(),
                          thrust::make_constant_iterator(1),
                          unique_idxs.begin(), sizes.begin());
int unique_elems = new_end.first - unique_idxs.begin();
sizes.erase(new_end.second, sizes.end());

Here, idxs is a sorted device vector of indices, unique_idxs receives the unique indices, and sizes receives the number of occurrences of each index.
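To make the expected output concrete, here is a minimal host-side sketch of what the call should produce for a sorted input. It uses plain std::vector rather than Thrust, and the function name count_runs is hypothetical (it is not part of any library); it is only meant as a reference for checking the device results on small inputs.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Thrust): for a sorted input, returns the
// unique keys and how many times each one occurs, i.e. the result that
// reduce_by_key with a constant value of 1 is expected to produce.
std::pair<std::vector<int>, std::vector<int>>
count_runs(const std::vector<int>& sorted_idxs)
{
    std::vector<int> unique_idxs;
    std::vector<int> sizes;
    for (std::size_t i = 0; i < sorted_idxs.size(); ++i) {
        if (i == 0 || sorted_idxs[i] != sorted_idxs[i - 1]) {
            // A new run of equal keys starts here.
            unique_idxs.push_back(sorted_idxs[i]);
            sizes.push_back(1);
        } else {
            // Same key as the previous element: extend the current run.
            ++sizes.back();
        }
    }
    return {unique_idxs, sizes};
}
```

Comparing this against the device output copied back to the host for a small N may help separate a correctness problem from a timing problem.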

When timing my program, I found that this operation takes a long time compared to other operations that handle the same amount of data or more, e.g. sorting the initial array to obtain idxs. Is there any way to speed it up?

This part also causes an NVIDIA kernel mode driver crash when idxs grows to more than 500k elements.

I found this presentation about Thrust, and I think I am doing exactly what it describes on page 38, which is supposed to run in milliseconds even for 10M points. I even tested it on a GTX 480: the reduce_by_key step takes about 1.2 seconds, while sorting with a custom functor takes approximately 90 ms for 500k indices.