Random forest inference in CUDA: managing threads and blocks

I implemented random forest inference in CUDA C, with each tree evaluated by its own thread. My first version launches one block of N threads, where N is the number of trees. My second version launches N blocks of one thread each. I expected the second version to be faster, but the first one turned out to be faster. Is this normal, or would you also have expected the second version to perform better?
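To make the two configurations concrete, here is a minimal sketch of how they could be launched. It is not the real code: the `Stump` struct (a tree reduced to a single threshold test), the kernel name `forest_predict`, the helper `count_votes`, and the sizes `N_TREES` and `N_FEATURES` are all assumptions made up for illustration. Only the launch parameters differ between the two calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this sketch.
const int N_TREES    = 64;
const int N_FEATURES = 4;

// Hypothetical stand-in for a tree: one threshold test on one feature.
struct Stump {
    int   feature;
    float threshold;
};

// Each thread evaluates one tree and stores its vote (0 or 1).
// The global index works for any grid shape that covers n_trees threads.
__global__ void forest_predict(const Stump* trees, const float* sample,
                               int* votes, int n_trees)
{
    int tree = blockIdx.x * blockDim.x + threadIdx.x;
    if (tree < n_trees) {
        Stump s = trees[tree];
        votes[tree] = sample[s.feature] > s.threshold ? 1 : 0;
    }
}

// Runs the forest with the given launch shape and returns the number of
// positive votes, or -1 if a CUDA call failed.
int count_votes(int blocks, int threads_per_block)
{
    Stump h_trees[N_TREES];
    float h_sample[N_FEATURES] = {0.5f, 1.5f, 2.5f, 3.5f};
    for (int i = 0; i < N_TREES; ++i) {
        h_trees[i].feature   = i % N_FEATURES;
        h_trees[i].threshold = 0.1f * i;
    }

    Stump* d_trees  = nullptr;
    float* d_sample = nullptr;
    int*   d_votes  = nullptr;
    cudaMalloc(&d_trees,  sizeof(h_trees));
    cudaMalloc(&d_sample, sizeof(h_sample));
    cudaMalloc(&d_votes,  N_TREES * sizeof(int));
    cudaMemcpy(d_trees,  h_trees,  sizeof(h_trees),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_sample, h_sample, sizeof(h_sample), cudaMemcpyHostToDevice);
    cudaMemset(d_votes, 0, N_TREES * sizeof(int));

    forest_predict<<<blocks, threads_per_block>>>(d_trees, d_sample,
                                                  d_votes, N_TREES);
    cudaError_t launch_err = cudaGetLastError();

    // This blocking copy also waits for the kernel to finish.
    int h_votes[N_TREES];
    cudaError_t copy_err = cudaMemcpy(h_votes, d_votes, sizeof(h_votes),
                                      cudaMemcpyDeviceToHost);
    cudaFree(d_trees);
    cudaFree(d_sample);
    cudaFree(d_votes);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;

    int sum = 0;
    for (int i = 0; i < N_TREES; ++i) sum += h_votes[i];
    return sum;
}

int main()
{
    int one_block   = count_votes(1, N_TREES);  // method 1: 1 block, N threads
    int many_blocks = count_votes(N_TREES, 1);  // method 2: N blocks, 1 thread
    printf("1 block x N threads: %d votes\n", one_block);
    printf("N blocks x 1 thread: %d votes\n", many_blocks);
    return (one_block >= 0 && one_block == many_blocks) ? 0 : 1;
}
```

Built with `nvcc`, both launches should produce the same vote count, since the kernel body is identical and only the grid shape changes. Timing the two calls (for example with CUDA events) would be the place to compare their speed.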