What's the proper way to detect SP/CUDA cores count per SM?

empty_knapsack · July 12, 2010, 10:42am

I’m using cuDeviceGetAttribute(&cuattr, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, cuDevice); to get the SM count. Prior to Fermi it was simple to translate that into an SP count: multiply by 8. After the Fermi release I added a check: if ComputeCaps >= 2.0 then SP = SM * 32. But now, with the GTX 460 release and its 48 CUDA cores per SM, it needs yet another check.

So, is there any CU_DEVICE_CUDA_CORES_PER_MULTIPROCESSOR attribute? Or any other way to detect the number of SPs per SM that will also work for upcoming GPUs?

Also, it looks like CU_DEVICE_ATTRIBUTE_CLOCK_RATE reports the wrong clock for the GTX 470 (810 MHz), but that’s a different story…

tmurray · July 12, 2010, 4:18pm

Why do you care about SP count?

SPWorley · July 12, 2010, 4:39pm

I have the same question as Mr. Knapsack.

My code almost always runs on multi-GPU, often with a mix of cards in one machine. The machine I’m typing on now has a GT240, a GTX295, and a GTX480.

Before the core simulation occurs, I need to do some preprocessing and setup of data structures… sorting geometry into buckets, determining voxel data, etc. That preprocessing happens on the GPU. Every GPU needs a copy of this data. I could just have each GPU redundantly compute the same data itself, but it’s more efficient and faster to have one GPU do the work then just share the results with everyone else, and everyone starts the real compute. That’s more efficient than the redundant computes since otherwise the GT240 would be still preprocessing for many minutes while the faster cards were already working.

I examine the properties of each device, and I try to figure out the fastest card and elect it to be the preprocessor.

I do this by looking at compute level (2.0 wins). If two or more cards are 2.0, then the clock rate times SM count is the tiebreaker.

GTX460 breaks this heuristic because of the 48 SPs per SM. If the 460 is faster than the 470 for CUDA (quite possible, we need to bench it!) the clock * SM strategy will pick the wrong card.
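
The election heuristic described above can be sketched in plain C++ with no CUDA dependency. This is only an illustration under stated assumptions: the `DeviceInfo` struct, its field names, and the `electPreprocessor` function are hypothetical, and the per-SM core count is assumed to be supplied by the caller (for example from a table keyed on compute capability). Comparing only the major compute version, so that 2.0 and 2.1 parts are tie-broken on throughput, is a design choice of this sketch rather than something stated in the post.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-device record; in real code these values would be filled
// in from the driver or runtime API.
struct DeviceInfo {
    int major;       // compute capability, major part
    int minor;       // compute capability, minor part
    int smCount;     // number of multiprocessors
    int clockKHz;    // shader clock in kHz
    int coresPerSM;  // assumed to be known by the caller, e.g. 8, 32 or 48
};

// Rough ALU throughput proxy: clock * SM count * cores per SM.
inline long long throughputScore(const DeviceInfo& d) {
    assert(d.coresPerSM > 0);
    return static_cast<long long>(d.clockKHz) * d.smCount * d.coresPerSM;
}

// Returns the index of the device elected as preprocessor, or -1 if the
// list is empty. A higher major compute version wins outright; within the
// same major version the throughput proxy is the tiebreaker.
inline int electPreprocessor(const std::vector<DeviceInfo>& devs) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(devs.size()); ++i) {
        if (best < 0) { best = i; continue; }
        const DeviceInfo& cand = devs[i];
        const DeviceInfo& cur = devs[best];
        if (cand.major > cur.major ||
            (cand.major == cur.major &&
             throughputScore(cand) > throughputScore(cur))) {
            best = i;
        }
    }
    return best;
}
```

With `coresPerSM` in the product, two Fermi-class cards with the same clock and SM count no longer tie: the one with 48 cores per SM scores higher than the one with 32. Whether that proxy matches real CUDA performance would still need benchmarking.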

This is a really minor concern to be honest, but it’s interesting to bring up now that the performances of the Fermi derivatives are not as simply characterized by just SM count and clock rate.

empty_knapsack · July 12, 2010, 6:14pm

Given the SP count and frequency, I can predict final performance without actually running any GPU kernels. As my code is totally ALU bound, these predictions are very close to reality, and with this information I can adjust buffer sizes to get the best speed possible.

Also, some users don’t like seeing numbers like “GTX 470 / 112 SP” in the configuration panel; they start to think that the software will use only 1/4 of the available processing power. With the GTX 460 I suspect this question will arise again: why is it “only 224 SP”?

Anyway, it’d be nice to have some standard query to get this information. Right now the only solution I have is to parse the cuDeviceGetName() output to find “GTX 460” in it, and obviously I don’t like that much.

tmurray · July 12, 2010, 6:16pm

Compute 2.0 devices have 32 SPs per SM, Compute 2.1 devices have 48 SPs per SM.

cbuchner1 · July 12, 2010, 8:38pm

And CUDA still has no API to query that number (32 or 48, respectively). Please add it, and you’ll have some happy forum campers. The problem is we can’t know ahead of time what this multiplier will be for Compute 2.2 and future revisions.
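
In the absence of such a query, one workaround is a lookup table keyed on compute capability, using only the figures given in this thread (8 per SM before Fermi, 32 for Compute 2.0, 48 for Compute 2.1). The sketch below is plain C++ with hypothetical names (`SmCoresEntry`, `coresPerSM`, `totalCores`); it returns -1 for any compute capability it does not know, which is exactly the limitation raised above for future revisions.

```cpp
#include <cassert>

// One table row: compute capability -> SPs (CUDA cores) per SM.
struct SmCoresEntry { int major; int minor; int cores; };

// Returns the number of SPs per SM for the compute capabilities mentioned
// in this thread, or -1 if the capability is not in the table.
inline int coresPerSM(int major, int minor) {
    static const SmCoresEntry table[] = {
        {1, 0, 8}, {1, 1, 8}, {1, 2, 8}, {1, 3, 8},  // pre-Fermi
        {2, 0, 32},                                  // Compute 2.0
        {2, 1, 48},                                  // Compute 2.1
    };
    for (const SmCoresEntry& e : table) {
        if (e.major == major && e.minor == minor) return e.cores;
    }
    return -1;  // unknown: the caller must decide how to fall back
}

// Total SP count for a device, or -1 if the per-SM count is unknown.
inline int totalCores(int major, int minor, int smCount) {
    assert(smCount >= 0);
    const int perSM = coresPerSM(major, minor);
    return perSM < 0 ? -1 : perSM * smCount;
}
```

For a 14-SM Compute 2.0 device this gives 448 rather than the 112 produced by a fixed multiplier of 8, and for a 7-SM Compute 2.1 device it gives 336 rather than 224. The table has to be extended by hand whenever a new compute capability appears.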

tmurray · July 12, 2010, 9:15pm

blargh, I have to add a device attribute this afternoon anyway, maybe I’ll do that then

seibert · July 12, 2010, 9:43pm

So can we have (collecting recent requests):

Number of SPs
Peak shader clock
Current shader clock
Memory clock
Width of memory bus (or # of 64-bit channels, or whatever)
Amount of L2 cache?

Please? :)

Edit: Of course, we already have a slot for shader clock, but this would be making a new one and clarifying peak vs. current.

tmurray · July 12, 2010, 10:35pm

shader clock reported by driver is always peak
memclock is coming soon
L2 is something I don’t want to touch right now :)
width of memory bus is something I may consider at some point

seibert · July 12, 2010, 10:59pm

Right, but an additional one showing the current clock rate would be handy to know when the driver has downclocked the GPU. (As a nice precursor to the API call to force the GPU to downclock on purpose.)

wlangdon · January 5, 2015, 9:06pm

In the CUDA 6.0 samples there is a header, helper_cuda.h, which contains a routine
_ConvertSMVer2Cores(int major, int minor) that takes the compute capability level
of the GPU and returns the number of cores (stream processors) in each SM or SMX.
You then have to multiply this by the number of streaming multiprocessors in your GPU.
This can be found via cudaGetDeviceProperties().
The CUDA 6.0 sample deviceQuery.cpp has example code to do this:

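#include "helper_cuda.h"  // from the CUDA samples; declares _ConvertSMVer2Cores()

cudaSetDevice(dev);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// cores per SM for this compute capability, times the number of SMs
int totalCores = _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor) * deviceProp.multiProcessorCount;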

Bill

mvaladas · August 31, 2020, 10:40pm

I’m a newbie, but just to expand my knowledge: aren’t SPs an AMD term for what NVIDIA calls cores on its GPUs? Or was the terminology different in 2010?
