So what's new about Maxwell?

Here’s the output of the AIDA64 GPGPU / CUDA page on an MSI GeForce GTX 750 “GM107” Maxwell card:

Device Properties
Device Name: GeForce GTX 750
GPU Code Name: GM107
PCI Domain / Bus / Device: 0 / 1 / 0
Clock Rate: 1137 MHz
Asynchronous Engines: 1
Multiprocessors / Cores: 4 / 512
L2 Cache: 2048 KB
Max Threads Per Multiprocessor: 2048
Max Threads Per Block: 1024
Max Registers Per Block: 65536
Max 32-bit Registers Per Multiprocessor: 65536
Max Instructions Per Kernel: 512 million
Warp Size: 32 threads
Max Block Size: 1024 x 1024 x 64
Max Grid Size: 2147483647 x 65535 x 65535
Max 1D Texture Width: 65536
Max 2D Texture Size: 65536 x 65536
Max 3D Texture Size: 4096 x 4096 x 4096
Max 1D Linear Texture Width: 134217728
Max 2D Linear Texture Size: 65000 x 65000
Max 2D Linear Texture Pitch: 1048544 bytes
Max 1D Layered Texture Width: 16384
Max 1D Layered Texture Layers: 2048
Max Mipmapped 1D Texture Width: 16384
Max Mipmapped 2D Texture Size: 16384 x 16384
Max Cubemap Texture Size: 16384 x 16384
Max Cubemap Layered Texture Size: 16384 x 16384
Max Cubemap Layered Texture Layers: 2046
Max Texture Array Size: 16384 x 16384
Max Texture Array Slices: 2048
Max 1D Surface Width: 65536
Max 2D Surface Size: 65536 x 32768
Max 3D Surface Size: 65536 x 32768 x 2048
Max 1D Layered Surface Width: 65536
Max 1D Layered Surface Layers: 2048
Max 2D Layered Surface Size: 65536 x 32768
Max 2D Layered Surface Layers: 2048
Compute Mode: Default: Multiple contexts allowed per device
Compute Capability: 5.0
CUDA DLL: nvcuda.dll (8.17.13.3489 - nVIDIA ForceWare 334.89)

Memory Properties
Memory Clock: 2505 MHz
Global Memory Bus Width: 128-bit
Total Memory: 1 GB
Total Constant Memory: 64 KB
Max Shared Memory Per Block: 48 KB
Max Shared Memory Per Multiprocessor: 64 KB
Max Memory Pitch: 2147483647 bytes
Texture Alignment: 512 bytes
Texture Pitch Alignment: 32 bytes
Surface Alignment: 512 bytes

Device Features
32-bit Floating-Point Atomic Addition: Supported
32-bit Integer Atomic Operations: Supported
64-bit Integer Atomic Operations: Supported
Caching Globals in L1 Cache: Not Supported
Caching Locals in L1 Cache: Not Supported
Concurrent Kernel Execution: Supported
Concurrent Memory Copy & Execute: Supported
Double-Precision Floating-Point: Supported
ECC: Disabled
Funnel Shift: Supported
Host Memory Mapping: Supported
Integrated Device: No
Managed Memory: Not Supported
Multi-GPU Board: No
Stream Priorities: Not Supported
Surface Functions: Supported
TCC Driver: No
Unified Addressing: No
Warp Vote Functions: Supported
__ballot(): Supported
__syncthreads_and(): Supported
__syncthreads_count(): Supported
__syncthreads_or(): Supported
__threadfence_system(): Supported

So it’s not deviceQuery output, but it’s still good information. AIDA64 is a shareware benchmarking tool.

Max Shared Memory Per Block: 48 KB
Max Shared Memory Per Multiprocessor: 64 KB

Oh, this per-block shared memory limit is disappointing.

Caching Globals in L1 Cache: Not Supported
Caching Locals in L1 Cache: Not Supported

So local memory spills are no longer covered by the L1 cache? What IS covered by L1, then?

Funnel Shift: Supported
Multi-GPU Board: No
Integrated Device: No
TCC Driver: No
Unified Addressing: No

Nice that all of this is now exposed as a device capability. I want a multi-GPU board with Maxwell. But why is unified addressing not available? Maybe it requires a 64-bit system, which Flery may not have had?

__ballot(): Supported
__syncthreads_and(): Supported
__syncthreads_count(): Supported
__syncthreads_or(): Supported
__threadfence_system(): Supported

oh, what’s all this?

I know it’s not deviceQuery, but AIDA64 provides more information on CUDA devices than deviceQuery does ;) Here are the results of deviceQuery:

deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 750”
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Clock rate: 1137 MHz (1.14 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GeForce GTX 750
Result = PASS


BTW, I’m using Windows 7 64-bit SP1 with ForceWare 334.89 WHQL.

AIDA64 OpenCL GPGPU benchmark results for GTX750:

Single-Precision FLOPS (FP32): 1190 GFLOPS
Double-Precision FLOPS (FP64): 37.72 GFLOPS
24-bit Integer IOPS: 399.3 GIOPS
32-bit Integer IOPS: 399.3 GIOPS
64-bit Integer IOPS: 82.76 GIOPS

The GPU core clock was approximately 1175 MHz, but fluctuated due to GPU Boost. The benchmarks confirmed the FP64 rate of 1/32, as well as the 32-bit integer rate of 1/3. The latter is quite an improvement over Kepler.
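As a sanity check on those figures (using the 512 cores reported by deviceQuery and the ~1175 MHz boost clock): 512 cores × 2 FLOP per FMA × 1.175 GHz ≈ 1203 GFLOPS peak FP32, close to the 1190 GFLOPS measured, and 1203 / 32 ≈ 37.6 GFLOPS FP64, matching the 37.72 GFLOPS above.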

So AIDA64 must be lying about unified addressing, because it stated “Unified Addressing: No”.

Unified Addressing is reported differently in AIDA64 because its main binary is 32-bit, while I used the native 64-bit binary for deviceQuery. UVA is only enabled for 64-bit processes, so each report is correct for its own process.
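If anyone wants to verify this on their own setup, here is a minimal sketch using the runtime API (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // unifiedAddressing reflects the calling process: a 32-bit binary
    // prints "No" even on a UVA-capable device and a 64-bit OS
    printf("Unified Addressing (UVA): %s\n", prop.unifiedAddressing ? "Yes" : "No");
    return 0;
}

Build it once as 32-bit and once as 64-bit, and the flag should flip accordingly.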

Output of the 32-bit deviceQuery:

deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 750”
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Clock rate: 1137 MHz (1.14 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GeForce GTX 750
Result = PASS

okay, thanks for posting this!

Why did you not get the Ti model? That’s 1 GB more memory and one more multiprocessor at little extra cost…

It’s for development purposes, so all that mattered was the GM107 chip.

Those features are old ones, supported by all GPUs with a compute capability of at least 2.0 (Fermi and later).
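For reference, a minimal sketch of what the vote/count intrinsics do (the kernel and the predicate here are purely illustrative):

__global__ void voteDemo(const int *in, int *out)
{
    int v = in[threadIdx.x];

    // __ballot(): 32-bit mask of the lanes in this warp whose
    // predicate evaluated to true
    unsigned int mask = __ballot(v > 0);

    // __syncthreads_count(): a barrier that also returns, to every
    // thread, how many threads in the block had a true predicate
    int positives = __syncthreads_count(v > 0);

    if (threadIdx.x == 0) {
        out[0] = __popc(mask);  // count for this thread's warp
        out[1] = positives;     // count for the whole block
    }
}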

OK, I have stayed in the computing stone age far too long. I only discovered Kepler’s warp shuffle a few weeks ago ;) Compute 1.x was all I needed so far.
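For others equally late to the party: the canonical use of warp shuffle is a warp-level reduction that needs no shared memory at all. A sketch (requires compute capability 3.0 or higher):

// sum across a warp without shared memory, using Kepler's shuffle
__device__ int warpSum(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);  // fetch v from lane (laneId + offset)
    return v;  // lane 0 ends up holding the sum of all 32 lanes
}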

So GM107 is sm_50, and that’s good news. Of course, the CUDA 6.0 docs are silent about what sm_50’s differences actually are.

A surprising disappointment is that the shared memory accessible by a block seems to be only 48K, just like Kepler, even though we know GM107 has a full 64K per SMM. Maybe this is just a driver/CUDA issue and we’ll get the full 64K later?

NVCC indicates that sm_50 supports 64KB of shared, so I think @SPWorley is right that it’s a driver/CUDA issue. Compiling kernels with static __shared__ arrays of various sizes (a reproducer is sketched after the results) gives:

Kepler with 64 KB: error, reports max shared is 48 KB
Maxwell with 64 KB: success
Maxwell with 65 KB: error, reports max shared is 64 KB
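In case anyone wants to reproduce it, a kernel along these lines suffices (a sketch; the kernel name and compile flags are my own guess, not the exact code used):

// compile with e.g.: nvcc -arch=sm_50 -cubin smem_test.cu
__global__ void smemTest(char *out)
{
    __shared__ char buf[64 * 1024];  // 64 KB of static shared memory
    buf[threadIdx.x] = (char)threadIdx.x;
    out[threadIdx.x] = buf[threadIdx.x];
}

With -arch=sm_35 the same kernel should make ptxas stop with the 48 KB limit error.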

Keep in mind that this figure was reported by a third-party tool (AIDA64), and it may have been reported incorrectly.

EDIT: oops, deviceQuery stated the same:

Total amount of shared memory per block: 49152 bytes

Does code requiring 64 KB of shared memory per block result in a launch failure when executed?

Christian

After spending some more time reading the high-level description of the SMM, I’m now curious what the implications are of the partitioning of schedulers, registers, and CUDA cores into “processing blocks” (as the GTX 750 Ti whitepaper calls them) within the SMM.

Based on the diagram, I would assume that a warp is bound to a particular processing block for its lifetime and cannot switch to another one. This could have negative consequences for thread blocks whose warps have irregular workloads. Up to now, I’ve assumed there is a negligible performance penalty when warps terminate at different times, as long as the occupancy on the multiprocessor stays high enough.

With this new design, is there a possibility of starving one processing block of an SMM if certain warps terminate early?
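To make “irregular workloads in different warps” concrete, consider a hypothetical kernel where every warp loops a different number of times:

__global__ void irregular(const int *itersPerWarp, int *out)
{
    // each warp gets its own trip count; if warps are pinned to one
    // processing block, warps that finish early leave that block's
    // lanes idle until the rest of the thread block completes
    int iters = itersPerWarp[threadIdx.x / 32];
    int acc = 0;
    for (int i = 0; i < iters; ++i)
        acc += i ^ threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}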

Litecoin is now on par with or better than AMD in efficiency, thanks to Christian’s great work on cgminer ;)

Allanmac and I were talking about this possibility as well. But worrying about this “new problem” makes an assumption even about Kepler’s SMX architecture: are warps really portable between SPs even on Kepler? In fact, Kepler is also partitioned into quadrants, with each quadrant given its own 32 SPs. Each pair of quadrants then shares a partition with another 32 SPs, making a total of 4 × 32 + 2 × 32 = 192 SPs per SMX. So each scheduler has its own dedicated, non-portable warps. Every cycle, the scheduler issues to its own SPs, and potentially dual-issues to the shared ALUs.

If that model is correct (as it seems to be), then the first-generation Maxwell SMM is a simplification of Kepler’s SMX architecture. The GM107 whitepaper supports this model, explicitly saying that Maxwell eliminates Kepler’s non-power-of-two SP allocation “with some that are shared.”

So if we haven’t even noticed poor performance on Kepler due to warp/SP quadrant partitions, we probably won’t be impacted much by it on Maxwell either.

Huh, well, actually, now that tells me I should probably have been looking for this effect in some photon Monte Carlo code I was trying to optimize a year ago. :)

Good to know, though.

“Does the code requiring 64kb per block result in a launch failure, when executed?”

Yes, it fails, but silently. I tested this

__shared__ char temp[49153];  // one byte over the 48 KB per-block limit
temp[49152] = 0;
printf("x");

which compiles but doesn’t run with driver 334.89. However, cudaThreadSynchronize() doesn’t return any error.
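Checking cudaGetLastError() immediately after the launch, rather than only after synchronizing, should surface such a failure. A sketch, assuming the snippet above is wrapped in a kernel called test:

test<<<1, 1>>>();
cudaError_t err = cudaGetLastError();  // catches launch-time errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));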

Has anyone gotten either Visual Profiler or Nsight for Visual Studio to display performance counters?
My Visual Profiler is crashing with an internal error, and the performance counters for Maxwell are grayed out in Nsight.

That’s disappointing. Who wouldn’t want an extra 16KB of shared? :)

Does it fail if you allocate the shared memory dynamically?

That is:

extern __shared__ u8 temp[];

and execute with:

kernel<<<1,1,49153>>>(...)

( Can’t wait for my 750 Ti to arrive this Friday. )