Memory sizes for S1070

Hello everyone,

We just received an S1070, and I am porting C++ simulation code to CUDA. In the process, I am trying to establish strategies for using the various kinds of memory on the device so as to optimize performance. I have asked NVIDIA several times for the complete technical specs on this device, with no success. Here’s what I know.

The S1070 is essentially 4 x C1060, meaning each card has 4 GB of SDRAM, a 64 KB register file (16K x 32-bit registers) per SM x 30 SMs, and 16 KB of shared memory (16 banks x 1 KB) per SM x 30 SMs. No mention is made in the documentation of the size of the caches, nor of the constant or texture memory spaces. I know that constant memory is cached, so it is clearly memory wired separately from SDRAM, but how much is configured? Texture memory access is faster than global, so it could be carved from SDRAM with better mapping, or it could be separate memory. Which is it? Either way, I need to know the limitations that apply to all the types of memory on this device in order to handle this port properly.

I think these are very reasonable questions, and this information should have been included in the specs for the device. Would one of the NVIDIA engineers, or any forum member who has access to this information, please share it with me?

Thank you,

            - R

I think most of what you’re looking for is in the programming guide, the specs, or the SDK samples.

Constant memory is 64 KB in size and costs one cycle per access if all the threads access the same cell.
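
For example, a minimal sketch (the names are made up) that puts a small table in constant memory and has every thread read the same cell:

    #include <cuda_runtime.h>

    __constant__ float coeffs[16];            // lives in the 64 KB constant space

    __global__ void scale(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[0];             // same cell for every thread: broadcast read
    }

    int main()
    {
        const int n = 1024;
        float h[16] = { 2.0f };
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemcpyToSymbol(coeffs, h, sizeof(h));   // fill the constant symbol from the host
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaFree(d);
        return 0;
    }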

Here’s the output of deviceQuery:

Device 0: "GeForce GTX 295"
  Major revision number:                         1
  Minor revision number:                         3
  Total amount of global memory:                 938803200 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.24 GHz
  Concurrent copy and execution:                 Yes

You can also google for such things.

I believe this can’t be done for you; it’s so program/developer dependent. You should experiment yourself and you’ll see what works best for your algorithm. Then you can try to tune it better: change the algorithm, re-calculate instead of reading data from gmem, use shared memory in another way, reconfigure blocks/threads, etc… there are just too many details - that’s the fun of it !!! :)

eyal

Check the CUDA Programming Guide. It has everything you need to know to use all of those memory spaces to their fullest, including the sizes that you say are not there.

The short version (on a single C1060):
Constant memory:
64 KiB total
Cache per multiprocessor: 8 KiB
Optimal use: constant memory is only optimal when every single thread in a warp reads the same address.

Texture memory:
Anything in device memory can be bound to a texture
Texture cache per multiprocessor: ~8 KiB

  • Texture memory is NOT faster than global; it is reading all of its data from device memory, after all.
    The texture cache just lets uncoalesced reads run faster than uncached global memory reads would.
    Optimal use: as the programming guide says, you get optimal performance out of texture memory when all the threads in a warp access values “nearby” in the texture (see the sketch below).
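
Here is a minimal sketch using the texture reference API from CUDA 2.x (all the names are made up); it binds plain device memory to a 1D texture and reads through the texture cache:

    #include <cuda_runtime.h>

    texture<float, 1, cudaReadModeElementType> texRef;   // 1D texture reference

    __global__ void gather(float* out, const int* idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, idx[i]);   // cached read; tolerates uncoalesced access
    }

    int main()
    {
        const int n = 1024;
        float *d_in, *d_out;
        int *d_idx;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMalloc((void**)&d_idx, n * sizeof(int));
        cudaMemset(d_idx, 0, n * sizeof(int));           // demo indices (all zero)
        // Bind existing device memory to the texture reference: no copy is made.
        cudaBindTexture(0, texRef, d_in, n * sizeof(float));
        gather<<<(n + 255) / 256, 256>>>(d_out, d_idx, n);
        cudaUnbindTexture(texRef);
        cudaFree(d_in); cudaFree(d_out); cudaFree(d_idx);
        return 0;
    }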

For the long version: see the CUDA Programming Guide v2.2, section 5.1.2, for information on optimal memory access patterns, and Appendix A for the memory and cache sizes.

Thank you very much, eyalhir74.

I hadn’t noticed that constant memory size (totalConstMem) was one of the fields in the cudaDeviceProp struct, so that answers the question about constant memory. And based on the Dobbs article, you really can allocate as much texture memory from SDRAM as you want, correct?
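
For anyone who finds this thread later, a minimal sketch that pulls those fields straight out of cudaDeviceProp:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0
        printf("constant memory:   %u bytes\n", (unsigned)prop.totalConstMem);
        printf("shared mem/block:  %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("registers/block:   %d\n",       prop.regsPerBlock);
        printf("multiprocessors:   %d\n",       prop.multiProcessorCount);
        return 0;
    }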

- R

Thanks MisterAnderson42,

I disagree that this level of device-specific information is in the programming guide. Specific device configurations are referenced in Appendix A, and the entry for the S1070 gives only the following information: NumberOfMP: 4x30, ComputeCapability: 1.3. To my knowledge there is no guarantee that all devices of compute capability 1.3 have the same specs, only that a guaranteed minimum is provided. Did I miss something?

- R

Take a look at MisterAnderson’s response - he is the real professional guy :)

But yes, basically you can. This is merely a logical binding: you simply bind the texture to a device pointer and then access it as you want (as in MisterAnderson’s sketch above). To get the best performance, try to keep the data accesses nearby, as MisterAnderson specified.

Still, I think that the most important thing in getting performance with the current set of tools is trial and error. That’s the fastest and most straightforward way. Just figure out which code accounts for 80% of your run time and optimize it :)

eyal

Thanks so much to both of you for answering my questions. I’m just getting started with CUDA, and I’m finding it difficult to put together a plan I can have confidence in for modifying this legacy code to take optimal advantage of the S1070. We’ve been running our sims in parallel on 8 Xeon cores, and they now take 4 days or more to complete, hence the need for adding such serious GPU horsepower.

I’ll go back to climbing the CUDA learning curve now.

Thanks again guys,

- R

One additional suggestion: start with a single GPU (don’t work directly with all 4 GPUs - you’ll run into multi-GPU issues :), which you’ll have to learn later on).

Sure enough, if you manage to speed up the code on the GPU, one GPU (1/4 of the S1070) will run much faster than your 8 Xeon cores ;)

edit: good luck with the climbing… it’s a hell of a lot of fun to climb this CUDA/GPU curve… :)

eyal

Yes, that’s good advice. My development machine has just a single C1060, so code development will be done in that environment, but production runs will happen on the dual-Xeon server that is getting the S1070. We’re doing data-parallel runs now, with 8 independent sims running in separate processes. What I plan to do is assign two cores to each C1060, and then design the code so it runs the same no matter how many cores it happens to be running on. My hope is that this approach will help simplify things when accessing multiple GPU devices. I expect most of the complexity to involve keeping the CPU processes separate as they access the same GPU, perhaps with separate streams, but I haven’t gotten nearly that far yet. :unsure:
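
Something like this minimal sketch is what I have in mind for the per-process device assignment (passing the rank in argv is just illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main(int argc, char** argv)
    {
        // Illustrative scheme: each sim process is launched with its rank (0-7)
        // as argv[1]; with four GPUs, two processes end up sharing each C1060.
        int rank = (argc > 1) ? atoi(argv[1]) : 0;
        int deviceCount = 1;
        cudaGetDeviceCount(&deviceCount);
        cudaSetDevice(rank % deviceCount);   // select before doing any other CUDA work
        // ... run this process's simulation kernels here ...
        return 0;
    }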

- R

Section A.1.1 states very clearly all the stats for Compute 1.0, 1.1, 1.2, and 1.3. That is where I got the numbers I put in the post.

Think of the compute capability as defining exactly what the multiprocessor is. There can still be differences between the actual hardware cards that are all compute 1.3. These are:

  • The number of multiprocessors (listed in section A.1 as you quoted)

  • The amount of global memory (listed on the webpage specifications or in deviceQuery)

  • The clock rate (listed on the webpage specifications or in deviceQuery)

Have fun learning CUDA!

And you are very right to start off by making sure you know what the various memories are good for. Maximizing the memory bandwidth of your GPU kernels is the first thing to optimize if you want great performance.
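
As a quick sketch of what coalescing means in practice (illustrative kernels, not from any particular code):

    __global__ void copyCoalesced(float* out, const float* in, int n)
    {
        // Consecutive threads read consecutive addresses, so each warp's
        // 32 reads coalesce into a few wide memory transactions.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    __global__ void copyStrided(float* out, const float* in, int n, int stride)
    {
        // Each thread reads addresses 'stride' elements apart; the warp's
        // reads scatter across memory and effective bandwidth drops sharply.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }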

What really came through to me as the major issues from those excellent ECE498AL lectures by David Kirk and Wen-mei W. Hwu, provided in the education section of this site, were: 1) memory latency and bandwidth issues, 2) maintaining a sufficient number of threads in each block (based on register / shared memory use), and 3) maintaining a sufficient number of blocks in each grid (at least 2 x the number of SMs; see the helper sketched below). I would highly recommend those lectures to other CUDA neophytes.
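
For instance, a tiny helper (my own naming) that applies the 2-blocks-per-SM rule of thumb from the lectures:

    #include <cuda_runtime.h>

    // Rule-of-thumb grid size: at least 2 blocks per SM, so a 30-SM C1060
    // gets a grid of at least 60 blocks.
    int minGridSize(int device)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return 2 * prop.multiProcessorCount;
    }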

- R