I have a function f() which works on 3 MB of (constant) data and has some input parameters.
I need to do a Monte Carlo with the input parameters of f(), so I do not want (and maybe can't) split the 3 MB of data among threads.
Therefore I need to implement a farm: replicate the function f() and run it with different parameters, but have every copy work on the same 3 MB of data.
I'm worried about memory access latency. Can every thread read the same data from DRAM in parallel? If not, what should I do?
The execution time of f() has very low variance, so it would not be a problem to synchronize the threads frequently if that optimizes memory access.
The 3 MB of data is accessed sequentially, so it would be possible to call __syncthreads() after each block finishes using a small chunk staged in shared memory.
Also, I do not need to write the program for a generic GPU; I already know the specs of the GPU I'm using before execution.
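To make it concrete, the structure I have in mind is roughly this (only a sketch; the parameter layout and the body of f() are placeholders):

```cuda
// Sketch only: every thread runs f() with its own Monte Carlo parameter,
// while all threads stream through the same (constant) data array.
__global__ void farm_kernel(const float *data, int dataLen,
                            const float *params, float *results, int nSim)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nSim) return;

    float p = params[tid];      // this thread's parameter set (placeholder)
    float acc = 0.0f;

    // sequential pass over the shared 3 MB data set
    for (int i = 0; i < dataLen; ++i)
        acc += data[i] * p;     // placeholder for the real work of f()

    results[tid] = acc;
}
```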
Yes, that is a good use case for CUDA. What GPU are you using? On a compute capability 1.x device, you should use a texture for reading the data, so that all currently running blocks share the data via the cache hierarchy. I don’t think that manually staging the data via shared memory increases speed, unless you can manage to reuse data from shared memory multiple times per thread.
You'll need to experiment to see whether adding some __syncthreads() increases speed (you cannot sync between blocks anyway, so it's unclear whether partially syncing all currently running threads helps or not).
Indeed, that's why I was afraid it could be useless.
However, maybe I could do something like this when I finish using the current chunk of data in shared memory (for each block):
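Roughly (only a sketch; CHUNK and the names are placeholders):

```cuda
// Sketch only: each block stages successive chunks of the big array into
// shared memory, synchronizing before and after each refill.
#define CHUNK 1024   // floats per stage (4 KB, fits the 16 KB of shared memory)

__global__ void farm_staged(const float *data, int dataLen,
                            const float *params, float *results)
{
    __shared__ float buf[CHUNK];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float p = params[tid];
    float acc = 0.0f;

    for (int base = 0; base < dataLen; base += CHUNK) {
        // all threads of the block cooperate in loading the next chunk
        for (int i = threadIdx.x; base + i < dataLen && i < CHUNK; i += blockDim.x)
            buf[i] = data[base + i];
        __syncthreads();                      // chunk is ready

        int n = min(CHUNK, dataLen - base);
        for (int i = 0; i < n; ++i)
            acc += buf[i] * p;                // placeholder for the real f()
        __syncthreads();                      // everyone done before the next refill
    }
    results[tid] = acc;
}
```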
I don’t know if it is worth it.
How many cycles does a shared memory read require? And a cache read?
I have a laptop and a desktop computer.
Here is the “Device query” result:
1) Laptop:
Device 0: “GeForce 9600M GT”
CUDA Driver Version / Runtime Version 4.10 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 4) Multiprocessors x ( 8) CUDA Cores/MP: 32 CUDA Cores
GPU Clock Speed: 1.25 GHz
Memory Clock rate: 400.00 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
2) Desktop computer:
Device 0: “GeForce 315”
CUDA Driver Version / Runtime Version 4.10 / 4.0
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 512 MBytes (536870912 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock Speed: 1.10 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Sorry to bother, but I have some other questions:
Where can I see how big the cache is?
What are the commands for writing to texture memory from the CPU and reading it from the GPU?
The cache size is documented to be between 6 and 8 KB per multiprocessor, although reverse engineering found it to be only 5 KB per SM. Textures are documented in section 3.2.10 of the Programming Guide, but you should read the whole document before programming CUDA anyway.
On your compute capability 1.x devices texture reads are the only way to share global memory bandwidth between different blocks, so you should definitely use a texture. Once you do that, I don’t expect any further speedup from using shared memory for purely sequential reading of the data.
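For illustration, binding the array to a 1D texture on linear device memory could look roughly like this (identifiers are placeholders; see the Programming Guide for the full API):

```cuda
#include <cuda_runtime.h>

// Texture reference must be declared at file scope (CUDA 4.x style API).
texture<float, 1, cudaReadModeElementType> texData;

__global__ void farm_tex(const float *params, float *results, int dataLen, int nSim)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nSim) return;
    float p = params[tid];
    float acc = 0.0f;
    for (int i = 0; i < dataLen; ++i)
        acc += tex1Dfetch(texData, i) * p;    // cached read of the shared data
    results[tid] = acc;
}

// Host side: copy the 3 MB array to the device once and bind it to the texture.
void launch_farm(const float *h_data, int nFloats,
                 const float *d_params, float *d_results, int nSim)
{
    float *d_data;
    cudaMalloc((void **)&d_data, nFloats * sizeof(float));
    cudaMemcpy(d_data, h_data, nFloats * sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, texData, d_data, nFloats * sizeof(float));

    int block = 256;
    int grid = (nSim + block - 1) / block;
    farm_tex<<<grid, block>>>(d_params, d_results, nFloats, nSim);

    cudaUnbindTexture(texData);
    cudaFree(d_data);
}
```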
On compute capability 1.x shared memory access for one operand reduces throughput from 1 instruction every 4 cycles to 1 instruction every 6 cycles and increases latency from 24 to about 36 cycles. A second operand from shared memory requires an extra 6 cycles for a separate instruction. On compute capability 2.x devices, shared memory accesses are slower.
On the GTX 590, you will have to distribute the work to both devices manually.