In my CFD work I have different types of variables
constant -> declared in constant memory
recalculated -> at the moment all in global
The main variable is the density, rho, of each particle, which is calculate at the start of each iteration after all the particle interactions have been found. At the moment I find all the interactions between particles. This does not lend itself well to using shared memory for rho so that subsequent functions in the iteration can access all values in the rho array. I tried adding a section of code that asks the threads in block to transfer its particle index and all multiples of the blocksize into a shared memory array called shared_rho. This initially took 16% off the execution time and gave the correct solution but after a few executions the solutions were suddenly garbage, so I don’t know what went on there.
With shared memory being so limited and the eventual aim being to simulate milllions of particles this shared memory approach is impractical and I think I’ll start implementing the bucket algorithm, with the bucket size = blocksize, thus shared memory can be efficiently used.
The only other memory I haven’t tried out yet is texture. I’ve had a look at the documentation and don’t fully understand texture.
Can texture memory be used for a CFD calculation in which a variabe array such as rho, and other physical parameters such as pressure, which is recalculated at the start of each iteration and used very frequently by functions later in the iteration? If so, how?
Kernels can’t write to textures, only read from them. You can however write into global memory and then map this global memory to a texture after each kernel, so that the next kernel launch will have a new texture. AFAIR that’s a very fast operation.
You can indeed use texture memory with CFD. It can be really beneficial since there’s usually some spatial locality (ie. a particle is affected by its nearest neighbors) and this can make use of the texture caching logic. That kind of semi-stochastic locality (you may not know exactly how many particles went into certain buckets but you know they all lay somewhere near) may be easier to exploit with textures than using explicit coalescing and shared memory. You only have to worry about coalescing writes to global memory.
When you’re reading some data from a texture and it resides in the texture cache, you’re getting the same speed as if you were reading from shared memory (AFAIK). With data locality you should be getting a lot of cache hits.
Yes, you could do it like this. You could even (probably, I don’t know the exact algorithm) even have kernel1 use textures. The general algorithm would probably look somewhat like this:
Create 3D (2D if you’re doing a 2D solver) Cuda Arrays of initial positions, velocities, densities, pressures etc. Let’s call them cuArrVel, cuArrPress etc. There may be multiple arrays because they can only be of type float/int/uint or 1-, 2- and 4- vectors like float2, float4. You can’t have entire structs with all particle properties as texels. So you’ll have an array of velocities, array of pressures etc.
Map those Cuda Arrays to 3D textures (cudaBindTextureToArray()). 3D textures were introduced in CUDA 2.0. Earlier versions had only 1D and 2D textures… You now have a texture of velocities (texVel), a texture of pressures (texPress) etc.
Have kernels find particle interactions reading data of neighboring particles from the 3D textures (tex3dfetch()) and writing to linear global memory
After the kernel ended, copy the linear global memory to cuArrs and remap the cuArrs to corresponding textures (cudaBindTextureToArray() again) and launch the next kernel
This is all in the Programming Guide. A quick summary here (I’ll try not to mess things up but you might want to confront this with the manual)
You have to define texture references @ file scope (immutable, known at compile time). Those have a general form of
texture<Type, Dim, ReadMode> texRef;
(see PG for explanation)
You can define many texture references:
texture<float, 3, cudaReadModeElementType> texPress;
texture<float, 3, cudaReadModeElementType> texRho;
texture<float4, 3, cudaReadModeElementType> texVel;
//i'm not sure if you can use float4 in 3D textures. if not you will need separate texture references for VelX, VelY, VelZ
In host code you have to set up some parameters of those references:
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cuArrPress, &channelDesc, width, height);
//this is how you set up 2D cuda Arrays, I don't have CUDA 2.0 PG handy
Now you need to assign a region of linear global memory to each cuda Array
cudaMemcpy2DToArray(cuArrPress, 0, 0, devPress, pitch, width * sizeof(float), height, cudaMemcpyDeviceToDevice);
//you will need cudaMallocPitch() or something like that before, again see PG
//as i don't have mine, don't know how you setup 3D cuda arrays
//devPress is the linear global memory array of pressures to which your kernels may write
//you can't write to cuda arrays or textures from the kernel
I have the following code to copy new rho on the device into host rho, which is then copied to cuda array rho on the device, and cuda array rho is then bound to tex_rho
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
CUDA_SAFE_CALL( cudaMallocArray( &cu_rho, &channelDesc, NTOTAL, 1 ));
//set up texture for new rho
sumdensity<<<NUMBLOCKS,NUMTHREADS>>>(d_rho); //write new rho into d_rho
cudaMemcpy(h_rho,d_rho,size_rho,cudaMemcpyDeviceToHost); //get d_rho back from device into h_rho
CUDA_SAFE_CALL( cudaMemcpyToArray( cu_rho, 0, 0, h_rho, size_rho, cudaMemcpyHostToDevice)); //write h_rho into CUDA array cu_rho
// set texture parameters?
cudaBindTextureToArray(tex_rho,cu_rho); //bind cu_rho to tex_rho, kernel1 should now refer to tex_rho, not rho
//end set up of new rho as texture
the code compiles and executes, but I get garbage. with the last error being “invalid argument”
What could be causing this? It would be good to get this working because the execution time was reduced by 20% for just this one variable as a texture. There are about ten other variables I could do this with.
I just figured that if one were addressing textures by x,y,z and those coordinates were some arbitrary indexes in a 3d array, not reflecting physical positions of the particles, there wouldn’t be that much locality. After you start seeing some flow separation, vortexes etc. it’s quite possible, that particles physically in neighborhood would have distant array indexes and vice versa. To counter this and get real locality (=cache hits with textures) you would need to relate particle indexes with their positions, that can be costly (sorting and rearranging)…
As for your above code
You don’t have to copy d_rho to host and then into cu_rho from there, you can keep d_rho on the device and copy to cu_rho using cudaMemcpyToArray() with the parameter cudaMemcpyDeviceToDevice. You’re loosing much time copying back and forth through PCIe.
You might look into simpleTexture example in the SDK to get a hang on how to use textures.
Yep, I use the hilbert space filling curve to sort particles. This improves cache hit rates immensely when threads access physically local particles via texture lookups. For system sizes of 64,000 particles the speedups are ~5x and that goes up for larger systems. With a pre-calculated curve, actually performing the sort is extremely cheap (compared to the rest of my calculation) and particles don’t move very far in one step of MD so the sort stays good for ~500 steps when I resort.
A previous version of the poster mentioned is available for download at http://www.ameslab.gov/hoomd under the APS March meeting news. There is also a reference to our paper in J. Comp. Phys (a pre-print can be downloaded from the website) that discusses the details. Section 2.3 discusses the particle sort.
Which line gives the invalid argument? What are size_rho and NTOTAL? I’m guessing you are hitting the limitations of the texture size which is really small for 1-D array textures. Or it could be your height value. I seem to recall needing a height of 0 for 1-D array textures, but I’m not sure. I always use cudaBindTexture and tex1Dfetch which have much fewer limitations.
Did you really want a 1D texture, though? Isn’t your fluid 2D or 3D?
How is rho to be accessed in your update kernel? I.e. does each thread read one value and those around it or is the access random from thread to thread depending on particle positions or something. If it is the first, then you would be best off using shared memory: check out the image filter SDK examples. If it is the 2nd, I’ll need to know more details in order to make recommendations.
By the way, are cached texture reads as fast as shared memory?
I mean is it alright to have each thread read neighboring particles by
tex3d(ref,x+1, y, z);
tex3d(ref,x, y+1, z);
This would normally be a prime example of uncoalesced reads and would call for having shared memory buffer, have it loaded by separate threads and have each thread read those +1/-1 neighboring cells from the buffer.
Do we need to use such a shared buffer or are cached fetches as fast as shared memory accesses?
I’ve just found MisterAnderson’s post in a different thread
So, would we be better of not using texture memory at all and rely on coalesced global operations?
The measurements I mentioned in that thread are correct, but my conclusions were not. I’ve figured a few things out. 1) The texture units can only provide so many reads/second which can often become a limiting factor, not the device memory bandwidth. 2) Switching to a texture from global memory often increases the number of registers used in the kernel which lowers occupancy and changes performance making it hard to make 1:1 comparisons. and 3) Textures can read faster than the global memory bandwidth if you are accessing a very tiny area in memory.
In other words, it’s complicated. But as I said, the performance measurements I mentioned still stand. In every real-world algorithm I’ve implemented with textures, I’ve never seen > device mem bandwidth results.
As far as your example goes: Using textures for this
tex3d(ref,x+1, y, z);
tex3d(ref,x, y+1, z);
will probably be slower performance than a shared memory version but I don’t know by how much. Honestly, I’d write the first version of the code using the textures because that could be done very quickly (i.e. a matter of hours). If I wasn’t happy with the performance of that I’d try setting up the shared memory version. Maybe it’s not as complicated as I think it is, but getting all the boundary conditions and coalescing right would be tricky in 3D. Of course, there is always the third option: use the textures to populate the shared memory :) Then you aren’t relying on the texture cache to keep the +1 and -1 values around but don’t have to worry about coalescing.