I’m working on a real-time finite element solver. So far I’m using two kernels: the first loops over elements to compute the nodal forces, and the second re-uses those forces, looping over nodes to compute the nodal displacements. Therefore in my first kernel I need to write into a node-sized array even though the kernel runs per element. Here is what my code looks like:
__global__ void Kernel1(float* Force_array)
{
/// Texture index
int th = blockIdx.x*blockDim.x + threadIdx.x;
/// Node list for the current element (contains node numbers for each element)
int4 ElNodes = texfetch(ElNodeInd_ref, th);
float Force_temp[12];
// ... nice calculations that fill the local array Force_temp ... and at the end:
/**
* Writes the result in global memory
*/
// First node
Force_array[4*ElNodes.x] += Force_temp[0];
Force_array[4*ElNodes.x+1] += Force_temp[1];
Force_array[4*ElNodes.x+2] += Force_temp[2];
// Second node
Force_array[4*ElNodes.y] += Force_temp[3];
Force_array[4*ElNodes.y+1] += Force_temp[4];
Force_array[4*ElNodes.y+2] += Force_temp[5];
// Third node
Force_array[4*ElNodes.z] += Force_temp[6];
Force_array[4*ElNodes.z+1] += Force_temp[7];
Force_array[4*ElNodes.z+2] += Force_temp[8];
// Fourth node
Force_array[4*ElNodes.w] += Force_temp[9];
Force_array[4*ElNodes.w+1] += Force_temp[10];
Force_array[4*ElNodes.w+2] += Force_temp[11];
}
Force_array is an array allocated with cudaMalloc from the host and Force_temp is a temporary array created in my kernel.
I’m fully aware that different threads might try to write to the same address at the same time. Indeed, two threads (two elements) may share a node, and they will try to write their results to the memory address representing this node at the same time. But in the Programming Guide Version 0.8.2, section 3.2, we can read:
The order is not defined, but since I sum the numbers I don’t care about the order. It’s true that I’m not sure every write will occur, but in my understanding it shouldn’t crash, at least.
But when I launch this kernel I randomly get this error back from cudaGetErrorString( cudaGetLastError() ): unspecified driver error. And I do mean randomly: sometimes I don’t get any error at all, and I’m not in the situation where the first call works fine and the following ones don’t. If I keep trying, after 3 or 4 calls for instance it will work again. If I don’t do any OpenGL rendering of my results (basically I send the nodal displacements I computed and move an object), I get that error and that’s it. But if I do render, I barely have time to see the error messages before my computer has frozen and I have to reboot. Everything works fine if I’m not using scatter operations.
When I looked for that error message in the forums I found people saying they got it when trying to run the CUDA examples on a 64-bit processor. I have a Dell Precision 690 with a dual-core Xeon processor (64-bit), a GeForce 9800GTX, and I’m running Kubuntu.
So my question is: why do I get this error? Is it because of the undefined behaviour of the scatter operations? Or is it because I’m using a 64-bit processor with a 32-bit driver and it doesn’t always work properly? Or perhaps a bit of both?
Any help would be very appreciated. Thanks for your time.
It doesn’t matter whether the processor is 64-bit or not, but you need to run a 32-bit OS kernel, i.e. standard WinXP or 32-bit Linux. On your Ubuntu box, check that ‘uname -a’ does not say x86_64 anywhere.
I get “unspecified driver error” sometimes when I write far past the end of a shared mem or global mem array. If I only write a little past the end of an array, I usually just get “unspecified launch failure” or it just runs without problems. Reading past the end of such an array seems to have no such consequences.
Your “summing” scatter won’t necessarily perform as you imagine. Here is an example: threads 1 and 2 go to add their force to the same global mem location. Both would read in the SAME value for the +=, add their value to it and write. See how you would end up with one or the other, not the sum of both? And given that there is probably a staging time in the memory controller before the value is actually written, there will likely be a very large window during which the value still hasn’t been updated.
In general (as I understand it), you MUST assume that any single bit of global mem written by one kernel CANNOT be read and known valid until that kernel has finished executing.
Regardless of the read race condition you have, scatter into global mem can be done, but is very, very slow: ~1 GB/s or less. In part of my application, a scatter implementation on the GPU is actually ~2 times SLOWER than the CPU calculation. Re-implemented as a gather problem, it is 50 times faster. That is not to say scatter is never worth it; if you are compute bound and not memory bound, you can get away with it. Another piece of my app needs to compute ~150 GFLOPS just to feed 0.1 GB/s of mem transfer, and there the scatter I’m using doesn’t seem to affect the performance.
I would highly suggest trying to think of a way to recast your problem as a gather on force points. Couldn’t you just invert your table to be able to look up each node needed given the point where the force will be summed? Make sure when you do this that you have fully coalesced memory writes (see the guide). Fully coalesced vs non-coalesced can change performance by a factor of 20 or more.
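Roughly, the gather recast could look like the sketch below. The inverted-connectivity arrays (NodeElemStart, NodeElemList, NodeLocalIdx) and the ElemForce layout are names I’m inventing for illustration; the assumption is that Kernel1 first writes its 12 per-element force components to ElemForce (one slot per element, so no write conflicts) and this second kernel then sums them per node:

// Hypothetical gather version: one thread per NODE; each thread reads the
// per-element force contributions and sums them locally, so no two threads
// ever write the same address. The NodeElem* arrays are an inverted
// connectivity table assumed to be built once on the CPU.
__global__ void GatherForces(const int*   NodeElemStart,  // CSR-style offsets, size NumNodes+1
                             const int*   NodeElemList,   // element indices touching each node
                             const int*   NodeLocalIdx,   // local node number (0..3) inside that element
                             const float* ElemForce,      // per-element forces, 12 floats per element
                             float*       Force_array,    // output, 4 floats per node
                             int          NumNodes)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= NumNodes) return;

    float fx = 0.0f, fy = 0.0f, fz = 0.0f;
    for (int k = NodeElemStart[node]; k < NodeElemStart[node + 1]; ++k) {
        int e   = NodeElemList[k];
        int loc = NodeLocalIdx[k];                 // which of the element's 4 nodes we are
        fx += ElemForce[12 * e + 3 * loc + 0];
        fy += ElemForce[12 * e + 3 * loc + 1];
        fz += ElemForce[12 * e + 3 * loc + 2];
    }
    Force_array[4 * node]     = fx;
    Force_array[4 * node + 1] = fy;
    Force_array[4 * node + 2] = fz;
}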
Oh, and one more tip. My application computes forces too, and I’ve found that higher performance can be achieved by storing the x, y, and z components each in a separate array packed with stride 1. This makes it dirt simple to get full coalescing, and it’s even faster than fully coalescing float4 reads/writes in my tests. To make things manageable, just pass in a struct that contains the three pointers.
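For example (the struct and variable names here are just invented for the sketch):

// Hypothetical structure-of-arrays layout: the x, y, z force components live
// in three separate, contiguous arrays, so thread i touches element i of each
// array and the writes coalesce naturally.
struct ForceArrays {
    float* x;
    float* y;
    float* z;
};

__global__ void WriteForces(ForceArrays f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        // ... compute fx, fy, fz for entity i ...
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;   // placeholder values
        f.x[i] = fx;                              // stride-1, fully coalesced
        f.y[i] = fy;
        f.z[i] = fz;
    }
}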
On a related note, the GeForce 8600 and 8500 Series support read-modify-write atomic operations to global memory: The read-modify-write is guaranteed to be performed without interference from other threads.
These atomic operations will be exposed in the next toolkit release.
They have the same low-bandwidth drawback as regular global writes, so all of MisterAnderson42’s tips remain relevant, but at least they can be used to make the algorithm above functionally correct.
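On hardware that supports them, a rough sketch of how that could look once the intrinsic is exposed (I’m assuming a name like atomicAdd operating on 32-bit integers, so the forces would have to be kept in fixed point; SCALE and the argument names are invented):

// Hedged sketch: per-node force accumulation with an atomic read-modify-write.
// The atomic here only covers 32-bit integers, so forces are accumulated in
// fixed point (SCALE is an arbitrary choice for illustration).
#define SCALE 1024.0f

__global__ void ScatterForcesAtomic(const int4* ElNodes, int* Force_fixed, int NumEls)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= NumEls) return;

    int4 n = ElNodes[e];
    // ... compute Force_temp[12] for this element ...
    float Force_temp[12] = {0};                            // placeholder values
    atomicAdd(&Force_fixed[4 * n.x],     (int)(Force_temp[0] * SCALE));
    atomicAdd(&Force_fixed[4 * n.x + 1], (int)(Force_temp[1] * SCALE));
    atomicAdd(&Force_fixed[4 * n.x + 2], (int)(Force_temp[2] * SCALE));
    // ... likewise for n.y, n.z and n.w ...
}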
Yes, I use a 32-bit Linux. The CUDA driver doesn’t exist for 64-bit versions anyway. I was just wondering if it could be linked to that in some fashion (although I didn’t really see how, but after a while you’re kind of desperate and you consider everything). And yes, they’re nice to me at work: they put a beautiful 9800GTX inside a lovely box :)
MisterAnderson42>
I checked the addresses I was trying to access. I was sure the node numbers I was using (the ElNodes values) were correct (I had already checked them, of course). But after deeper testing I noticed that my results varied with the size of my object. Depending on the number of elements in my object, sometimes the ElNodes array is filled in correctly and sometimes it is not. Actually it’s the texture fetching that doesn’t work properly, because the array allocated on the GPU with cudaMalloc is correct (I read it back with cudaMemcpy into a float array on the host and print it). But even with code as simple as this:
__global__ void Kernel1(int* array)
{
/// Texture index
int th = blockIdx.x*blockDim.x + threadIdx.x;
/// Node list for the current element
int4 ElNodes = texfetch(ElNodeInd_ref,th);
/**
* Writes the result in global memory
*/
array[4*th] = ElNodes.x;
array[4*th+1] = ElNodes.y;
array[4*th+2] = ElNodes.z;
array[4*th+3] = ElNodes.w;
}
The texture fetching doesn’t work correctly every time. It depends on the number of elements (and so on the number of threads). Sometimes I get 0 or a huge number everywhere, sometimes everything works, and sometimes it works partially while the rest is zero or a huge number. It’s weird.
I actually want the sum over my elements, so that’s good. But since I’m not sure every write will occur, I might get an incorrect value.
Therefore, since I’m not certain to get an accurate result, I guess I will have to re-implement my algorithm in a gather fashion. According to you it should be faster anyway.
Thanks for the tip
Cyril>
When you say “these atomic operations will be exposed in the next toolkit release”, it means they don’t work so far on a 9800GTX, so I might get bad results. Any idea about a release date for the next version of CUDA? It would really be something interesting to have…
Will these atomic read-modify-write operations be available on the G80? And can you tell us what the performance penalty is for the operation compared to a non-atomic read-modify-write of global memory?
Interesting. How are you setting up the data for texfetching? You mention cudaMalloc, so I guess it is an area of global memory that you bound to a texture. I’m not familiar with that, so others may have some idea why you get random junk everywhere. I make my 1D data into a 2D texture in “array” memory 8 wide and N/8+1 high. This gives good 2D locality for the texture lookups, which are semi-random for me.
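Roughly, that setup looks like the sketch below. Treat it as a sketch only: the array, texture, and function names are placeholders, and the copy/binding calls are the generic runtime-API ones, so check them against your toolkit version:

#include <vector>
#include <algorithm>

// Hedged sketch: pack 1D data of N floats into an 8 x (N/8 + 1) cudaArray and
// bind it to a 2D texture reference, so semi-random lookups get 2D locality.
texture<float, 2, cudaReadModeElementType> data_tex;   // file-scope texture reference

void BindAs2DTexture(const float* h_data, int N)
{
    const int width  = 8;
    const int height = N / width + 1;

    // Pad the host data out to a full width x height rectangle.
    std::vector<float> padded(width * height, 0.0f);
    std::copy(h_data, h_data + N, padded.begin());

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* cu_array = 0;
    cudaMallocArray(&cu_array, &desc, width, height);
    cudaMemcpy2DToArray(cu_array, 0, 0, &padded[0],
                        width * sizeof(float),            // source pitch
                        width * sizeof(float), height,    // row width in bytes, number of rows
                        cudaMemcpyHostToDevice);
    cudaBindTextureToArray(data_tex, cu_array, desc);
}

// Inside a kernel, a linear index i becomes a 2D lookup (the 0.5f centres the texel):
//   float v = tex2D(data_tex, i % 8 + 0.5f, i / 8 + 0.5f);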
Anyway, this simple example code illustrates exactly the same thing that leads to me writing past the end of the array and getting the unspecified driver error. Say I have 15625 forces to write out and my threads have a block size of 384. Thus, there are 41 blocks running to cover all the particles, and the last 119 threads are writing past the end of my force array (which was only malloc’d to be 15625 floats). A simple “if (th < N)” is good enough to fix the problem and only causes a single divergent warp at the end of the array. One must be very careful though: if __syncthreads() is used anywhere in the body of the kernel, the if cannot simply cover the whole body and must be more carefully placed to avoid performing calculations on non-existent entities, yet minimize divergent warps while still participating in the __syncthreads().
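Concretely, the guard is just this (N, the kernel name, and the body are placeholders):

// Hedged sketch of the bounds guard: the grid is rounded up to whole blocks,
// so the trailing threads of the last block must not touch memory at all.
__global__ void WriteForcesGuarded(float* force, int N)
{
    int th = blockIdx.x * blockDim.x + threadIdx.x;
    if (th < N) {                      // only the last warp of the last block diverges
        // ... compute the force for entity 'th' ...
        force[th] = 0.0f;              // placeholder write
    }
    // If __syncthreads() were needed anywhere in the kernel, it would have to
    // sit outside this if so that every thread of the block reaches it.
}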
where ElNodeInd is a float array I compute on the CPU. The array into which I copy the data I read with the fetch was allocated with cudaMalloc. Then I copy that float array into a cudaArray in order to read the data later with a texture fetch in my second kernel, because as far as I understood I can’t pass the cudaArray directly as a kernel argument. But anyway, that’s not the point. I think I found my problem.
By computing the grid size as NumEls/dimBlock.x, the result depends on the number of elements I have. If NumEls is a multiple of my block size everything is perfectly fine; otherwise I don’t have enough threads to cover every element, so some memory locations stay uninitialised. I got partial zeros (or all zeros if the number of elements was smaller than my block size) when I initialised my float array to zero with cudaMemset prior to launching the kernel, and big numbers if I tried to initialise it to 1, for instance (I read that cudaMemset doesn’t seem to work very well). Anyway, since I use these numbers later as indices into another array in global memory, when I got big numbers the kernel was writing far past the end of that array and I got ‘unspecified driver error’. So that was my problem. But it’s not easy to tell from ‘unspecified driver error’ that the error comes from the block size… The error messages are not obvious at all. If NVIDIA can hear me, it would be great to have more understandable messages and maybe some documentation.
So if I understood correctly, I need the number of elements to be a multiple of the block size, or round the grid up and use something like if (th < NumEls). I thought out-of-range addresses were handled, but apparently only for texture fetches, not for writes to global memory.
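For completeness, the rounded-up launch would look something like this (d_Force_array and the surrounding variables are placeholders for however the host code holds them):

// Hedged sketch: ceiling division so the grid covers all NumEls elements even
// when NumEls is not a multiple of the block size; the kernel then guards with
// if (th < NumEls) as discussed above.
int  blockSize = 384;                                   // whatever block size is in use
dim3 dimBlock(blockSize);
dim3 dimGrid((NumEls + blockSize - 1) / blockSize);     // round up instead of truncating
Kernel1<<<dimGrid, dimBlock>>>(d_Force_array);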
I forgot an important point in my previous post: Atomic operations only work on 32-bit signed integers. So, I was actually wrong in saying that they can be used to make the algorithm above functionally correct.
Atomic operations are a hardware feature supported on the GeForce 8600 and 8500 Series, but not on the GeForce 8800 Series.
So what’s the point of scatter operations in that case? If we write to the same location the result may be incorrect. I would have preferred the operations to be serialized, to guarantee a correct result whatever happens. It means we can’t accumulate numbers across threads in global memory, for instance. Or am I wrong?
I want to understand what “atomic operations” means. I am Russian and I can’t quite parse this term. Please give me some examples of such operations to help me understand.
global += register requires the following to occur in hardware:
1. read the current value of global into a temporary register tmp
2. add register to the temporary register
3. write the temporary register back to global
The process is said to be “atomic” if and only if the ENTIRE process can be done as one operation without the chance of another thread interfering. Hence the use of the word atomic, which means the smallest building block that cannot be broken down further.
It may be easier to understand why an operation is NOT atomic. I’ll repeat my example above: imagine two threads that want to simultaneously increment the same global value.
Thread 1:
read global into tmp_reg1
tmp_reg1 = tmp_reg1 + value1
write tmp_reg1 to global memory
Thread 2:
read global into tmp_reg2
tmp_reg2 = tmp_reg2 + value2
write tmp_reg2 into global memory
Since these are NOT atomic operations, the simultaneous running of threads 1 and 2 would produce this:
Thread 1:                        Thread 2:
read global into tmp_reg1        read global into tmp_reg2
tmp_reg1 += value1               tmp_reg2 += value2
write tmp_reg1 to global         write tmp_reg2 to global
According to the programming guide, only one of the writes will succeed. So you should be able to see that the final value of global is either the initial value + value1 OR the initial value + value2.
In the scatter approach (if you still want to keep using it), an easy solution is to group the elements into several passes using a coloring. Each pass is handled by a kernel call that only operates on elements which don’t touch each other, so the interference is eliminated.
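A sketch of that multi-pass scheme (the coloring itself is assumed to be built on the CPU; ColorStart, ColorElems, and the other names are invented for illustration):

// Hedged sketch: elements of one color share no nodes, so within a single
// kernel launch no two threads write the same Force_array location. Colors
// are processed by successive launches, which act as the synchronization.
__global__ void ScatterOneColor(const int* ColorElems, int count, float* Force_array)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    int e = ColorElems[i];              // actual element index handled by this thread
    // ... compute this element's forces and scatter-add them as in Kernel1 ...
}

// Host side: one launch per color; successive kernel launches execute in
// order, so each pass sees the sums accumulated by the previous passes.
for (int c = 0; c < numColors; ++c) {
    int count = ColorStart[c + 1] - ColorStart[c];
    int grid  = (count + blockSize - 1) / blockSize;
    ScatterOneColor<<<grid, blockSize>>>(d_ColorElems + ColorStart[c], count, d_Force);
}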
That’s an interesting idea; I’ll need to think about it a bit to figure out how to apply it to my most complicated scatter situation, which is actually the generation of the connectivity graph itself (seems a bit of a chicken-and-egg problem, doesn’t it?). Though I may not need it, as I have a hybrid CPU/GPU implementation in the works.
I might, though, be able to use the graph coloring on the piece of the algorithm that actually uses the connectivity graph to compute forces. Right now I’m computing every force twice, instead of using Newton’s third law, because there is no global memory synchronization. Current speedups compared to the CPU are phenomenal, but if this graph coloring can make it even better, that would be sweet :)
I’m not very familiar with the implementation of coloring algorithms. Would the overhead be significant to color a graph, say with 50,000 vertices, each having ~100 edges? To get the kind of performance I’m hoping for, this connectivity graph needs to be updated ~300 or more times a second.
EDIT: On second thought, perhaps this technique won’t help. Even if there were a zero-overhead way to correctly handle all of the synchronization problems with scattering the Newton’s-third-law forces, there is still one big problem: I’m currently memory-access bound in my computation, and the read-modify-write cycle would more than double the memory accesses, nullifying any performance improvement from doing only half the computations. So I’ll have to stick with what I have.