I’d be inclined to argue that hardware and numerics dictate each other. For the sake of simplicity, let’s consider a feature first introduced in ye olde tymes: floating point arithmetic. Back in the day, implementing an FPU cost enough, in terms of silicon area, etc., that including it or not made a real difference. As far as numerics went, you’d be foolish to use an algorithm that needed extensive floating point arithmetic on a system without native support. However, as time went on, the FPU proved useful enough to justify dedicated hardware. Of course, it also helped that miniaturization shrank the cost of adding the unit.
In the modern era of many-core processors, the cost/benefit war has returned in the GPU space, where the question becomes “Should we add new functionality that may or may not accelerate a given problem, or just add 5% more cores?” The answer depends on how many problems that functionality helps, and by how much, relative to its cost. But this is determined by numerics: find a new algorithm that applies to many problems and that a certain function can speed up significantly, and you suddenly have a very compelling case for adding that functionality to future hardware. In the meantime, it would be smart to choose your algorithms according to which ones work best on existing hardware.
Anyway, a modern GPU is functionally more or less the same as a huge cluster of generic CPUs. The big question for whether an algorithm will perform well on one is whether it can be broken into a great many small pieces that mostly avoid stepping on each other’s toes. Tuning for GPUs can actually be easier than for CPUs, since they’re simpler overall. More importantly, the SIMT programming model allows “standard” compilers to produce code that makes “good” use of the vector hardware without resorting to pseudo-assembler and black magic. On the flip side, getting good performance out of the memory subsystem requires more black magic on GPUs than on CPUs, due to GPUs’ lack of a large cache.
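To make the SIMT point concrete, here’s a minimal CUDA sketch (the kernel and its names are my own invention, not from any particular codebase): each thread handles one array element, so the work splits into many independent pieces, and because adjacent threads touch adjacent addresses, the memory accesses coalesce into wide transactions, which is the main way you get bandwidth without a big cache.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example: scale a vector by a constant.
// One thread per element -- the "many small independent pieces"
// structure that maps well onto SIMT hardware.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Adjacent threads read/write adjacent addresses, so the
        // loads and stores coalesce into wide memory transactions.
        x[i] = a * x[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    scale<<<blocks, threads>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);               // expect 2.0
    cudaFree(x);
    return 0;
}
```

Note that the kernel itself is plain scalar-looking code; the compiler and hardware handle the vectorization across the warp. The “black magic” only shows up when the access pattern isn’t this friendly, e.g. strided or scattered indexing, where you’d start reaching for shared memory and manual staging.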