How hard is it to port code written for one graphics card to another? Need your advice

I am just getting started (very slowly) with CUDA. I have been running all my programs on an emulator until now. I was planning to buy a graphics card, but my computer won't support any of the high-end ones. The most it can support is a GeForce 9400 GT. For anything more I would have to buy a new computer. (It's an office computer and I don't think they will let me change the power supply… Warranty issues)

 So my question is: how hard would it be to develop my code on a low-end GPU and, if and when it works, convert it to run on higher-end cards (say, a GTX 280)? I guess it will depend on the specific problem I am working on, but is there a general rule? If converting the code will be a major task, I would rather keep working on the emulator for now and wait until I can get a better computer and a better graphics card. If it won't be too big a problem, I can get the 9400 GT now.

 I am confused about this and would appreciate any advice you guys can give me.



The trick to writing code that scales is:

  1. Remember that older hardware can run at most 768 threads per SM, newer hardware 1024. A block size of 256 (or 128) divides evenly into both limits, so it is generally good for occupancy (of course, this also depends on how many registers your code uses).
  2. Launch lots of blocks. Faster hardware can process more blocks in parallel, so you need enough blocks to keep new hardware busy. The worst thing for scaling is programming for your particular hardware. If you have a card with 16 multiprocessors, you should NOT be running only 16 (or 32 or 64) blocks. A GTX 280 has 30 multiprocessors, so a 16-block launch would not use all of them.
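
To make the two points above concrete, here is a minimal sketch of a launch that does not depend on how many multiprocessors the card has. The names `blocks_for`, `scale` and `scale_on_device` are made up for illustration, and the element-wise scaling is just a stand-in for whatever your kernel really does. The block count is derived from the problem size, not from the hardware, so the same code should fill a small card and a large one alike.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): number of blocks needed to
// cover n elements with block_size threads per block (ceiling division).
inline int blocks_for(int n, int block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Illustrative kernel: multiply every element by factor. The grid-stride loop
// keeps the result correct even when fewer than blocks_for(n, ...) blocks are
// launched, because each thread then handles several elements.
__global__ void scale(float *data, float factor, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= factor;
    }
}

// Launch on a device buffer d_data holding n floats. The buffer is assumed to
// have been allocated with cudaMalloc and filled by the caller.
cudaError_t scale_on_device(float *d_data, float factor, int n)
{
    if (n <= 0) {
        return cudaSuccess;            // nothing to do, avoid an empty launch
    }
    const int block_size = 256;        // divides both 768 and 1024 evenly
    const int max_blocks = 65535;      // grid x-dimension limit on older GPUs
    int num_blocks = blocks_for(n, block_size);
    if (num_blocks > max_blocks) {
        num_blocks = max_blocks;       // the grid-stride loop covers the rest
    }
    scale<<<num_blocks, block_size>>>(d_data, factor, n);
    return cudaGetLastError();         // reports launch configuration errors
}
```

With about a million elements (`n = 1 << 20`), a call such as `scale_on_device(d_data, 2.0f, n)` launches 4096 blocks of 256 threads, far more than the 30 multiprocessors of a GTX 280, and nothing in the code needs to change when you move from the small card to the big one.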