cuda

The architecture organizes work into warps (32 threads executed in parallel), but on Tesla each SM is composed of only 8 cores. Does that mean only 8 threads are really running in parallel?

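As I understand it, all 32 threads of a warp are scheduled onto a given MP, and each instruction is effectively executed four times: the 8 cores handle 8 threads per pass, so one warp instruction takes four passes (32 / 8 = 4).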
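To make the arithmetic concrete, here is a rough sketch of that reasoning in Python. It is only a model of the numbers in this thread (32 threads per warp, 8 cores per MP), not a description of the real scheduler. The function names are made up for illustration, and the way lanes are grouped into passes is an assumption of mine, not something the hardware documentation guarantees.

```python
import math


def passes_per_warp_instruction(warp_size: int = 32, cores_per_mp: int = 8) -> int:
    """Number of passes needed to run one instruction for a whole warp.

    Defaults are the figures from this thread: 32 threads per warp,
    8 cores per multiprocessor.
    """
    if warp_size <= 0 or cores_per_mp <= 0:
        raise ValueError("warp_size and cores_per_mp must be positive")
    return math.ceil(warp_size / cores_per_mp)


def lanes_in_pass(pass_index: int, warp_size: int = 32, cores_per_mp: int = 8) -> list[int]:
    """Thread lanes handled in a given pass.

    ASSUMPTION: lanes are grouped in consecutive blocks (0-7, 8-15, ...).
    This grouping is purely illustrative.
    """
    start = pass_index * cores_per_mp
    return list(range(start, min(start + cores_per_mp, warp_size)))


if __name__ == "__main__":
    print(passes_per_warp_instruction())  # 4
    for p in range(passes_per_warp_instruction()):
        print(p, lanes_in_pass(p))
```

So from the programmer's point of view the warp still behaves as 32 threads executing the same instruction together. The 8-at-a-time detail only affects how long each instruction takes to issue.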