Hi,
I can deliver exactly what the requirement asks for. I have solid experience in CUDA development, and I can even tell you, before developing the application, why the fifth option is the fastest:
5) CUDA + No Array transpose + Tiling (the fastest)
* Tiling minimizes the costly global memory reads: each tile of A and B is loaded once into shared memory and then reused many times, and reading from shared memory is fast.
* No array transpose: for each output element we accumulate the products of a row of A with a column of B directly, without first transposing B. Although each thread walks down a column of B, adjacent threads handle adjacent columns, so the warp's combined reads touch consecutive addresses (coalesced) and take full advantage of the CUDA memory hierarchy. A short sketch of this kernel follows this list.
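To make the idea concrete, here is a minimal sketch of the kind of tiled kernel I have in mind, not the final implementation: it assumes square N x N row-major float matrices, a 16x16 tile, and illustrative names and sizes (matMulTiled, N = 512) that I chose just for this example.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // 16x16 = 256 threads per block

// C = A * B for N x N row-major matrices.
// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// in shared memory so every global element is read once per tile instead
// of once per multiply-add.
__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Adjacent threads (threadIdx.x) load adjacent addresses of A and B,
        // so both global loads are coalesced without transposing either matrix.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // The inner products are accumulated entirely out of fast shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

int main()
{
    const int N = 512;  // illustrative test size only
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Each element should be N * (1.0 * 2.0) = 2N with this fill pattern.
    printf("C[0][0] = %.1f (expected %.1f)\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

This is exactly the access pattern the profiler should confirm: coalesced global loads, high shared memory throughput, and no separate transpose kernel needed.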
As proof of concept, after developing the application we will profile it with the CUDA Visual Profiler and back these claims up with concrete measurements.
You will need an NVIDIA graphics card and the CUDA Toolkit 5.5 (not necessarily the same version, but preferable).
My [login to view URL] output:
Device 0: "GeForce GTX 660"
CUDA Driver Version / Runtime Version 5.5 / 5.5
Regards
Marouane