Abstract: High performance implementation of matrix multiplication is essential for scientific computing. The memory access procedure is quite possible to be the bottleneck of matrix multiplication.