## Acceleration of LU Factorization on GPGPUs

##### Abstract

Matrix operations are major bottlenecks in much of today's scientific research and computing software. By the nature of their applications, these operations typically demand floating-point arithmetic, high memory bandwidth, and substantial processing power. Solving systems of linear equations relies on one such operation, LU factorization. SPICE, the well-known circuit simulator, uses LU factorization for the transient analysis of circuits. During simulation, LU factorization of the augmented circuit matrix is performed repeatedly inside the Newton-Raphson method to find the transient voltage and current responses. As a result, a simulation may take days or weeks to complete on a PC.
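To make the role of LU factorization in solving a linear system concrete, the following is a minimal sketch in plain Python. It assumes a small dense matrix and the Doolittle scheme without pivoting; circuit simulators typically rely on sparse, pivoted variants, so the function names `lu_factor` and `lu_solve` and the example matrix are illustrative only.

```python
def lu_factor(a):
    """Doolittle LU factorization without pivoting (illustrative sketch).

    Returns (L, U) with L unit lower triangular and U upper triangular,
    such that L times U equals the input matrix a.
    """
    n = len(a)
    u = [row[:] for row in a]  # working copy that becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        pivot = u[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            factor = u[i][k] / pivot
            l[i][k] = factor
            # Subtract a multiple of the pivot row from row i.
            for j in range(k, n):
                u[i][j] -= factor * u[k][j]
    return l, u


def lu_solve(l, u, b):
    """Solve L U x = b by forward substitution, then back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(l[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x


# Hypothetical 3x3 system chosen so that the exact solution is [1, 1, 1].
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
L, U = lu_factor(A)
x = lu_solve(L, U, b)
```

Once the factors are available, each additional right-hand side costs only the two triangular solves. In a Newton-Raphson loop the matrix itself changes between iterations, so the factorization step is repeated, which is why it dominates the run time.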

Performance studies have shown that exploiting the parallelism (both data and task) of the LU factorization algorithm speeds up its execution. The idea is to characterize, schedule, distribute, and synchronize tasks so that operations on matrix rows, columns, and elements run in parallel (hundreds of millions of floating-point multiplications, additions, divisions, and matrix displacements per second), and to gather and scatter the large input and result data arrays. One suitable candidate platform is the Graphics Processing Unit (GPU). Although the GPU was designed for a special purpose, its many-core architecture, processing power, and memory bandwidth make it well suited to computationally demanding algorithms. Modern GPUs, with teraflops of computational power, gigabytes of memory capacity, and hundreds of gigabytes per second of memory bandwidth, appear to be a natural fit for accelerating LU factorization. The main challenges are the execution order and synchronization of tasks, and the optimization of memory accesses and CPU-GPU data transfers with respect to processing time and the size of the data arrays.
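The data parallelism and the synchronization requirement can be sketched as follows. This is a hedged illustration in Python, not the GPU implementation: a thread pool stands in for GPU threads, the names `eliminate_row` and `parallel_lu` are hypothetical, and pivoting is omitted. At pivot step `k`, every row below the pivot row can be updated independently; all of those updates must finish before step `k + 1` begins.

```python
from concurrent.futures import ThreadPoolExecutor


def eliminate_row(u, l, k, i):
    """Update row i against pivot row k.

    Only row i of u and entry l[i][k] are written, and pivot row k is
    only read, so calls with different i at the same k do not conflict.
    """
    factor = u[i][k] / u[k][k]
    l[i][k] = factor
    row_i, row_k = u[i], u[k]
    for j in range(k, len(row_i)):
        row_i[j] -= factor * row_k[j]


def parallel_lu(a, workers=4):
    """LU factorization (no pivoting) with row updates issued in parallel."""
    n = len(a)
    u = [row[:] for row in a]
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for k in range(n - 1):
            if u[k][k] == 0.0:
                raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
            # list() waits for every row update of step k to complete,
            # acting as the barrier between consecutive pivot steps.
            list(pool.map(lambda i: eliminate_row(u, l, k, i), range(k + 1, n)))
    return l, u


# Hypothetical example matrix.
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = parallel_lu(A)
```

Because of Python's global interpreter lock, this sketch shows the dependency structure rather than any speedup. On a GPU, the per-row (or per-element) updates would map to threads, and the barrier between pivot steps would correspond to a kernel boundary or an explicit synchronization.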

##### Current Status and Goals

An initial version of LU factorization has been implemented on GPUs.