Wednesday, July 14 2021
1:00pm - 3:00pm
PhD Thesis Presentation
Orthogonalization Algorithms in the Context of Modern High Performance Computation and their Stability

This thesis is focused on orthogonalization schemes, their implementations on
high performance parallel computers, and their stability properties. In
short, given a set of vectors, the purpose of an orthogonalization scheme is to
generate an orthonormal basis for their span. Orthogonalizing a set of vectors
is an important part of non-symmetric iterative solvers for the solution of
linear systems or eigenvalue problems. This thesis is made of four chapters.
Chapter 1 and 2 focus on Gram-Schmidt algorithms, Chapter 3 and 4 focus on
Householder reflection based algorithms.

In Chapter 1, we present Delayed CGS2 (DCGS2), a new variant of classical
Gram-Schmidt algorithm with reorthogonalization (CGS2). The parallel
strong-scaling of iterative methods is often determined by the number of global
reductions at each iteration. For the Arnoldi algorithm, classical
Gram-Schmidt algorithm with reorthogonalization (CGS2) is a method of choice.
CGS2 requires three global reductions per iteration. Delayed CGS2 (DCGS2)
reduces the number of global reductions to one reduction per iteration and
therefore, improves the parallel strong-scaling of iterative solvers for
non-symmetric matrices such as the GMRES and the Krylov-Schur iterative
methods. Delayed CGS2 (DCGS2) employs the minimum number of global reductions
per iteration (one) for a one-column at-a-time algorithm. The main idea behind
the new algorithm is to group global reductions by rearranging the order of
operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a
GMRES solver. Numerical stability experiments assess robustness for
Krylov-Schur eigenvalue computations. Performance experiments on the ORNL
Summit supercomputer then establish the superiority of DCGS2 over CGS2.

In Chapter 2, we present an extensive stability analysis of some new
orthogonalization schemes. This is a patchwork of observations that are
interesting and relevant and consequential. In the article ``Low
synchronization GramSchmidt and generalized minimal residual algorithms''
published in Numerical Linear Algebra and its Applications in 2020,
{\'S}wirydowicz et al. present a family of orthogonalization schemes of the form
\$(I - QXQ^T)\$ where \$X\$ is a matrix to be defined. All current proposed forms
for \$X\$ are based on the matrix \$\textmd{tril}(\:Q^TQ,\:-1\:)\$, the strictly
lower triangular part of \$Q^TQ\$. In Chapter 2, part 1, the stability of many
proposed forms for \$X\$ is studied. Firstly, we show that, in our experiments,
taking \$X=( Q^TQ )^{-1}\$ provide the best scheme. Secondly, we provide
three algorithms to construct and apply \$(I - QXQ^T)\$ with \$X=( Q^TQ )^{-1}\$ in
an iterative method. Thirdly, some of these \$X\$ can be computed faster by
using first-order approximations, however we demonstrate that the first-order
approximations are of lesser quality than their original counterparts. We
conclude by recommending using \$X=( Q^TQ )^{-1}\$. In Chapter 2, part 2, we ask
the question on whether ``to reverse or not to reverse'' the Modified
Gram-Schmidt algorithm (MGS). The standard algorithm is ``not reverse'', and
its stability in finite-precision arithmetic is well understood since 1967
thanks to \AA. Bj\"orck. Reversing MGS does not make much sense at first
glance. In the context of a QR factorization for a general matrix, reversing
MGS leads, in general, to a significantly inferior level of orthogonality.
However, we observe that when the level of normality of the matrix \$A\$ is
small, as measured by the condition number of the basis of eigenvectors of \$A\$,
small being say ``the condition number of the basis of eigenvectors of \$A\$ is
less than \$10^3\$'', then, in the context of an Arnoldi expansion with the
coefficient matrix \$A\$, then ``reverse MGS'' performs significantly better than
``standard'' MGS. The first contribution is to note that the two algorithms
(MGS reverse and MGS standard) do not behave the same in finite-precision
arithmetic, and that, sometimes, surprisingly, MGS reverse performs much better
than MGS standard. The second contribution is a collection of experiments that
seem to relate this phenomenon with an Arnoldi expansion with a coefficient
matrix that is not too non-normal. A direct application of this observation
would be to reverse MGS in Arnoldi expansion when the coefficient matrix is not
too non-normal. We note that that reversing MGS in the context of low synch GS
algorithm of the form \$(I - QXQ^T)\$ is simply about using \$X^T\$ instead of \$X\$.
In Chapter 2, part 3, we study orthogonalization schemes in the context of
Arnoldi and ``large'' number of iterations. By ``large'' number of iterations,
we mean ``after GMRES'' has converged. This situation occurs often in
eigenvalue solvers. We study the singular values of the set of vectors
generated during Arnoldi. Our experiments show that CGS is not able to generate
new directions as soon as the set of vectors lose rank. In contrast, MGS is
able to generate new directions after the set of vectors loses rank, although
the new directions are generated at a slower pace than for a scheme such as
CGS2 or Householder. We explain theoretically this behavior for MGS with an
equation from the article ``A rank-\$k\$ update procedure for reorthogonalizing
the orthogonal factor from modified {G}ram-{S}chmidt'', Giraud et al., SIMAX
2004. In the context of ``full'' Arnoldi, we show that CGS2 may fail to
provide a fully orthogonal set of vectors, while Householder is able to. We
also show that MGS-Arnoldi can generate a set of vectors with dimension
\$n\$, the size of the matrix \$A\$, although the number of iterations has to be
greater than \$n\$.

In Chapter 3, we present a new algorithm for performing the Householder QR
factorization (GEQRF) of matrices. The main advantage of our new algorithm is
that it disconnects the algorithm block size (\$nb\$) with the compact WY block
size (\$mt\$). Thanks to this new algorithm we can, for example, perform a fully
column-recursive QR factorization algorithm with a limited overhead in the
number of extra operations. We can also achieve many levels of column-blocking
with an independently chosen compact WY block size. There are two main
differences between the newly proposed algorithm and the existing algorithms.
Firstly, we include the \$T\$ matrix from the compact WY representation in the
algorithm (in addition to the matrices \$V\$ and \$R\$). Secondly, we work with
``global'' index position. (So as to position ourselves correctly in the \$T\$
matrix.) This algorithm enables us to construct the \$T\$ matrix ``as we go'' and
reuse it independently of where we are in the factorization. A needed
subroutine is the ``connect two \$T\$ matrices'' from two different WY
representations. In addition, we propose a second algorithm that includes the
\$Q\$ matrix. So that the QR factorization algorithm now returns \$V\$, \$R\$, \$T\$
and \$Q\$. We demonstrate the usefulness of the new algorithms. For example, this
enables one to easily code a Block Arnoldi method which performs Cholesky QR on
the panel, and then the outer blocking scheme is performed using Householder
reflectors. Most block orthogonalization schemes use a block Gram-Schmidt
outer scheme and are pretty flexible on the inner scheme (panel). LAPACK GEQRF
uses Householder reflectors in the inner and outer scheme. Our new scheme is to
first use Householder reflectors on the outer scheme and have a flexible inner
scheme (Gram-Schmidt, Cholesky QR, TSQR, etc.). With this scheme, the cost of
constructing \$T\$ and \$Q\$ is about the cost of constructing \$T\$ or \$Q\$. In other
words, constructing \$T\$ is ``free'' if one needs \$Q\$.

In Chapter 4, we discuss an implementation of the Householder algorithm within
Trilinos using Level 2 BLAS and we provide performance results in the QR
factorization setting. The framework we work is ``one-column-at-a-time'' or
``left-looking algorithm with a block size of 1''. This is a typical framework
for iterative methods. The algorithm used is presented in the article
``Implementation of the {GMRES} Method Using {Householder} Transformations''
from Walker, SISSC, 1988. The use of Householder reflection to numerically
compute an orthogonal basis is known to be unconditionally stable. For this
reason, Householder reflections are used in most (if not all) state-of-the-art
dense numerical linear algebra packages (e.g. LAPACK). Then we present new
algorithms where we reduce the number of synchronizations of Walker's algorithm
from 5 synchronizations per step to 2.
 Speaker: Daniel Bielich Affiliation: Location: See email for Zoom