Wednesday, July 14, 2021

1:00pm - 3:00pm

PhD Thesis Presentation

Orthogonalization Algorithms in the Context of Modern High Performance Computation and their Stability

This thesis focuses on orthogonalization schemes, their implementations on high performance parallel computers, and their stability properties. In short, given a set of vectors, the purpose of an orthogonalization scheme is to generate an orthonormal basis for their span. Orthogonalizing a set of vectors is an important part of non-symmetric iterative solvers for linear systems and eigenvalue problems. The thesis consists of four chapters: Chapters 1 and 2 focus on Gram-Schmidt algorithms, and Chapters 3 and 4 focus on Householder-reflection-based algorithms.

In Chapter 1, we present Delayed CGS2 (DCGS2), a new variant of the classical Gram-Schmidt algorithm with reorthogonalization (CGS2). The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration, and for the Arnoldi algorithm CGS2 is a method of choice. CGS2 requires three global reductions per iteration; DCGS2 reduces this to one reduction per iteration, the minimum possible for a one-column-at-a-time algorithm, and therefore improves the parallel strong-scaling of iterative solvers for non-symmetric matrices such as the GMRES and Krylov-Schur iterative methods. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations, which requires careful integration into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov-Schur eigenvalue computations, and performance experiments on the ORNL Summit supercomputer establish the superiority of DCGS2 over CGS2.
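To make the reduction count concrete, here is a minimal NumPy sketch of one CGS2 step (not the thesis implementation; the function name is illustrative). In a distributed-memory setting, each inner product with the basis and the final norm is a global reduction, giving the three reductions per step that DCGS2 batches into one.

```python
import numpy as np

def cgs2_step(Q, a):
    """One classical Gram-Schmidt step with reorthogonalization (CGS2).

    Orthogonalizes the vector `a` against the orthonormal columns of Q.
    In a distributed setting, each Q.T @ w and the final norm require a
    global reduction -- three per step, which DCGS2 merges into one.
    """
    r1 = Q.T @ a                  # global reduction 1: first projection
    w = a - Q @ r1
    r2 = Q.T @ w                  # global reduction 2: reorthogonalization
    w = w - Q @ r2
    beta = np.linalg.norm(w)      # global reduction 3: normalization
    return w / beta, r1 + r2, beta

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 5)))
q, r, beta = cgs2_step(Q, rng.standard_normal(100))
print(np.max(np.abs(Q.T @ q)))    # new column is orthogonal to Q
```

The sketch runs on one process; in an MPI implementation each marked line would be an `MPI_Allreduce`, which is why merging them matters for strong-scaling.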

In Chapter 2, we present an extensive stability analysis of several new orthogonalization schemes, gathered as a collection of observations that are interesting, relevant, and consequential. In the article ``Low synchronization Gram-Schmidt and generalized minimal residual algorithms'', published in Numerical Linear Algebra with Applications in 2020, {\'S}wirydowicz et al. present a family of orthogonalization schemes of the form $(I - QXQ^T)$, where $X$ is a matrix to be defined. All currently proposed forms for $X$ are based on the matrix $\textmd{tril}(\:Q^TQ,\:-1\:)$, the strictly lower triangular part of $Q^TQ$. In part 1 of Chapter 2, we study the stability of many proposed forms for $X$. Firstly, we show that, in our experiments, taking $X=( Q^TQ )^{-1}$ provides the best scheme. Secondly, we provide three algorithms to construct and apply $(I - QXQ^T)$ with $X=( Q^TQ )^{-1}$ in an iterative method. Thirdly, although some of these $X$ can be computed faster by using first-order approximations, we demonstrate that the first-order approximations are of lesser quality than their original counterparts. We conclude by recommending $X=( Q^TQ )^{-1}$.

In part 2, we ask whether ``to reverse or not to reverse'' the modified Gram-Schmidt algorithm (MGS). The standard algorithm is ``not reversed'', and its stability in finite-precision arithmetic has been well understood since 1967 thanks to \AA. Bj\"orck. Reversing MGS does not make much sense at first glance: in the context of a QR factorization of a general matrix, reversing MGS leads, in general, to a significantly inferior level of orthogonality. However, we observe that when the departure from normality of the matrix $A$ is small, as measured by the condition number of its basis of eigenvectors, say ``the condition number of the basis of eigenvectors of $A$ is less than $10^3$'', then, in the context of an Arnoldi expansion with coefficient matrix $A$, ``reverse MGS'' performs significantly better than ``standard'' MGS. The first contribution is to note that the two algorithms (reverse MGS and standard MGS) do not behave the same in finite-precision arithmetic, and that, sometimes, surprisingly, reverse MGS performs much better than standard MGS. The second contribution is a collection of experiments that seem to relate this phenomenon to an Arnoldi expansion whose coefficient matrix is not too non-normal. A direct application of this observation would be to reverse MGS in an Arnoldi expansion when the coefficient matrix is not too non-normal. We note that reversing MGS in the context of a low-synchronization Gram-Schmidt algorithm of the form $(I - QXQ^T)$ amounts simply to using $X^T$ instead of $X$.

In part 3, we study orthogonalization schemes in the context of Arnoldi with a ``large'' number of iterations, by which we mean ``after GMRES has converged''; this situation often occurs in eigenvalue solvers. We study the singular values of the set of vectors generated during Arnoldi. Our experiments show that CGS is unable to generate new directions once the set of vectors loses rank. In contrast, MGS is able to generate new directions after the set of vectors loses rank, although the new directions are generated at a slower pace than for a scheme such as CGS2 or Householder. We explain this behavior of MGS theoretically with an equation from the article ``A rank-$k$ update procedure for reorthogonalizing the orthogonal factor from modified {G}ram-{S}chmidt'', Giraud et al., SIMAX, 2004. In the context of ``full'' Arnoldi, we show that CGS2 may fail to provide a fully orthogonal set of vectors, while Householder does not. We also show that MGS-Arnoldi can generate a set of vectors spanning a space of dimension $n$, the size of the matrix $A$, although the number of iterations has to be greater than $n$.
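The projector family above is easy to sketch. Here is a minimal NumPy illustration (not one of the three algorithms from the thesis; the function name is illustrative) of applying $(I - QXQ^T)$ with the recommended choice $X = (Q^TQ)^{-1}$, which in exact arithmetic annihilates the component of a vector in the range of $Q$ even when $Q$ has drifted from orthonormality.

```python
import numpy as np

def project_out(Q, a):
    """Apply (I - Q X Q^T) a with X = (Q^T Q)^{-1}.

    Even when the columns of Q are not exactly orthonormal, this choice
    of X makes (I - Q X Q^T) an exact projector onto the orthogonal
    complement of range(Q) in exact arithmetic.
    """
    G = Q.T @ Q                              # Gram matrix
    return a - Q @ np.linalg.solve(G, Q.T @ a)

rng = np.random.default_rng(1)
Q = rng.standard_normal((50, 4))             # deliberately non-orthonormal basis
w = project_out(Q, rng.standard_normal(50))
print(np.max(np.abs(Q.T @ w)))               # w is orthogonal to range(Q)
```

Note that in a distributed implementation $Q^TQ$ and $Q^Ta$ can be formed together from one batched inner product, which is the source of the ``low synchronization'' property; the reverse-MGS variant discussed above would use $X^T$ in place of $X$.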

In Chapter 3, we present a new algorithm for performing the Householder QR factorization (GEQRF) of matrices. The main advantage of our new algorithm is that it decouples the algorithmic block size ($nb$) from the compact WY block size ($mt$). Thanks to this new algorithm we can, for example, perform a fully column-recursive QR factorization with limited overhead in the number of extra operations, and we can achieve many levels of column-blocking with an independently chosen compact WY block size. There are two main differences between the newly proposed algorithm and existing algorithms. Firstly, we include the $T$ matrix from the compact WY representation in the algorithm (in addition to the matrices $V$ and $R$). Secondly, we work with ``global'' index positions, so as to position ourselves correctly in the $T$ matrix. This enables us to construct the $T$ matrix ``as we go'' and reuse it independently of where we are in the factorization. A needed subroutine is one that ``connects two $T$ matrices'' from two different compact WY representations. In addition, we propose a second algorithm that also includes the $Q$ matrix, so that the QR factorization now returns $V$, $R$, $T$, and $Q$. We demonstrate the usefulness of the new algorithms. For example, they enable one to easily code a block Arnoldi method that performs Cholesky QR on the panel while the outer blocking scheme is performed with Householder reflectors. Most block orthogonalization schemes use a block Gram-Schmidt outer scheme and are fairly flexible in the inner (panel) scheme; LAPACK GEQRF uses Householder reflectors in both the inner and outer schemes. Our new scheme instead uses Householder reflectors in the outer scheme with a flexible inner scheme (Gram-Schmidt, Cholesky QR, TSQR, etc.). With this scheme, the cost of constructing both $T$ and $Q$ is about the cost of constructing $T$ or $Q$ alone; in other words, constructing $T$ is ``free'' if one needs $Q$.
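The ``connect two $T$ matrices'' subroutine mentioned above has a simple algebraic core, sketched here in NumPy (an illustrative version, not the thesis code): if $Q_1 = I - V_1T_1V_1^T$ and $Q_2 = I - V_2T_2V_2^T$, then $Q_1Q_2 = I - VTV^T$ with $V = [V_1\ V_2]$ and a block upper triangular $T$ whose off-diagonal block couples the two factors.

```python
import numpy as np

def connect_T(V1, T1, V2, T2):
    """Merge two compact WY representations into one.

    (I - V1 T1 V1^T)(I - V2 T2 V2^T) = I - V T V^T, where
    V = [V1 V2] and T = [[T1, -T1 (V1^T V2) T2], [0, T2]].
    """
    T12 = -T1 @ (V1.T @ V2) @ T2
    k1, k2 = T1.shape[0], T2.shape[0]
    T = np.block([[T1,                  T12],
                  [np.zeros((k2, k1)),  T2]])
    return np.hstack([V1, V2]), T

rng = np.random.default_rng(2)
n, k = 20, 3
V1, V2 = rng.standard_normal((n, k)), rng.standard_normal((n, k))
T1, T2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
Q1 = np.eye(n) - V1 @ T1 @ V1.T
Q2 = np.eye(n) - V2 @ T2 @ V2.T
V, T = connect_T(V1, T1, V2, T2)
print(np.allclose(Q1 @ Q2, np.eye(n) - V @ T @ V.T))  # True
```

The random $T_1$, $T_2$ here only exercise the algebraic identity; in a real factorization they would be the upper triangular factors produced from Householder reflectors (as in LAPACK's DLARFT). The identity is what allows the $T$ matrix to be built ``as we go'' independently of the block sizes.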

In Chapter 4, we discuss an implementation of the Householder algorithm within Trilinos using Level 2 BLAS, and we provide performance results in the QR factorization setting. The framework we work in is ``one column at a time'', that is, a left-looking algorithm with a block size of 1; this is a typical framework for iterative methods. The algorithm used is presented in the article ``Implementation of the {GMRES} Method Using {Householder} Transformations'' by Walker, SISSC, 1988. The use of Householder reflections to numerically compute an orthogonal basis is known to be unconditionally stable; for this reason, Householder reflections are used in most (if not all) state-of-the-art dense numerical linear algebra packages (e.g., LAPACK). We then present new algorithms that reduce the number of synchronizations per step in Walker's algorithm from five to two.
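The left-looking, block-size-1 framework can be sketched as follows in NumPy (an illustrative toy, not the Trilinos implementation or Walker's exact algorithm): each new column is first updated by all previously computed reflectors (Level 2 BLAS work), and only then is a new reflector computed from its trailing part.

```python
import numpy as np

def house(x):
    """Householder vector v and coefficient tau with (I - tau v v^T) x = alpha e_1."""
    v = x.copy()
    alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
    v[0] -= alpha
    tau = 2.0 / (v @ v)
    return v, tau, alpha

def left_looking_qr(A):
    """Left-looking Householder QR with block size 1.

    Column j is updated by reflectors 0..j-1, then reflector j is
    computed from its trailing part; V holds the reflector vectors.
    """
    m, n = A.shape
    V, taus, R = np.zeros((m, n)), np.zeros(n), np.zeros((n, n))
    for j in range(n):
        a = A[:, j].astype(float).copy()
        for i in range(j):                       # apply previous reflectors
            a -= taus[i] * V[:, i] * (V[:, i] @ a)
        v, tau, alpha = house(a[j:])
        V[j:, j], taus[j] = v, tau
        R[:j, j], R[j, j] = a[:j], alpha
    return V, taus, R

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
V, taus, R = left_looking_qr(A)
Q = np.eye(8)                                    # accumulate Q = H_1 ... H_n
for j in range(4, -1, -1):
    Q -= np.outer(taus[j] * V[:, j], V[:, j] @ Q)
print(np.allclose(Q[:, :5] @ R, A))              # True: A = QR
```

In this serial toy the synchronization structure is invisible; in the distributed setting each inner product and norm in the column step costs one synchronization, and the contribution of Chapter 4 is to reorganize these so that only two remain per step.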

Speaker: Daniel Bielich

Location: See email for Zoom