Decompositional approach
Easy to analyze stability. Can reuse decomposition to solve multiple instances of the problem: eg consider A = LU.
Condition number of a matrix
Condition number of Ax when x perturbed
\(k = \sup_{\change x} \frac{\norm{A\change x}}{\norm{\change x}}\frac{\norm{x}}{\norm{Ax}} = \norm{A}\frac{\norm{x}}{\norm{Ax}} \leq \norm{A}\norm{A^{-1}}\) or \(\leq \norm{A}\norm{A^{+}}\) if A not square.
So, condition of \(A^{-1}b\) when b perturbed, \(k = \norm{A^{-1}}\frac{\norm{Ax}}{\norm{x}} \leq \norm{A}\norm{A^{-1}}\).
Condition number of matrix wrt norm
\(k(A) = \norm{A}\norm{A^{-1}}\).
2 norm condition number: \(\frac{\sw_{1}}{\sw_{n}}\). So, here, \(k(A) = k(A^{T})\).
Ill conditioned matrices: lose \(\log k(A)\) digits of accuracy: a change of 1 bit in \(\frac{\norm{\change x}}{\norm{x}}\) can produce a change worth \(\log k(A)\) bits in \(\frac{\norm{A\change x}}{\norm{Ax}}\).
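A quick numerical check of this, as a minimal NumPy sketch (the nearly-singular 2x2 matrix is just an illustrative choice):
\begin{verbatim}
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])       # nearly singular, so badly conditioned
k = np.linalg.cond(A, 2)            # 2-norm condition number = sigma_1 / sigma_n
print(k, np.log10(k))               # roughly log10(k) digits are at risk when solving Ax = b
\end{verbatim}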
Condition of \htext{\(x=A^{-1}b\)}{inverse problem} when A perturbed
\((A+\change A)(x+\change x) = b\). So, ignoring \(\change A \change x\), \(\change x = -A^{-1}(\change A)x\). Taking norms (submultiplicativity), \(\norm{\change x} \leq \norm{A^{-1}}\norm{\change A}\norm{x}\). So, \(k = \sup_{\change A}\frac{\norm{\change x }\norm{A}}{\norm{\change A}\norm{x}} \leq \norm{A}\norm{A^{-1}}\).
Solving Ax=b for x
Assumption
\(b \in range(A)\).
Stability and expense
Using SVD is most reliable, but expensive. LU is cheapest, QR is more expensive, but more reliable. Usually, look at condition number k(A) and decide method.
Triangularization by row elimination
Using PA=LU
Make augmented matrix \(X = \mat{ A & b}\) corresponding to problem Ax = b; use row operations to reduce X to \(\mat{ U & L^{-1}b}\): thence get the problem \(U x = L^{-1}b\); do back substitution to get \(\mat{ I & U^{-1}L^{-1}b}\), corresponding to the problem \(Ix = U^{-1}L^{-1}b\). Thus got a particular solution \(x_r\).
Then, to get set of all solutions: \(\set{x_r + x_n: x_n \in N(A)}\).
Cost
If you have L,U: Solve LUx = b by solving Ly = b (forward substitution, \(O(m^{2})\) flops); then solving Ux=y (back substitution, \(O(m^{2})\) flops).
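A hedged sketch of this reuse pattern with SciPy's LU routines (the 2x2 system is only an illustration):
\begin{verbatim}
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

lu, piv = lu_factor(A)        # PA = LU, O(m^3), done once
x = lu_solve((lu, piv), b)    # forward + back substitution, O(m^2) per right-hand side
print(x)                      # [1. 2.]
\end{verbatim}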
Stability
Back substitution is stable.
With A=QR
Using Householder reflections
Get A=QR; get \(y = Q^{*}b\); get \(x = R^{-1}y\). For stability, Householder reflections are used to get A=QR.
Note: \(y = Q^{*}b\) implicitly found as part of using Householder reflections to get \(R = Q^{*}A\) by instead doing \(\mat{R & y} = Q^{*}\mat{A & b}\).
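A minimal sketch of the QR route in NumPy/SciPy (the helper name is mine; np.linalg.qr is LAPACK's Householder-based factorization, to the best of my knowledge):
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_triangular

def solve_via_qr(A, b):
    Q, R = np.linalg.qr(A)            # reduced QR via Householder reflections
    y = Q.T @ b                       # y = Q^* b
    return solve_triangular(R, y)     # x = R^{-1} y by back substitution
\end{verbatim}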
Backward stability
For \(\frac{\norm{\Del A}}{\norm{A}} = O(\eps)\), \((A+ \Del A)\hat{x} = b\): use \(A + \del A= \tilde{Q}\tilde{R}\) from backward stability of Householder; \((\tilde{Q}+\del Q)y = b\); \((\tilde{R} + \del R)\hat{x} = y\); set \(\Del A = \del A + \del Q\tilde{R} + \tilde{Q}\del R\).
So accuracy: \(\frac{\norm{\del x}}{\norm{x}}=O(k(A)\eps)\).
Use SVD
Take \(Ax = U\SW V^{*} x = b\); each of these factors is easy to invert. Costly, but very accurate: \(\SW\) reveals numerical rank; in case of very small \(\sw_{i}\), drop the corresponding \(u_{i}\) and \(v_{i}\) and get a well conditioned problem.
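A sketch of the SVD solve with truncation of tiny singular values (the cutoff rcond and the helper name are my arbitrary illustrative choices):
\begin{verbatim}
import numpy as np

def solve_via_svd(A, b, rcond=1e-12):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s[0]            # drop tiny sigma_i: trades accuracy for conditioning
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])
\end{verbatim}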
Use determinants (Impractical)
(Cramer) Let C be cofactor matrix : \(C_{i,j}\) multiple of \(A_{i,j}\) in \(|A|\) formula; \(C^{T}Ax = C^{T}b\); so \(x = \frac{C^{T}b}{det(A)}\); so for \(x_{j}\), find jth row of \(C^{T}\) (= cofactors of jth col of A) times b = det (A where b replaces jth col).
Find the inverse
Find \(A^{-1}b\): A change of basis operation, saying b in terms of C(A) rather than \(e_{i}\)’s.
For Hermitian +ve semidefinite \htext{\(X=A^{*}A\)}{..}
Use LU/ Cholesky
Use \(X = R^{*}R\). Time needed to solve two triangular systems: \(O(m^{2})\); time needed to make R: \(O(m^{3}/3)\).
Stability
Diagonal heavy if \(X \succ 0\): so no pivoting required. \why The Cholesky alg for making R is backward stable. So, \((X+\del X) \tilde{x} = b\) for relatively small \(\del X\).
Avoiding \htext{\(X=A^{*}A\)}{..} if A known
In doing \(A^{*}A\), you square the condition number. So, if A is ill conditioned, you get an even more ill conditioned problem.
So, don’t do this. Instead, replace A with LU or QR or \(U\SW V^{*}\) and solve.
Use A=QR, if A known
Get \(A=\hat{Q}\hat{R}\) (\(2mn^{2} - \frac{2}{3}n^{3}\) flops), solve \(Rx=\hat{Q}^{*}b\). More numerically stable than Cholesky \why.
Get \(A=\hat{Q}\hat{R}\), so \(A^{*}A = \hat{R}^{*}\hat{R}\) (Cholesky: LU for \(A^{*}A\)); get \(\hat{R}^{*}\hat{R}x = A^{*}b\). Implementation: solve \(\hat{R}^{*}w=A^{*}b\), then \(\hat{R}x=w\). Finding \(A^{*}A\) + Cholesky = \(mn^{2} + \frac{n^{3}}{3}\) flops.
Diagonalize and solve
Find thin SVD (\(2mn^{2} + 11n^{3}\) flops): \(A=\hat{U}\hat{\SW}V^{*}\): \(P=A(A^{*}A)^{-1} A^{*} = \hat{U}\hat{U}^{*}\); so \(\hat{U}\hat{\SW}V^{*}x = \hat{U}\hat{U}^{*}b\), i.e. \(\hat{\SW}V^{*}x = \hat{U}^{*}b\); very dependable.
Iteratively solve Ax=b
Get x one component at a time
Take A = D + L + U, with D diagonal, L/ U strictly lower/ upper triangular.
Using updated components immediately
Aka Gauss-Seidel.
Take \(x^{(k)}\), get \(x^{(k+1)}\) by this: find \(x^{(k+1)}_{j}\) using \(A, b, x^{(k+1)}_{1:j-1}, x^{(k)}_{j+1:n}\). So, \(Dx^{(k+1)} = b - Lx^{(k+1)} - Ux^{(k)}\); \((D+L)x^{(k+1)} = b - Ux^{(k)}\).
Not guaranteed to converge in general; does converge if A is symmetric +ve definite or strictly diagonally dominant.
Use only old guesses of x
Aka Jacobi iteration.
Take \(x^{(k)}\), get \(x^{(k+1)}\) by doing this: find \(x^{(k+1)}_{j}\) using \(A, b, x^{(k)}_{i\neq j}\): so using old iterates of x uniformly, rather than some old and some new iterates, in finding \(x^{(k+1)}_{j}\). So, \(Dx^{(k+1)} = b - Lx^{(k)} - Ux^{(k)}\).
Not guaranteed to converge in general; does converge if A is strictly diagonally dominant.
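A sketch of both sweeps in NumPy (function names are mine); it assumes a square A with nonzero diagonal, e.g. strictly diagonally dominant, so that both iterations converge:
\begin{verbatim}
import numpy as np

def jacobi(A, b, iters=200):
    D = np.diag(A)
    R = A - np.diag(D)                      # L + U
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D                 # D x^{k+1} = b - (L+U) x^{k}
    return x

def gauss_seidel(A, b, iters=200):
    n = len(b)
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        for i in range(n):                  # uses already-updated entries x[:i]
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x
\end{verbatim}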
Using \htext{\((I-A)^{-1}\)}{Neumann} series for square A
(Neumann) See linear algebra ref. Can be used in solving \(A’x = b\) where \(A’ = (I-A)\).
Overdetermined system of equations: Least squares solution
Problem
Solve \(\min_x \norm{Ax-b}\), given \(m>n\), b not in C(A). Error vector \(e=A\hat{x}-b\). We want to \(\min_x \norm{e}^{2}\).
For versions with regularizers, weights etc.. see optimization ref.
Importance
Solving this for the case where 2 norm is used in the problem specification \(\equiv\) orthogonal projection.
Useful also in linear regression; in that context it is also the maximum likelihood solution for data generated by a linear fn in the presence of Gaussian noise: see statistics, optimization ref.
Solution
Projection + inverse operation
Two steps: First, project \(b\) to \(range(A)\); that is, find \(b’ = \argmin_{v \in ran(A)} \norm{b-v}\). Then, solve \(Ax = b’\).
When error magnitude is measured wrt some other norms, other, perhaps oblique projections may be needed.
Solution form: 2-norm minimization
\(\norm{Ax-b}^{2} = (Ax-b)^{*}(Ax-b)\), set \(\gradient f(x) = 0\) to find minimum. This is equivalent to solving \ \(A^{*}(b-A\hat{x})=0\) or \(A^{*}A\hat{x}=A^{*}b\) (Normal equations).
As orthogonal projection + inverse
Note that the condition \(A^{*}(b-A\hat{x}) = 0\) matches the condition from geometric intuition that the error vector \(e \perp ran(A)\).
Solution algorithm
The projection and inversion can either be done together or separately. In the former case, projection can be accomplished by finding an orthogonal basis for \(ran(A)\) using either QR or SVD.
Non-singular A
\(A^{*}A\) invertible. \(\hat{x}=(A^{*}A)^{-1}A^{*}b, p = A\hat{x}\).
\(P=A(A^{*}A)^{-1} A^{*}\). \((A^{*}A)^{-1}=A^{-1}(A^{*})^{-1}\) iff A is square and both exist.
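For the full-column-rank case, a hedged sketch contrasting the normal-equations route with the more robust built-ins (the helper name is mine):
\begin{verbatim}
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lstsq_normal_equations(A, b):
    """Assumes A has full column rank, so A^T A is SPD; note k(A^T A) = k(A)^2."""
    c, low = cho_factor(A.T @ A)          # A^* A = R^* R
    return cho_solve((c, low), A.T @ b)

# For ill-conditioned A, prefer QR- or SVD-based routines, e.g.
#   x = np.linalg.lstsq(A, b, rcond=None)[0]     # SVD-based
\end{verbatim}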
Pseudoinverse for non-singular A
See linear algebra ref.
Rank deficient A
\(A^{*}A\) not invertible, rank deficient. \pf \(N(A^{*}A) = N(A)\): \(A^{*}Ax = 0 \implies x^{*}A^{*}Ax = \norm{Ax}^{2} = 0\). I-P projects to \(N(A^{T})\) (not N(A)): \((I-P)b=e\).
Solution
\(A^{*}A\hat{x}=A^{*}b\) has many solutions for \(x\). This is equivalent to the solution obtained by projecting \(b\) to \(ran(A)\) to get \(b’\) and then solving \(Ax = b’\).
Triangularization by row elimination
Aka Gaussian elimination. Get PA=LDU where L is unit lower triangular, D is diagonal, U is unit upper triangular. This is unique: see linear algebra ref.
Algorithm
In step i, subtract multiples of row \(A_{i,:}\) from rows \(A_{i+1:m,:}\) so as to make \(A_{i+1:m, i}\) subcolumn 0. These row operations correspond to doing \(L^{-1}A = U\) to get U from A.
Pivoting
But, maybe \(A_{i,i} = 0\). In this case, we need to bring a row k with \(A_{k,i} \neq 0\) in place of row i for the algorithm to proceed.
Pivoting is also needed for stability of the algorithm.
Cost
Find L, U (\(\frac{2m^{3}}{3}\) flops).
Various Formulations
Reordering the loops: ijk version puts 0’s column-wise. ikj version puts 0’s row-wise, has good storage properties \why.
Reducing memory usage
Overwrite A with L and U.
Outer product formulation
\(A_{2:m,2:m}-\frac{A_{2:m,1}A_{1,2:m}^{*}}{a_{1,1}}\).
Reducing memory access
Let \(l_{k}\) be \(A_{k+1:m,k}/a_{k,k}\) (the multipliers) padded above with k zeros, taken after k-1 steps of triangularization; then the kth step of triangularization is to apply \(L_{k}^{-1} = I-l_{k}e_{k}^{*}\): \(A=L_{k}^{-1}A\). Reducing A to U is to reduce it to \(\mat{U_{1,1} & U_{1,2}\ 0 & U_{2,2}}\); can do this blockwise, repeatedly. So make yer alg use this decomposition, and block \(\times\) as basic ops, to avoid having to get each col to cache.
Instablity when no pivoting
\(A=\mat{10^{-20} & 1\ 1 & 1}\) yields \(\tilde{L}=\mat{1 & 0\ 10^{20} & 1}\) and \(\tilde{U}=\mat{10^{-20} & 1\ 0 & -10^{20}}\); the computed factors themselves have small relative errors, but \(A-\tilde{L}\tilde{U}\) is big: so not backward stable (considering functions of m or A in the error bound).
Instability arises when, because a pivot is very small wrt other elements in A, some element t in \(\tilde{L}\) or \(\tilde{U}\) is huge: its relative error is \(O(\eps)\) but its absolute error is \(O(\eps t)\).
\(|\Del A| \leq 3n\eps|L||U|\). \why \(\tilde{L}\tilde{U} = A + \Del A =LU + \Del A, \frac{\norm{\Del A}}{\norm{L}\norm{U}} = O(\eps)\).
Partial pivoting (GEPP)
At step k, pivot selected as biggest element in \(A_{k:m,k}\). Take \(\prod L_{i}P_{i}A = U\) where \(L_{i}\) are atomic unit lower triangular, use \(PLP^{*} = L’\) (row exchange in L = col exchange in L’), get PA = LU.
Stability of GEPP
\(|L_{i,j}| \leq 1; \norm{L} =O(1)\); let Growth Factor \(\rho = \frac{\max |U_{i,j}|}{\max |A_{i,j}|}\); \(\norm{U} = O(\rho \norm{A})\); so \(\tilde{L}\tilde{U} = \tilde{P}A + \Del A, \frac{\norm{\Del A}}{\norm{A}} = O(\rho \eps)\).
Maximal instability: \(\rho = 2^{m-1}\): Eg (m=2): \(A=\mat{1 & 1\ -1 & 1}\); m-1 digits of accuracy lost. Instability occurs very rarely; usually \(\rho \leq m^{1/2}\). Used in Matlab's backslash operator.
Complete pivoting
At step k, pivot selected as biggest element in \ \(A_{k:m,k:m}\). \(O(m^{3}) = \sum i^{2}\) flops: expensive. \(PAP’ = LU\). Stable.
Avoidance of pivoting
If X is +ve definite, no pivoting required: as all principal submatrices are +ve definite - so non-singular - X = LU exists.
Also, if X is diagonally dominant, diagonal dominance is preserved during triangularization. So, it does not require pivoting.
Formula for pivots
Let \(A_{k}\) be the leading \(k\times k\) submatrix; then from block multiplication, \(P_{k}A_{k} = L_{k}D_{k}U_{k}\) holds; so pivots can also be found by \(\frac{|D_{k}|}{|D_{k-1}|} = \frac{|A_{k}|}{|A_{k-1}|}\).
Symmetric Elimination Algorithm for spd A
Do Gaussian elimination + extra column ops to diagonalize/ maintain symmetry at each step.
\[ A = \mat{a_{1,1} & A_{2,1}^{*}\ A_{2,1} & A_{2,2}} = \mat{1 & 0\ \frac{A_{2,1}}{a_{1,1}} & I} \mat{a_{1,1} & 0\ 0 & A_{2,2}-\frac{A_{2,1}A_{2,1}^{*}}{a_{1,1}}} \mat{1 & \frac{A_{2,1}^{*}}{a_{1,1}}\ 0 & I} =LDL^{*} \]
Get \(R^{*}R\) by doing \(LD^{1/2}\) at each step.
Code and Opcount
R=A; Repeat: do symmetric elimination on the trailing submatrix \(R_{i+1:m,i+1:m}\); scale: \(R_{i,i:m} := R_{i,i:m}/\sqrt{r_{i,i}}\). Only the upper part of R is stored.
Opcount: \(\sum_{k=1}^{m} \sum_{j=k+1}^{m} 2(m-j) \approx \frac{m^{3}}{3}\) flops.
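A direct transcription of this symmetric-elimination view of Cholesky, as a sketch (no pivoting; assumes a real SPD X; function name is mine):
\begin{verbatim}
import numpy as np

def cholesky_upper(X):
    """Return upper-triangular R with X = R^T R (real SPD X assumed)."""
    X = np.array(X, dtype=float)
    m = X.shape[0]
    R = np.zeros_like(X)
    for k in range(m):
        R[k, k] = np.sqrt(X[k, k])
        R[k, k+1:] = X[k, k+1:] / R[k, k]
        # symmetric elimination step on the trailing submatrix
        X[k+1:, k+1:] -= np.outer(R[k, k+1:], R[k, k+1:])
    return R
\end{verbatim}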
Stability
By SVD: \(\norm{R}_{2} = \norm{R^{*}}_{2} = \norm{A}_{2}^{1/2}\); so \(\norm{R} \leq \sqrt{m}\norm{A}\) \chk. So, R never grows large. So, backward stable: get \(\hat{R}^{*}\hat{R}\) for a perturbed A. Forward error in R large; but R and \(R^{*}\) diabolically correlated.
A=QR
Triangular orthonormalization
Take the \(m\times n\) matrix A; arrive at the matrix \(Q_n = A\hat{R}^{-1}\) by applying triangular operations on the right. At step j, you have \(Q_{:,1:j-1} = Q_{j-1}\); find the direction of \(q_j\) by removing from \(a_j\) its component in the subspace spanned by \(Q_{j-1}\).
Thence, you arrive at reduced QR: \(A=Q_n\hat{R}\), where \(Q_n\) is an \(m\times n\) matrix. Can extend Q thence to be a square matrix: this is full QR.
So, QR \(\exists \forall A\). If sign(\(R_{i,i}\)) is fixed to be +ve, QR of a full-rank A is unique.
Gram-Schmidt classical
At step j: Take \(a_j-\hat{Q}_{j-1}\hat{Q}_{j-1}^{*}a_j\) and normalize it to get \(q_j\), with \(\hat{Q}_{j-1} = [q_1 .. q_{j-1}]\).
\(2mn^{2}\) flops.
Instability
Even by the time it calculates \(q_{20}\), error becomes unbearable.
Double Gram-Schmidt
Get A = QR, then do Q = Q’R’ to re-orthogonalize Q. A surer way of getting orthogonal basis for range(A).
Gram-Schmidt Modified (MGS)
At step j: Remove the component of \(a_j\) first in the subspace \(\linspan{q_1}\), then remove the component of the residue from \(\linspan{q_2}\) and so on. Algebraically same as classical version: \(\hat{Q}_{j-1}\hat{Q}_{j-1}^{*} = \sum_1^{j-1}q_iq_i^{*}\). Computational cost same as classical version.
Round off error
Very small angle twixt \(q_{1}, a_{2}\); so \(q’_{2}=a_{2}-q_{1}q_{1}^{*}a_{2}\) very small, with smaller error (maybe \(10^{-15}\) in \(q’_{2,j}\) wrt \(q’_{2,k}\)); err amplified maybe \(10^{10}\) times when \(q_{2}\) made after normalization. MGS has lesser roundoff error: \why.
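A sketch of MGS reduced QR in NumPy (function name mine); the only coding difference from the classical version is that each inner product uses the current residual v rather than the original column:
\begin{verbatim}
import numpy as np

def mgs_qr(A):
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ v      # project out q_i from the *current* residual
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R
\end{verbatim}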
Orthogonal triangularization
(Householder) Do \(Q^{*}A = R\), not \(A\hat{R}^{-1} = \hat{Q}\). Init: R=A; \(Q^{*} = Q_{n} .. Q_{1}\); \(Q_{k} = \mat{I_{k-1} & 0 \ 0 & F}\) leaves the first k-1 rows and columns alone; the Householder reflector F reflects x (the last m-k+1 entries in \(r_{k}\)) to \(\norm{x}e_{1}\), and sends the last entries in \(r_{k+1} .. r_{n}\) elsewhere. F reflects across the hyperplane \(\perp v = x - \norm{x}e_{1}\) (or \(\perp v’ = -x - \norm{x}e_{1}\)); by geometry, \(\frac{vv^{*}}{v^{*}v}x\) is the projection of x on v; so \(Fx = (I - 2\frac{vv^{*}}{v^{*}v})x = \norm{x}e_{1}\).
To avoid catastrophic cancellation when \(x - \norm{x}e_{1}\) is very small, choose \(v = -sign(x_{1})x - \norm{x}e_{1}\), i.e. reflect x to \(-sign(x_{1})\norm{x}e_{1}\). \(F^{*} = F\), so \(Q_{k}^{*}=Q_{k}\). Needs: \(2mn^{2} - \frac{2}{3}n^{3}\) flops.
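A sketch of Householder QR storing the reflector vectors, plus the implicit application of \(Q^{*}\) to a vector (the trick used above to get \(y=Q^{*}b\) for free). Names are mine and no rank-deficiency handling is attempted:
\begin{verbatim}
import numpy as np

def householder_qr(A):
    """Return reflector vectors W (columns) and R with Q^* A = R."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    W = np.zeros((m, n))
    for k in range(n):
        x = A[k:, k].copy()
        v = x.copy()
        s = np.sign(x[0]) if x[0] != 0 else 1.0
        v[0] += s * np.linalg.norm(x)          # pick the far reflection: no cancellation
        v /= np.linalg.norm(v)
        W[k:, k] = v
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
    return W, np.triu(A[:n, :])

def apply_Q_star(W, b):
    """Compute Q^* b from the stored reflectors, without forming Q."""
    b = np.array(b, dtype=float)
    for k in range(W.shape[1]):
        v = W[k:, k]
        b[k:] -= 2.0 * v * (v @ b[k:])
    return b
\end{verbatim}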
Stability of finding Q, R
Calculated \(\hat{Q}\) and \(\hat{R}\) can have large forward errors; but they’re diabolically correlated; backward error or residual \(A -\hat{Q}\hat{R}\) very small. \(A + \Del A = \hat{Q}\hat{R}\) for \(\frac{\norm{\Del A}}{\norm{A}} = O(\eps)\). So, backward error analysis best way to proceed.
Find eigenvalues
Hardness
Can’t get directly for \(m\geq 5\): thm from Galois theory that roots can’t be expressed as radicals etc.. So, iterative part necessary in alg. 2 phases: Preliminary reduction to structured form; then iterative part.
Usual approach
Easily reduce to almost-triangular form using similarity transformations. Then, use more similarity transformations to iteratively get close to triangular form.
Finding a single ew
(Dhillon) If you have an idea about the approximate size of the ew, you can find it in \(O(n^{2})\). Else, if ye need kth ew, it takes \(O(kn^{2})\) time.
For sparse A, it is much cheaper. \tbc
Use characteristic polynomial
Find roots using a rootfinder. So, every ew has one non-0 ev. Ill conditioned. Eg: if the coefficients of \(x^{2}-2x+1 = (x-1)^{2}\) change by \(\eps\), the roots change by \(O(\sqrt{\eps})\).
Reduction to Upper Hessenberg matrix H
0’s below first sub diagonal. If \(A=A^{*}\), get Tridiagonal matrix. Use Householder reflections: \ \(H = Q_{m-2}^{*} .. Q_{1}^{*}\, A\, Q_{1} .. Q_{m-2}\).
These are similarity transformations, so \(\ew(H) = \ew(A)\).
For large sparse matrices: Use Arnoldi iteration.
Op count
Row ops, at 4 flops per num: \(\frac{4 m^{3}}{3}\); Col ops, at 4 flops per num: \(2 m^{3}\); total: \(\frac{10 m^{3}}{3}\). Reduced work if \(A=A^{*}\): \(\frac{4 m^{3}}{3}\).
Stability
\(\tilde{Q}\tilde{H}\tilde{Q}^{*} = A + \del A\) for relatively small \(\del A\).
Approach Eigenvalue revealing factorizations
$$A = QUQ^{*}; \ U = .. Q_{2}^{*}Q_{1}^{*}AQ_{1}Q_{2}..$$
If \(A=A^{*}\), this leads to unitary diagonalization.
Power iteration for real symmetric A
The series \(v^{(i)} = \frac{A^{i}x}{\norm{A^{i}x}}\) and \(l^{(i)} = r(v^{(i)})\) converge to eigenpair corresponding to largest ew \(\ew_{1}, q_{1}\): as \(x = \sum a_{i}q_{i}\).
So, Applying A repeatedly takes x to dominant ev.
Convergence
Linear convergence of ev. \[ \norm{v^{(i)} - (\pm q_{1})} = O(|\frac{\ew_{2}}{\ew_{1}}|^{i}),\ |\ew^{(i)} - \ew_{1}| = O(\norm{v^{(i)} - (\pm q_{1})}^{2}) \]
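A sketch of power iteration (names mine; assumes real symmetric A with a strictly dominant eigenvalue):
\begin{verbatim}
import numpy as np

def power_iteration(A, x0, iters=500):
    v = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)      # v^{(i)} = A^i x / ||A^i x||
    lam = v @ (A @ v)                  # Rayleigh quotient r(v)
    return lam, v
\end{verbatim}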
Inverse iteration
ev of A and \((A-pI)^{-1}\) same, ew \(\ew_{i}\) shifted and inverted to get ew \((\ew_{i} - p)^{-1}\). If p near \(\ew_{j}\), using power iteration on \((A-pI)^{-1}\) gives fast convergence.
Good for finding ev if ew already known.
Convergence
Linear convergence of ev. \ \(\norm{v^{(i)} - (\pm q_{j})} = O(|\frac{p-\ew_{j}}{p-\ew_{k}}|^{i})\), where \(\ew_{k}\) is the ew second closest to p; \(|\ew^{(i)} - \ew_{j}| = O(\norm{v^{(i)} - (\pm q_{j})}^{2})\).
Alg
Solve \((A-pI)w = v^{(k-1)}\); normalize to get \(v^{(k)}\).
Rayleigh quotient iteration
Inverse iteration, where \(\ew^{(i)} = R(v^{(i)})\) used as p (ew estimate).
Convergence
Cubic convergence of ev and ew: if \(\norm{v^{(k)} - q_{j}} \leq \eps\), then \(|\ew^{(k)} - \ew_{j}| = O(\eps^{2})\). So \(\norm{v^{(k+1)} - q_{j}} = O(|\ew^{(k)} - \ew_{j}|\norm{v^{(k)} - q_{j}}) =\ O(\norm{v^{(k)} - (\pm q_{j})}^{3})\). \(|\ew^{(k+1)} - \ew_{j}| = O(\norm{v^{(k+1)} - q_{j}}^{2}) = O(|\ew^{(k)} - \ew_{j}|^{3})\).
Gain 3 digits of accuracy in each iteration.
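A sketch of Rayleigh quotient iteration (inverse iteration with the Rayleigh-quotient shift); names are mine, and an exactly singular shifted solve is treated as the shift having hit an eigenvalue:
\begin{verbatim}
import numpy as np

def rayleigh_quotient_iteration(A, v0, iters=20):
    n = A.shape[0]
    v = v0 / np.linalg.norm(v0)
    lam = v @ (A @ v)                                     # initial shift p
    for _ in range(iters):
        try:
            w = np.linalg.solve(A - lam * np.eye(n), v)   # one inverse-iteration step
        except np.linalg.LinAlgError:                     # shift equals an eigenvalue
            break
        v = w / np.linalg.norm(w)
        lam = v @ (A @ v)                                 # update the shift
    return lam, v
\end{verbatim}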
Simultaneous iteration for real symmetric A
Aka Block power itern. \(\tuple{v_{i}}\) linearly independent; their matrix \(V^{(0)}\). \(\tuple{q_{i}}\) orth ev of A; cols of \(\tilde{Q}\).
Unstable. \why
Convergence
If \(|\ew_{1}| > .. > |\ew_{n}| \geq |\ew_{n+1}|..\), Orth basis of \(\linspan{A^{k}v_{1}^{(0)}, .. A^{k}v_{n}^{(0)}}\) converges to \(\linspan{q_{1}, .. q_{n}}\): take \(v_{i} = \sum_{j}a_{j}q_{j}\), do power iteration.
Alg
Take some \(Q^{0} = I\) or other orth cols, get \(Z = AQ^{(k-1)}\); get \(Q^{(k)}R^{(k)} = Z\). Defn: \(A^{(k)} = (Q^{(k)})^{T}AQ^{(k)}\), \(R’^{(k)} = \prod R^{(k)}\).
\(A^{k} = Q^{(k)}R’^{(k)}\): By induction: \(A^{k} = AQ^{(k-1)}R’^{(k-1)} = Q^{(k)}R^{(k)}R’^{(k-1)} = Q^{(k)}R’^{(k)}\).
QR iteration
Not QR factorization. Get \(Q^{(k)}R^{(k)} = A^{(k-1)}\);\ \(A^{(k)}=R^{(k)}Q^{(k)} = (Q^{(k)})^{T}A^{(k-1)}Q^{(k)}\): Similarity transformation. Works for all A with distinct \(|\ew_{i}|\); easy analysis for \(A=A^{T}\).
Defn: \(R’^{(k)} = \prod R^{(k)}\), \(Q’^{(k)} = \prod_{k} Q^{(k)}\): same as \(Q^{(k)}\) in Simult itern alg.
\exclaim{Stable - finding \(\ew\) now routine!}
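A sketch of the plain, unshifted QR iteration (no Hessenberg reduction, shifts, or deflation, so it is slow); for real symmetric A with distinct \(|\ew_{i}|\), the diagonal converges to the eigenvalues:
\begin{verbatim}
import numpy as np

def qr_iteration(A, iters=500):
    Ak = np.array(A, dtype=float)
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                    # = Q^T A_k Q, a similarity transformation
    return np.diag(Ak)                # eigenvalue estimates
\end{verbatim}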
Convergence for real symmetric A
Same as Simultaneous iteration starting with I. \ \(A^{k} = Q^{(k)}R’^{(k)}\): So, finds orth bases for \(A^{k}\).
\(A^{(k)} = (Q’^{(k)})^{T}AQ’^{(k)}\); \(A^{(k)}_{i,i}\) are \(R(Q’^{(k)}_{:,i})\); as the \(Q’^{(k)}_{:,i}\) converge, \(A^{(k)}_{i,i} \to \ew_{i}\), off diagonal entries tend to 0; so approaches Schur factorization.
Linear convergence rate: \(\max_{j}|\frac{\ew_{j+1}}{\ew_{j}}|\).
Modified QR alg
Tridiagonalize: \(A^{(0)} = (Q^{(0)})^{T}AQ^{(0)}\). Pick shift \(p^{(k)}\); get \(Q^{(k)}R^{(k)}=A^{(k-1)} - p^{(k)}I\); get \(A^{(k)} = R^{(k)}Q^{(k)} + p^{(k)}I\); if any off diagonal element is close to 0, take \(\mat{A_{1} & 0 \ 0 & A_{2} }\), deflate and apply QR to each block.
Needs O(m) iterations costing \(O(m^{2})\) flops each. \why
Stability
Aka Eigenvalue perturbation theory.
\part{Sparse, large matrix algebra}
Iterative Linear algebra methods
Unlike direct methods. To solve Ax=b and Ax=lx. Eg: Conjugate gradient, Lanczos, Arnoldi.
Exploiting sparsity: Use only Ax
Use black box access to methods which find Ax and \(A^{*}x\); minimize these calls. Thus, take advantage of sparsity.
Flops
Dense computations: O(m) steps, \(O(m^{2})\) ops per step, total flops \(O(m^{3})\): also the worst case for iterative methods. But iterative methods generally need only O(m) or \(O(m^{2})\) flops.
Accuracy
Converge geometrically to \(\eps_{mach}\); direct methods make no progress until all \(O(m^{3})\) ops are completed.
Krylov sequences and subspaces of A and b
\(b, Ab, A^{2}b ..\). Krylov subspaces \(K_{r}(A, b)\) are spanned by successively larger groups of these. Can also form Krylov matrix. Orthogonalization (perhaps Gram Schmidt style) used between iterations in order to avoid erroneous linear dependence.
Convergence rate depends on spectral properties of A.
Analysis closely related to approximation of f(x) by polynomials on subsets of the complex plane. \why
Projection of A to Krylov subspaces
Reduces the problem to problems in \(1, 2 ..\) dimensions. Look at the action of A in \(Ax=b\) or \(Ax = lx\), restricted to Krylov subspace. Approximation gets closer as number of iterations \(n \to m\).
Iteratively find ortho-basis of K(A, b)
Triangular orthogonalization
(Arnoldi). Direct method used Orthogonal triangularization (Householder); now use Triangular Orthogonalization (Gram Schmidt): can be stopped in the middle for a partial solution \(Q_n\). Also, use a trick: take the current orthogonal basis Q, set next column \(q_{i+1}\) = the normalized component of \(Aq_i \perp (q_1 .. q_i)\). Thus, \(q_{i+1} \perp [b\ Ab\ ..\ A^{i-1}b]\) also.
Description using H
Start with arbit b; normalize to get \(q_{1}\); for any n, for \(j \leq n\) get: \(h_{j,n} = \dprod{q_{j},Aq_{n}}\); get: \(h_{n+1,n} = \norm{v} = \norm{Aq_{n} - \sum_{j=1}^{n}h_{j,n}q_{j}}\): Do the subtraction mgs style when each \(h_{j,n}\) is found; find \(q_{n+1} = \frac{v}{h_{n+1,n}}\). \(Aq_{n} = Q_{n+1}h_{n} = \sum_{i=1}^{n+1} q_{i}h_{i, n}\).
So, you have an almost upper triangular / upper Hessenberg \((n+1)\times n\) matrix \(\hat{H}_{n}\). Get \(AQ_{n} = Q_{n+1}\hat{H}_{n}\).
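A sketch of the Arnoldi loop with MGS-style orthogonalization, returning \(Q_{n+1}\) and \(\hat{H}_{n}\) (names mine; it stops early on breakdown, i.e. when the Krylov subspace becomes A-invariant):
\begin{verbatim}
import numpy as np

def arnoldi(A, b, n):
    m = len(b)
    Q = np.zeros((m, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):
        v = A @ Q[:, k]
        for j in range(k + 1):                 # MGS-style subtraction of h_{j,k} q_j
            H[j, k] = Q[:, j] @ v
            v -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] < 1e-14:                # breakdown: K_{k+1} is A-invariant
            return Q[:, :k + 1], H[:k + 1, :k]
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H
\end{verbatim}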
Square H
Take \(\hat{H}_{n}\) cut to \(n\times n\): \(H_{n}\), the Ritz matrix; get \(H_{n} = Q_{n}^{*}AQ_{n}\); Also: \(H_{n} = Q_{n}^{*}Q_{n+1}\hat{H}_{n}\).
\exclaim{As if going towards Similarity transformation of A to \htext{\(H = Q^{*}AQ\)}{..}!} Maybe \(m \to \infty\): So maybe only solving for first \(n<m\) cols of Q in \(Q^{*}AQ=H\).
\htext{\(H_{n}\)}{..} as Projection of A to Krylov subspaces
Representation in basis \(\set{q_{1} .. q_{n}}\) of the orthogonal projection of operator A onto \(K_{n}\): The shadow of operation of A in \(K_{n}\). Consider operator \(K_{n} \to K_{n}\): Take \(v = Q_{n}x \in K_{n}\): so x in basis \(Q_n\); apply A: \(A Q_n x\); project Av back to \(K_{n}\): \(Q_n^{*} A Q_n x = H_n x\).
Can’t get operator A back from \(H_{n}\).
Hermitian case: tridiagonal form
(Lanczos): Arnoldi iteration for \(A = A^{*}\). \(H_{n}\) becomes tridiagonal \(T_{n}\) with \(\set{a_{i}}\) in the diagonal and \(\set{b_{i}}\) in the 1st super and sub diagonals. 3 term recurrence: \(Aq_{n} = b_{n-1}q_{n-1} + a_{n}q_{n}+ b_{n}q_{n+1}\).
Reduction of arbit A to tridiagonal form
If A symmetric, \(A = QTQ^{*}\) for tridiagonal T; if A asymmetric, gotto give up orthogonality or tridiagonality: so find \(A = VTV^{-1} = VTW^{T}\) for tridiagonal but non symmetric T with diagonals \(\set{d_{2}, ..}, \set{a_{1}, ..}, \set{b_{2}, ..}\). Solve AV = VT and \(W^{T}A = TW^{T}\): 3 term recurrences.
Eigenvalue estimation
Aka Rayleigh Ritz procedure. \((H_{n})_{i,j} = q_{i}^{*}Aq_{j}\). As \(Aq_{j} \in K_{j+1}\), for \(i>j+1\), \((H_{n})_{i,j} = 0\). \((H_{n})_{j,j} = r(q_{j}) = q_{j}^{*}Aq_{j}\).
ew of \(H_{n}\) called Arnoldi ew estimates of A, or Ritz values of A wrt \(K_{n}\). ev of \(H_{n}\) correspond to the stationary points of r(x) restricted to \(K_{n}\): \(r(x) = r(Q_{n}y) = \frac{y^{T}Q_{n}^{T}AQ_{n}y}{y^{T}y} = \frac{y^{T}H_{n}y}{y^{T}y}\); \(\gradient_{y} r(Q_{n}y) = 0\) iff y is an ev of \(H_{n}\), with ew \(r(y)\).
Directly solve Ax = b
Can use triangular triangularization (Gaussian elimination): but sparsity could be spoilt by row operations: so try to order elimination steps to minimize the number of non zeros added.
For symmetric +ve definite A
A can be interpreted to be the precision matrix of a Gaussian distribution \(Pr(x) \propto e^{-2^{-1}x^{T}Ax + \mean^{T}Ax}\), whose graphical model G is sparse. Then, solving \(A\mean = b\) is the same as finding the mode of Pr(x). Can use loopy belief propagation to solve Ax = b. This is exact when G is a tree; the updates correspond to triangular triangularization.
Iteratively solve Ax=b
The optimization view
Error \(e_{n} = x-x_{n}\) vs residual \(r_{n} = b - Ax_{n}\).
View solving Ax=b as an optimization problem: minimize something like this at every iteration.
For dense A
See linear algebra ref.
For SPD A
See later section.
Generalized minimum residuals (GMRES)
Residue minimization in a subspace
At step n, approximate x by the \(x_{n} \in K_{n}(A, r_{0})\) which minimizes the 2-norm of the residual \(r_{n} = b - Ax_{n}\): minimize \(\norm{AQ_{n}y - b} = \norm{Q_{n+1}\hat{H}_{n}y-b}_{2} = \norm{\hat{H}_{n}y-Q_{n+1}^{*}b}_{2} = \norm{\hat{H}_{n}y-\norm{b}e_{1}}\). So oblique projection of problem.
Alg
At each step of Arnoldi itern to find orthobasis of \(K_{n}(A, r_{0})\): Find \(y_{n} = \argmin_{y} \norm{\hat{H}_{n}y-\norm{b}e_{1}}\), \(x_{n}= Q_{n}y_{n}\).
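A minimal unrestarted GMRES sketch combining the Arnoldi step with the small Hessenberg least-squares problem above (no preconditioning; function name and tolerances are my illustrative choices):
\begin{verbatim}
import numpy as np

def gmres_minimal(A, b, n_steps=50, tol=1e-10):
    m = len(b)
    beta = np.linalg.norm(b)
    Q = np.zeros((m, n_steps + 1))
    H = np.zeros((n_steps + 1, n_steps))
    Q[:, 0] = b / beta                          # x_0 = 0, so r_0 = b
    x = np.zeros(m)
    for n in range(n_steps):
        v = A @ Q[:, n]
        for j in range(n + 1):                  # Arnoldi step, MGS style
            H[j, n] = Q[:, j] @ v
            v -= H[j, n] * Q[:, j]
        H[n + 1, n] = np.linalg.norm(v)
        # y_n = argmin_y || Hbar_n y - ||b|| e_1 ||_2  (small least-squares problem)
        rhs = np.zeros(n + 2)
        rhs[0] = beta
        y, *_ = np.linalg.lstsq(H[:n + 2, :n + 1], rhs, rcond=None)
        x = Q[:, :n + 1] @ y
        if H[n + 1, n] < 1e-14 or np.linalg.norm(b - A @ x) < tol * beta:
            break                               # breakdown (exact solution) or converged
        Q[:, n + 1] = v / H[n + 1, n]
    return x
\end{verbatim}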
CGNR
Conjugate Gradients for minimizing residue using Normal eqns. Take Ax = b, get \(A^{*}Ax = A^{*}b\); then apply CG. Relationship with old r: \(\hat{r} = A^{*}r\). Ortho projection of problem under normal matrix.
But \(k(A^{*}A) = k(A)^{2}\): Slower convergence.
CGNE
Take \(A^{T}u = x\), get \(AA^{T}u = b\). \(AA^{T}\) is SPD; now apply CG. Residue r same; new direction vector \(\hat{p} = A^{-T}p\). Get monotonic decay in error vector. Ortho projection of problem under normal matrix.
Ax=b, SPD A
Many practical applications.
Reduction to quadratic programming
Consider \(A \succ 0\). Then, \(q(x) = (0.5) x^{T}Ax - b^{T}x + c\) is strictly convex, and has its minimum where \(\gradient q(x) = Ax - b = 0\).
So, can now use ideas from quadratic programming to solve Ax = b! Can try to minimize q(x) iteratively! Move in the direction of \(-\gradient q(x)\).
Conjugate gradients (CG) for SPD A
Error minimization in a subspace
Consider iteration n, restricted to subspace \(K_n\). For \(x_n \in K_n\): take \(e_{n} = x - x_{n}\); solve the optimization problem: minimize \(\norm{e_{n}}_{A}^{2} = x_{n}^{T}Ax_{n} - 2x_{n}^{T}b + x^{T}Ax = 2f(x_{n}) + k\), where \(f(x_{n}) = \frac{1}{2}x_{n}^{T}Ax_{n} - x_{n}^{T}b\).
Find \(x_{n} = \argmin_{x_{n} \in K_{n}} f(x_{n})\): so minimizing only the shadow of f in \(K_{n}\). f is strictly convex and \(K_n\) is a convex set, so it has a unique minimum there.
Actual problem was to minimize q(x). Here, solving the ortho projection of problem under A.
Iteration
\(x_{0} = 0, p_{0} = r_{0} = b\). \(x_{n} = x_{n-1} + a_{n}p_{n-1}\).
Step size \(a_{n} = \frac{r_{n-1}^{T}r_{n-1}}{p_{n-1}^{T}Ap_{n-1}}\): from the line search along \(p_{n-1}\): \(p_{n-1}^{T}\gradient f(x_{n}) = p_{n-1}^{T}(A(x_{n-1} + a_{n}p_{n-1}) - b) = a_{n}p_{n-1}^{T}Ap_{n-1} - p_{n-1}^{T}r_{n-1}= 0\), with \(p_{n-1}^{T}r_{n-1} = r_{n-1}^{T}r_{n-1}\).
Residual \(r_{n} = b - Ax_{n} = r_{n-1} - a_{n}Ap_{n-1}\); direction \(p_{n} = r_{n} + c_{n}p_{n-1}\); improvement in current step \(c_{n} = \frac{(r_{n}^{T}r_{n})}{(r_{n-1}^{T}r_{n-1})}\).
From iteration: \(K_{n} = \linspan{x_{1}, .. x_{n}} = \linspan{p_{0}, .. p_{n-1}}\).
\(\norm{e_{n-1}}_{A} \geq \norm{e_{n}}_{A}\). \why
Search directions are A conjugate: \(p_{n}^{T}Ap_{j} = 0\). \why \(r_{n}^{T}r_{j}=0\). \why
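A sketch of the CG loop exactly as written above (\(x_{0}=0, p_{0}=r_{0}=b\)); only matrix-vector products with A are needed, so sparse A works unchanged (function name mine):
\begin{verbatim}
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    n = len(b)
    maxiter = maxiter if maxiter is not None else n
    x = np.zeros(n)
    r = b.astype(float).copy()        # r_0 = b - A x_0 = b
    p = r.copy()                      # p_0 = r_0
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        a = rs / (p @ Ap)             # step size a_n
        x += a * p
        r -= a * Ap                   # r_n = r_{n-1} - a_n A p_{n-1}
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p     # c_n = r_n^T r_n / r_{n-1}^T r_{n-1}
        rs = rs_new
    return x
\end{verbatim}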
Conjugate residues (CR) for symmetric A
Minimize \(\norm{Ax-b}_{A}\): \ oblique projection of the problem under A.
\tbc
Biconjugate gradients (BCG) method for non singular A
Do CG on 2 systems together: Ax = b and \(A^{*}x^{*} = b^{*}\). If A is near-symmetric, get good convergence.
\tbc
Preconditioning
Uses
Maybe k(A) very high in \(Ax = b\); so, get an equivalent, better conditioned problem. Or, maybe want to turn it into an equivalent problem which is easier to solve.
Left preconditioning
Take \(M \in S_{++}\); \(M^{-1}Ax = M^{-1}b\). \ If \(M^{-1} \approx A^{-1}\), then \(k(M^{-1}A) \approx 1\).
Right preconditioning
Take \(AM^{-1}Mx = AM^{-1}u = b\).
Shift preconditioning
Take \(M_{L}^{-1}AM_{R}^{-1}M_{R}x = M_{L}^{-1}b\). Want \(M_{L}^{-1}AM_{R}^{-1}\) well conditioned.
Traits of good preconditioners
Work involved is actually in solving \(My = b\), so usually want M very sparse. But, at the same time, want M to be close to A, which may be quite dense.
Joint Condition Number of M and A
k(A,M)
Finding left preconditioning matrix M
Using A=LU
Get \(A \approx LU\), get \(A^{-1} \approx M = U^{-1}L^{-1}\).
Use Incomplete LU factorization (ILU) with 0 pattern P: \(A^{-1} = U^{-1}L^{-1}\) can kill sparsity, so sacrifice accuracy for sparsity. Alter LU alg to keep sparsity.
ILU(0): P same as 0’s in A.
ILU(p): Keep a level of fill \(l_{ij}\), the k for which \(|a_{ij}| \approx \eps^{k}<1\); drop items with \(l_{ij}>p\). Heuristic which avoids computing logs: init value: \(l_{ij} = 0\) if \(a_{ij}\neq 0\), else \(\infty\); update while putting a 0: \(l_{ij} := \min (l_{ij}, l_{ik}+ l_{kj} + 1)\): \(l_{ik}+ l_{kj}\) corresponds to the change in level. The same levels result for every matrix with the same sparsity pattern.
ILU (Thresholding): skip a row op when the multiplier is near 0; keep only the p largest pre- and post-diagonal entries in each row.
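A sketch of ILU preconditioning with SciPy's spilu used as the action of \(M^{-1}\) inside GMRES (the drop tolerance, fill factor, and test matrix are arbitrary illustrative choices):
\begin{verbatim}
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, gmres, LinearOperator

n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)      # sparse approximate LU factors of A
M = LinearOperator((n, n), matvec=ilu.solve)       # black-box action of M^{-1}
x, info = gmres(A, b, M=M)                         # info == 0 on convergence
\end{verbatim}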
Use support graph theory
M is symmetric and +ve semidefinite.
\tbc
Find sparse M to min \htext{\(\norm{MA-I}_{F}\)}{..}
Start with random M, do gradient descent. Let \(F(M) = \norm{MA-I}_{F}\); Use Frechet derivative: \(\frac{F(M+E) - F(M)}{\norm{E}}\).
\tbc
Using preconditioners
Preconditioned conjugate gradient (PCG) for SPD A
Left preconditioning: \(M^{-1}A\) is SPD under \(\dprod{.,.}_{M}\). Can adapt CG to this: always use \(\dprod{.,.}_{M}\), residue \(z = M^{-1}r\). Replace new vars with old vars to get PCG alg.
Use \(M=LL^{T}\), \(\hat{p}_{j} = p_{j}\) etc.. to see that split PCG is equivalent to PCG.
Right preconditioning: \(AM^{-1}\) is SPD under \(\dprod{.,.}_{M^{-1}}\). Get alg equivalent to PCG.
Similarly, can apply PCG on CGNR and CGNE.
Preconditioning GMRES
Simply solve the preconditioned form of the eqn using GMRES.
Left and right preconditioning not equivalent. Right preconditioning minimizes actual, unskewed residue.