Matrix approximation

Approximating a matrix

The problem

Take a set of matrices S. We want to find \(\argmin_{B \in S} \norm{A - B}\) with respect to some norm \(\norm{.}\) and some set S.

The error metric

Often, \(\norm{.}\) is an operator norm, as this ensures that \(\norm{(A-B)x}\) is small with respect to the corresponding vector norm. Other times, as in matrix completion problems, it may be desirable for \(\norm{.}\) to be the Frobenius norm.

Low rank (k) factorization problem

In many problems, S is the set of matrices of rank at most k, where k is small.

Often, rather than computing B itself, we prefer the factored form \(A \approx B = XY^{T}\), where X and Y have \(k\) columns each (so \(rank(B) \leq k\)): \(A\) may be sparse, but the best B may be dense, so we may run out of memory while storing B, and out of time while computing Bx. One can always recover \(B = XY^{T}\) from B just by taking the SVD of B.
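A minimal numpy sketch (sizes are illustrative) of why the factored form pays off: Bx is applied as \(X(Y^{T}x)\), never forming the dense B.

```python
import numpy as np

# Illustrative sizes: forming B = X @ Y.T would need m*n floats,
# while the factors need only (m + n)*k.
m, n, k = 1000, 800, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((m, k))
Y = rng.standard_normal((n, k))
x = rng.standard_normal(n)

# B @ x computed as X @ (Y.T @ x): O((m + n) k) work instead of O(m n).
Bx = X @ (Y.T @ x)
```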

Restriction to subspace view

Similarly, we may want the restriction of \(A\) to a low dimensional subspace: \(B = QQ^{T}A\), where Q has few orthonormal columns; it should be the best subspace, i.e. the one which minimizes the error. A good Q tries to approximate the range space of A.

Sparse approximation

Maybe we want \(\min_B \norm{B-A}_{F}: \norm{B}_{0} \leq k\). This can be solved just by setting B = \(A\), and then zeroing all but the k largest-magnitude entries \(B_{i, j}\) of B.
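A small sketch of this keep-the-k-largest solution, assuming numpy (the helper name is made up):

```python
import numpy as np

def topk_hard_threshold(A, k):
    """Best B with at most k nonzeros (entrywise/Frobenius error):
    keep the k largest-magnitude entries of A, zero the rest."""
    B = np.zeros_like(A)
    idx = np.argsort(np.abs(A), axis=None)[-k:]  # flat indices of top-k |A_ij|
    B.flat[idx] = A.flat[idx]
    return B
```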

1 norm regularized form

Even the optimization problem \(\min_B \norm{B-A}_{F}^{2} + l\norm{B(:)}_{1}\) leads to sparse solutions. This form is useful in some sparse matrix completion ideas. Solution: just take B = A, shrink every entry towards zero by \(l/2\), and drop entries with \(|A_{i,j}| \leq l/2\): +++(soft thresholding!)+++

Pf: \(\min_B \sum_{i, j}(B_{i,j} - A_{i, j})^{2} + l\sum_{i, j} sgn(B_{i, j})B_{i, j}\). Let B' be the solution to this. If \(A_{i, j} \geq 0\), so is \(B'_{i, j}\); if \(A_{i, j} \leq 0\), so is \(B'_{i, j}\): if they opposed in sign, setting \(B'_{i, j} = 0\) would definitely lower the objective. So, we get the equivalent problem \(\min_B \sum_{i, j}(B_{i,j} - A_{i, j})^{2} + l\sum_{i, j} sgn(A_{i, j})B_{i, j}\) subject to \(sgn(B) = sgn(A)\). The new objective is differentiable; its optimality condition is \(B_{i, j} = A_{i, j} - (l/2) sgn(A_{i, j})\), but the feasible set only includes \(sgn(B) = sgn(A)\). So if \(|A_{i, j}| \leq l/2\), the feasible B closest to the unconstrained minimum has the corresponding \(B_{i, j} = 0\).
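The resulting soft-thresholding rule as a numpy sketch (the function name is illustrative):

```python
import numpy as np

def soft_threshold(A, l):
    """Entrywise minimizer of ||B - A||_F^2 + l * ||B(:)||_1:
    shrink each entry towards 0 by l/2; entries with |A_ij| <= l/2 become 0."""
    return np.sign(A) * np.maximum(np.abs(A) - l / 2.0, 0.0)
```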

Best rank t approximation of A from SVD

The approximation

Let \(A_{t} = \sum_{j=1}^{t} \sw_{j}u_{j}v_{j}^{*}\) be the approximation. \(A_{t}\) is the best rank t approximation to \(A\) (wrt both \(\norm{.}_{2}\) and \(\norm{.}_{F}\)), and captures the maximum possible energy of \(A\): \(\norm{A-A_{t}} = \inf_{rank(B) \leq t} \norm{A-B}\).

Approximation error

Then, the approximation error is \(A-A_{t} = \sum_{i=t+1}^{r}\sw_{i}u_{i}v_{i}^{*}\). As \(\norm{X}_{2} = \sw_{1}(X)\): \(\norm{A-A_{t}}_{2} = \sw_{t+1}\), \(\norm{A-A_{t}}_{F} = \sqrt{\sum_{i=t+1}^{r} \sw_{i}^{2}}\).
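A quick numerical check of these error formulas, assuming numpy (indices are 0-based, so s[t] is \(\sw_{t+1}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 30))
t = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_t = U[:, :t] * s[:t] @ Vt[:t]  # truncated SVD: sum of the first t terms

assert np.isclose(np.linalg.norm(A - A_t, 2), s[t])      # spectral error
assert np.isclose(np.linalg.norm(A - A_t, 'fro'),
                  np.sqrt(np.sum(s[t:] ** 2)))           # Frobenius error
```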

Geometric interpretation

Approximate the hyperellipsoid (the image of the unit sphere under \(A\)) by a line segment, a 2-dimensional ellipse, etc.

Approximating domain and range spaces

Using the SVD, \(A\) can be viewed as a composition of rotation and diagonal scaling matrices. So, getting a low rank approximation of \(A\) can be viewed as first getting low rank approximations of the range and domain spaces with orthogonal basis matrices \(U_t\) and \(V_t\) respectively, and then finding a small square core \(S = U_t^{T}AV_t\) such that \(A \approx U_t S V_t^{T}\).
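A small numpy check of this view: with \(U_t, V_t\) taken from the SVD, the core \(S\) is just the diagonal of the leading singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 30))
t = 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_t, V_t = U[:, :t], Vt[:t].T
S = U_t.T @ A @ V_t                      # t*t core; here diag(s[:t])

assert np.allclose(S, np.diag(s[:t]))
assert np.allclose(U_t @ S @ V_t.T, U_t * s[:t] @ Vt[:t])  # equals A_t
```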

Proof

Proof by \(\contra\): suppose \(rank(B) \leq t\) with \(\norm{A-B}_{2} < \sw_{t+1}\). Then \(dim(N(B)) \geq n-t\), and \(\forall w \in N(B): \norm{Aw} = \norm{(A-B)w} < \sw_{t+1}\norm{w}\); but there exists a (t+1)-dimensional subspace \(\set{v: \norm{Av}\geq \sw_{t+1}\norm{v}}\), spanned by \(v_{1} \dots v_{t+1}\); \(dim(N(B)) + (t+1) \geq n+1\), so the two subspaces intersect nontrivially: absurd.

So, $$\sw_{k} = \min_{S \subset C^{n}, dim(S) = n-k+1}\ \max_{x \in S, \norm{x}_{2} = 1} \norm{Ax}_{2} = \max_{S \subset C^{n}, dim(S) = k}\ \min_{x \in S, \norm{x}_{2} = 1} \norm{Ax}_{2}.$$

Randomized Approach

Take the \(A \approx B = QQ^{T}A\) view. A good Q must approximately span \(U_k\). We can use something akin to the power method for finding eigenvectors. Take a random n*k matrix W; form \(Y = (AA^{T})^{q}AW = U\SW^{2q+1}V^{T}W\), and compute the factorization \(Y = QR\) to get Q. q = 4 or 5 is sufficient to get a good approximation of \(U_k\). (Tropp et al.) If you oversample and aim for a rank k+p approximation, you get low expected error.
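A numpy sketch of this randomized range finder; the oversampling p and the power-iteration count q are the tunable knobs mentioned above (defaults here are illustrative):

```python
import numpy as np

def randomized_range(A, k, p=5, q=4, seed=0):
    """Return Q with k + p orthonormal columns approximately spanning range(A)."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, k + p))   # random test matrix, oversampled by p
    Y = A @ W
    for _ in range(q):                    # power iterations: Y = (A A^T)^q A W
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for range(Y)
    return Q

# Then B = Q @ (Q.T @ A) is the rank-(k + p) approximation Q Q^T A.
```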

With few observations in A only

Same as the missing value estimation problem.

Missing value estimation

The problem

Maybe you have the matrix \(A\), but have observed only a few entries, indexed by the set \(O\). If there were no other conditions, there would be infinitely many solutions. But maybe you also know that \(A\) has some special structure: low rank, block structure, smoothness, etc.

This can also be viewed as getting an approximation \(B \in S\) for \(A\), where S is the set of matrices having the specified structure.

Smoothness constraint

Solve \(\min_B \sum_{i,j} \left[d(B_{i, j}, B_{i-1,j}) + d(B_{i, j}, B_{i,j-1})\right]\) such that \(B_{i,j} = A_{i,j}\ \forall (i,j) \in O\). If we use the l2 or l1 metric for d(., .), this can be solved with convex optimization, as in the sketch below.
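A sketch with the l1 metric for d(., .), assuming the cvxpy library (the observed set O and the sizes are illustrative):

```python
import cvxpy as cp
import numpy as np

m, n = 10, 10
A = np.random.rand(m, n)              # ground truth (illustrative)
O = [(0, 0), (3, 4), (7, 2)]          # observed index set (illustrative)

B = cp.Variable((m, n))
# d(x, y) = |x - y|, summed over vertical and horizontal neighbours
smoothness = cp.sum(cp.abs(B[1:, :] - B[:-1, :])) + \
             cp.sum(cp.abs(B[:, 1:] - B[:, :-1]))
constraints = [B[i, j] == A[i, j] for (i, j) in O]
cp.Problem(cp.Minimize(smoothness), constraints).solve()
```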

Low rank constraint

\(rank(B) \leq k\): this is not a convex constraint.

Singular value projection (SVP)

(Jain, Meka, Dhillon) Set \(B^{(0)}_{i,j} = A_{i,j}\ \forall (i,j) \in O\), and set the remaining values in \(B^{(0)}\) arbitrarily. Then, in iteration \(s\), take the SVD of \(B^{(s)}\) to get the best rank k approximation \(B^{(s+1)}\), and re-impose \(B^{(s+1)}_{i,j} = A_{i,j}\ \forall (i,j) \in O\).
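A numpy sketch of the iteration as described above (the fixed iteration count is an assumption; one would normally test convergence on the observed entries):

```python
import numpy as np

def svp_complete(A_obs, mask, k, iters=100):
    """A_obs: matrix with observed entries filled in; mask: True on O."""
    B = np.where(mask, A_obs, 0.0)            # arbitrary init off O
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        B = U[:, :k] * s[:k] @ Vt[:k]         # project onto rank-k matrices
        B[mask] = A_obs[mask]                 # re-impose observed entries
    return B
```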

Projection viewpoint

Take \(S_1 = \set{B: B_{i,j} = A_{i,j}\ \forall (i,j) \in O},\ S_2 = \set{B: rank(B)\leq k}\). You start with \(B_0 \in S_1\), project it to the closest \(C \in S_2\), project C to the closest \(B_1 \in S_1\), etc.

Applications

The Netflix problem.