Matrix approximation

Approximating a matrix

The problem

Take a set of matrices S. Want \(\arg\min_{B \in S} \norm{A-B}\): the error is minimized wrt some norm \(\norm{.}\) and some set S.

The error metric

Often, \(\norm{.}\) is an operator norm, as this ensures that \(\norm{(A-B)x}\) is small relative to \(\norm{x}\) in the corresponding vector norm. Other times, as in the case of matrix completion problems, it may be desirable for \(\norm{.}\) to be the Frobenius norm.
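
For concreteness, a tiny numpy sketch (arbitrary matrices, purely illustrative) computing both metrics and checking the operator-norm guarantee on a unit vector:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
B = A + 0.01 * rng.standard_normal((5, 4))    # some approximation of A

spectral_err  = np.linalg.norm(A - B, 2)      # operator (spectral) norm
frobenius_err = np.linalg.norm(A - B, 'fro')  # Frobenius norm

# The operator norm bounds the error on any unit vector x:
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
assert np.linalg.norm((A - B) @ x) <= spectral_err + 1e-12
```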

Low rank (k) factorization problem

In many problems, S is the set of rank k matrices, where k is small.

Often, we prefer to keep \(A \approx B = XY^{T}\) in factored form, where rank(X), rank(Y) \(\leq k\), rather than computing B explicitly: A may be sparse, but the best B may be dense, so we may run out of memory while storing B, and out of time while computing Bx. You can always get the factorization \(B = XY^{T}\) just by taking the SVD of B.
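
A small numpy sketch (dimensions made up) of why the factored form is preferable for storage and for matrix-vector products:

```python
import numpy as np

m, n, k = 10_000, 8_000, 20
rng = np.random.default_rng(0)
X = rng.standard_normal((m, k))   # in practice X, Y would come from a truncated SVD
Y = rng.standard_normal((n, k))
x = rng.standard_normal(n)

# Forming B = X @ Y.T needs m*n floats; the factors need only (m + n)*k.
# The matrix-vector product also stays cheap: O((m + n)k) instead of O(mn).
Bx = X @ (Y.T @ x)
```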

Restriction to subspace view

Similarly, we may want the restriction of A to a low-dimensional subspace: \(B = QQ^{T}A\), where Q is a tall matrix with a few orthonormal columns; but it should be the best such subspace, the one which minimizes the error. A good Q tries to approximate the range space of A.

Sparse approximation

Maybe we want \(\min_B \norm{B-A}_2 : \norm{B}_0 \leq k\). This can be solved just by setting B = A, and then keeping only the k largest-magnitude entries \(B_{i,j}\), zeroing out the rest.
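
A minimal numpy sketch of this hard-thresholding step (the helper name is hypothetical):

```python
import numpy as np

def sparsify_keep_k(A, k):
    """Keep only the k largest-magnitude entries of A, zeroing out the rest."""
    B = np.zeros_like(A)
    if k > 0:
        flat = np.argsort(np.abs(A), axis=None)[-k:]   # indices of the k largest |A_{i,j}|
        idx = np.unravel_index(flat, A.shape)
        B[idx] = A[idx]
    return B

A = np.array([[3.0, -0.1], [0.2, -2.0]])
print(sparsify_keep_k(A, 2))   # keeps 3.0 and -2.0
```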

\(l_1\) norm regularized form

Even the optimization problem \(\min_B \norm{B-A}_{F}^{2} + l\norm{B(:)}_1\) leads to sparse solutions. This form is useful in some sparse matrix completion ideas. Solution: just take B = A, drop the entries with \(|A_{i,j}| \leq l/2\), and shrink the rest towards 0 by \(l/2\) (soft thresholding!).

Pf: \(\min_B \sum_{i,j}(B_{i,j}-A_{i,j})^2 + l\sum_{i,j} sgn(B_{i,j})B_{i,j}\). Let B' be the solution to this. If \(A_{i,j} \geq 0\), so is \(B'_{i,j}\); if \(A_{i,j} \leq 0\), so is \(B'_{i,j}\): if they opposed in sign, setting \(B_{i,j} = 0\) would definitely lower the objective. So, we get the equivalent problem: \(\min_B \sum_{i,j}(B_{i,j}-A_{i,j})^2 + l\sum_{i,j} sgn(A_{i,j})B_{i,j}\) subject to \(sgn(B) = sgn(A)\). The new objective is differentiable. Its optimality condition (taking \(A_{i,j} \geq 0\)) is \(B_{i,j} = A_{i,j} - l/2\); but the feasible set only includes \(sgn(B) = sgn(A)\). So if \(A_{i,j} - l/2 \leq 0\), the feasible B closest to the unconstrained minimizer has the corresponding \(B_{i,j} = 0\).
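
A minimal numpy sketch of the resulting soft-thresholding rule (the helper name is mine, not from the source):

```python
import numpy as np

def soft_threshold(A, l):
    """Entrywise minimizer of sum (B_ij - A_ij)^2 + l * |B_ij| (soft thresholding at l/2)."""
    return np.sign(A) * np.maximum(np.abs(A) - l / 2.0, 0.0)

A = np.array([[1.0, -0.3], [0.05, 2.0]])
print(soft_threshold(A, 1.0))   # entries with |A_ij| <= 0.5 vanish, the rest shrink by 0.5
```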

Best rank t approximation of A from SVD

The approximation

Let \(A_t = \sum_{j=1}^{t} \sigma_j u_j v_j^{T}\) be the approximation. \(A_t\) is the best rank t approximation to A (wrt \(\norm{.}_{2}\), \(\norm{.}_{F}\)) and captures the maximum possible energy of A: \(\norm{A-A_t} = \inf_{rank(B) \leq t} \norm{A-B}\).

Approximation error

Then, the approximation error is \(A - A_t = \sum_{i=t+1}^{r} \sigma_i u_i v_i^{T}\). As \(\norm{X}_2 = \sigma_1(X)\): \(\norm{A-A_{t}}_{2} = \sigma_{t+1}\), \(\norm{A-A_{t}}_{F} = \sqrt{\sum_{i=t+1}^{r} \sigma_{i}^{2}}\).
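
A short numpy check of these error formulas on an arbitrary matrix (a sketch, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

t = 2
A_t = U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]   # best rank-t approximation

assert np.isclose(np.linalg.norm(A - A_t, 2), s[t])                           # = sigma_{t+1}
assert np.isclose(np.linalg.norm(A - A_t, 'fro'), np.sqrt(np.sum(s[t:]**2)))  # tail energy
```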

Geometric interpretation

Approximate the hyperellipse (the image of the unit sphere under A) by a line segment, a 2-dimensional ellipse, etc.

Approximating domain and range spaces

Using the SVD, A can be viewed as a composition of rotations and a diagonal scaling. So, getting a low rank approximation of A can be viewed as first getting low-dimensional approximations of the range and domain spaces, with orthonormal basis matrices \(U_t\) and \(V_t\) respectively, and then finding a square \(S = U_t^{T} A V_t\) such that \(A \approx U_t S V_t^{T}\).
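
A short numpy sketch checking this view on a random matrix: with \(U_t\), \(V_t\) taken from the SVD, the core \(S = U_t^{T} A V_t\) is just the leading \(t \times t\) block of \(\Sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

t = 2
U_t, V_t = U[:, :t], Vt[:t, :].T
S = U_t.T @ A @ V_t                       # square t x t core

assert np.allclose(S, np.diag(s[:t]))     # equals diag(sigma_1, ..., sigma_t)
assert np.allclose(U_t @ S @ V_t.T, U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :])   # = A_t
```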

Proof

Proof by contradiction (⇒⇐): suppose some rank t matrix B had \(\norm{A-B}_2 < \sigma_{t+1}\). Then \(dim(N(B)) \geq n - t\), and for any \(w \in N(B)\), \(\norm{Aw} = \norm{(A-B)w} < \sigma_{t+1}\norm{w}\); but on the (t+1)-dimensional subspace \(span\{v_1, \dots, v_{t+1}\}\), \(\norm{Av} \geq \sigma_{t+1}\norm{v}\); and \(dim(N(B)) + (t+1) \geq n+1\), so the two subspaces intersect nontrivially: absurd.

So, $$\sigma_{k} = \min_{S \subset C^{n},\, \dim(S) = n-k+1} \ \max_{x \in S,\, \norm{x}_2 = 1} \norm{Ax}_{2} \ = \max_{S \subset C^{n},\, \dim(S) = k} \ \min_{x \in S,\, \norm{x}_2 = 1} \norm{Ax}_{2}.$$

Randomized Approach

Take the \(A \approx B = QQ^{T}A\) view. A good Q must (approximately) span the column space of \(U_k\). Can use something akin to the power method for finding eigenvectors. Take a random \(n \times k\) matrix W (for \(A \in R^{m \times n}\)), form \(Y = (AA^{T})^{q}AW = U\Sigma^{2q+1}V^{T}W\), and compute the QR factorization \(Y = QR\) to get Q. q = 4 or 5 is usually sufficient to get a good approximation of \(U_k\). (Tropp et al.) If you aim for a rank \(k+p\) approximation (oversampling by p), you get low expected error.
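
A sketch of this randomized range finder in numpy (the oversampling parameter p and power-iteration count q are my parameter names; in practice one re-orthonormalizes between power iterations for numerical stability):

```python
import numpy as np

def randomized_range_finder(A, k, p=5, q=4, seed=0):
    """Return an orthonormal Q whose span approximates the top-(k+p) range of A."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    W = rng.standard_normal((n, k + p))   # random test matrix
    Y = A @ W
    for _ in range(q):                    # power iterations: Y = (A A^T)^q A W
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                # orthonormal basis of the sampled range
    return Q

# B = Q @ (Q.T @ A) is then a near-best rank-(k + p) approximation of A.
```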

With few observations in A only

Same as the missing value estimation problem.

Missing value estimation

The problem

Maybe you have the matrix A, but you have observed only a few entries, indexed by the set O. If there were no other conditions, there would be infinitely many solutions. But maybe you also know that A has some special structure: low rank, or block structure, or smoothness, etc.

Can also view this as getting an approximation \(B \in S\) for A, where S is the set of matrices having the specified structure.

Smoothness constraint

Solve \(\min_B \sum_{i,j} d(B_{i,j}, B_{i-1,j}) + d(B_{i,j}, B_{i,j-1})\) such that \(B_{i,j} = A_{i,j} \ \forall (i,j) \in O\). If we use \(l_2\) or \(l_1\) metrics for d(), this can be solved with convex optimization, as in the sketch below.
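
A hedged sketch of the \(l_1\) version using the cvxpy modeling library (the mask construction and problem sizes are made up for illustration):

```python
import cvxpy as cp
import numpy as np

m, n = 6, 6
rng = np.random.default_rng(0)
A_obs = rng.standard_normal((m, n))           # observed values (entries outside O are unused)
M = (rng.random((m, n)) < 0.4).astype(float)  # 0/1 mask of the observed set O

B = cp.Variable((m, n))
smoothness = cp.sum(cp.abs(B[1:, :] - B[:-1, :])) + cp.sum(cp.abs(B[:, 1:] - B[:, :-1]))
constraints = [cp.multiply(M, B) == M * A_obs]   # B_{i,j} = A_{i,j} on observed entries
cp.Problem(cp.Minimize(smoothness), constraints).solve()
print(B.value)
```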

Low rank constraint

\(rank(B) \leq k\): this is not a convex constraint.

Singular value projection (SVP)

(Jain, Meka, Dhillon) Set \(B^{(0)}_{i,j} = A_{i,j} \ \forall (i,j) \in O\), and set the remaining values in \(B^{(0)}\) arbitrarily. Then, in iteration t, do an SVD of \(B^{(t)}\) to get the rank k approximation \(B^{(t+1)}\), and set \(B^{(t+1)}_{i,j} = A_{i,j} \ \forall (i,j) \in O\).
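
A simplified numpy sketch of this iteration, alternating SVD truncation with resetting the observed entries (the full SVP algorithm of Jain, Meka, Dhillon also takes a gradient step with a step size, omitted here):

```python
import numpy as np

def svp_complete(A_obs, mask, k, n_iters=100):
    """Fill in A given observed entries (mask == True), targeting a rank-k matrix."""
    B = np.where(mask, A_obs, 0.0)                    # arbitrary (zero) fill of unobserved entries
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # project onto rank-k matrices
        B[mask] = A_obs[mask]                         # restore observed entries
    return B
```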

Projection viewpoint

Take \(S_1 = \{B : B_{i,j} = A_{i,j} \ \forall (i,j) \in O\}\), \(S_2 = \{B : rank(B) \leq k\}\). You start with \(B^{(0)} \in S_1\), project it to the closest \(C \in S_2\), project C to the closest \(B^{(1)} \in S_1\), etc.

Applications

The Netflix problem.