Definition
Fixed direction differential fn
Aka directional derivative.
Fixing the direction \(v\), \(D_v(f)\) can be taken to map \(x\) to \(D_v(f)(x)\). So, \(D_v(f): V \to F\) is a restricted version of the differential function \(D(f)\), with the direction argument fixed.
\(df(x; h) = D_h(f)(x)= \lim_{\change t \to 0} \frac{f(x+\change th) - f(x)}{\change t} = \frac{d}{dt}|_{t=0}f(x + th)\). Aka Gateaux differential.
Alternate notation: \(\gradient_{h}(f(x))\): not the gradient vector, but its application along the direction \(h\).
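A minimal numerical sketch of this limit definition, assuming NumPy is available; the functional \(f\), point \(x\) and direction \(h\) below are purely illustrative:

```python
import numpy as np

# Directional (Gateaux) derivative via the limit definition:
# D_h(f)(x) ~= (f(x + t*h) - f(x)) / t for small t.
def directional_derivative(f, x, h, t=1e-6):
    return (f(x + t * h) - f(x)) / t

# Illustrative functional: f(x) = sin(x_1) + x_1 * x_2.
f = lambda x: np.sin(x[0]) + x[0] * x[1]
x = np.array([1.0, 2.0])
h = np.array([0.5, -1.0])

# Analytic value: gradient f(x) = (cos(x_1) + x_2, x_1), applied along h.
analytic = h @ np.array([np.cos(x[0]) + x[1], x[0]])
print(directional_derivative(f, x, h), analytic)  # agree to about 1e-5
```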
Affine approximation view
This definition of the directional derivative is equivalent to defining \(D_h(f)\) as the function such that \(f(x + th) = f(x) + tD_h(f)(x) + o(t)\) as \(t \to 0\).
R to R case
In this special case, there is just one direction: \(1\).
Directional differentiability
If, at \(x\), the directional derivative exists in all directions, \(f\) is said to be Gateaux differentiable at \(x\).
The differential of \(f\) at the point \(x\) in the direction \(v\) is a function of two variables: \(x, v\). We regard \(D(f)(x): V \to F\), such that \(D(f)(x)[v] = \frac{d}{dt}|_{t=0} f(x + tv)\) is the directional derivative of \(f\) at \(x\) along \(v\).
So, \(D(f): V \to L(V, F)\), where \(L(V, F)\) is the space of continuous linear functionals \(l:V \to F\). The fact that \(D(f)(x)\) is a linear functional follows from the affine approximation view of the directional derivative.
But, this is unsatisfactory as directional differentiability does not imply continuity. \why
Continuous differentiability
If, at \(x\), \(\exists a\) such that \(\norm{f(x+c) - f(x) - a^{T}c} = o(\norm{c})\) as \(c \to 0\), then \(f\) is differentiable at \(x\); and the derivative is \(Df(x)[c] \dfn a^{T}c\), which maps \(V \to F\). +++(A measure of goodness of affine approximation!)+++ The view \(D(f): V \to L(V, F)\) still holds.
Aka Frechet derivative, total derivative.
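A small numerical sketch of the defining \(o(\norm{c})\) condition, assuming NumPy; the functional and its gradient below are illustrative choices:

```python
import numpy as np

# Check that ||f(x+c) - f(x) - a^T c|| / ||c|| -> 0 as c -> 0,
# with a = gradient f(x) known analytically for this illustrative f.
f = lambda x: x[0] ** 2 + np.exp(x[1])
grad_f = lambda x: np.array([2 * x[0], np.exp(x[1])])

x = np.array([1.0, 0.5])
a = grad_f(x)
c = np.array([0.3, -0.7])
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    cc = scale * c
    err = abs(f(x + cc) - f(x) - a @ cc)
    print(scale, err / np.linalg.norm(cc))  # the ratio shrinks with the scale
```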
Connection to directional differentiability
In non-pathological cases, both notions of differentiability are equivalent. That continuous (Frechet) differentiability implies directional differentiability follows from the definition. The converse can be seen by applying the polynomial approximation theorem to \(g(t) = f(x + th)\), \(g: R \to R\): \(f(x + th) = f(x) + t D_h(f)(x) + o(t)\) as \(t \to 0\).
Matrix functionals
Similar definition for differential functions for functionals over the vector space of matrices. Eg: See \(\gradient tr(f(X))\) in linear algebra ref.
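A sketch for a matrix functional, assuming NumPy; it checks entrywise, by finite differences, that the gradient of the illustrative functional \(f(X) = tr(AX)\) wrt \(X\) is \(A^{T}\) (since \(d\, tr(AX) = tr(A\, dX)\)):

```python
import numpy as np

# Gradient of the matrix functional f(X) = tr(AX) wrt X is A^T,
# checked entrywise with finite differences.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))
f = lambda X: np.trace(A @ X)

t = 1e-6
G = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = 1.0
        G[i, j] = (f(X + t * E) - f(X)) / t
print(np.allclose(G, A.T, atol=1e-4))  # True
```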
Linearity
The differential operator \(D: f \mapsto D(f)\) is linear: so \(D(f+g) = D(f) + D(g)\) and \(D(cf) = cD(f)\). This follows from the affine approximation view of the differential function.
Note that this is separate from directional linearity.
Connection to partial derivatives
We suppose that linearity is established (simple in case of Frechet derivatives).
From linearity, \(D(f)(x)[v] = \sum_i v_i D(f)(x)[e_i]\). This can be written as a matrix-vector product \(D(f)(x)v\), with \(D(f)(x)\) regarded as the row vector with entries \(D(f)(x)[e_i] = \partder{f(x)}{x_i}\). When written as a column vector, it is denoted by \(\gradient f(x)\), in which case \(\gradient_v f(x) = \gradient f(x)^{T}v\).
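A brief numerical sketch of this connection, assuming NumPy; the functional \(f\), point \(x\) and direction \(v\) are illustrative. The gradient is assembled from partial derivatives along the basis vectors, and \(\gradient f(x)^{T}v\) is compared with a directly computed directional derivative:

```python
import numpy as np

f = lambda x: x[0] * x[1] ** 2 + np.cos(x[2])
x = np.array([1.0, 2.0, 0.3])
v = np.array([0.2, -1.0, 0.5])
t = 1e-6

# Partial derivatives along the basis vectors e_i give the gradient.
grad = np.array([(f(x + t * e) - f(x)) / t for e in np.eye(3)])
# Directional derivative along v, computed directly from the definition.
dir_deriv = (f(x + t * v) - f(x)) / t
print(grad @ v, dir_deriv)  # agree to about 1e-5
```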
Notation
\(\gradient f(x) := \frac{df(x)}{dx} := (\partder{f(x)}{x_{1}}, \dots, \partder{f(x)}{x_{n}})\).
Note about representation
Note that, as explained there, ‘gradients’ are defined wrt vectors, without differentiating between their representation as row or column vectors. Such representations are secondary to the correctness of their values, and can be altered as necessary for convenience of expression.
D(f) as a Vector field
Hence, the derivative \(D(f)\) can be viewed as a vector field, mapping each point \(x\) to the vector \(\gradient f(x)\). However, often, following the convention used for vector to vector functions, \(D(f)(x)\) is denoted by the row vector \(\gradient f(x)^{T}\).
C1 smoothness
\(f \in C^{1}\) if \(\partder{f}{x_i}\) exists and is continuous for every \(i\). Similarly, \(C^{n}\) and even \(C^{\infty}\) smoothness are defined.
Differentiability vs smoothness
The gradient’s existence does not guarantee differentiability; a standard sufficient condition is that the partial derivatives exist in an open ball around the point and are continuous at it.
In contour graph
Perpendicular to contours
\(\gradient f(x)\) is a \(d\) dimensional vector, always \(\perp\) to every tangent to the contour of \(f\) through \(x\) in \(d\) dimensional space: else one could move a short distance along the contour and change the value of \(f\). Alternately, take \(x\) and \(x + \eps\) on the contour and take the Taylor expansion \(f(x + \eps) \approx f(x) + \eps^{T}\gradient f(x)\); thence get \(\eps^{T}\gradient f(x) \approx 0\) for tangent directions \(\eps\).
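A tiny numerical illustration, assuming NumPy, for the specific example \(f(x) = x^{T}x\), whose contours are circles:

```python
import numpy as np

# For f(x) = x^T x, the contour through x is a circle; a tangent direction
# at x is x rotated by 90 degrees, and it is orthogonal to gradient f(x) = 2x.
x = np.array([3.0, 4.0])
grad = 2 * x
tangent = np.array([-x[1], x[0]])  # 90-degree rotation of x
print(grad @ tangent)  # 0.0
```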
Sublevel sets and gradient direction
Consider level-sets \(f(x) = 0, f(x) = 0.1, f(x) = 0.2\). \(\gradient f(x)\) will be oriented towards increasing \(f(x)\), that is, away from the interior of the sublevel set \(\set{x: f(x) \leq 0}\). So, it points outwards if the sublevel set is convex.
In the plot
Take the plot \((x, f(x))\). Then \(\gradient f(x)\), if it exists, is sufficient to specify the tangent hyperplane to the plot at \(x\): see the subsection on tangent hyperplanes.
Subgradients at convex points
Extension of the gradient to non-differentiable functionals \(f(x)\). See the convex functionals section.
Differential operator
Its general properties, including linearity, product rule and the chain rule, are considered under vector functions.
Derivatives of important functionals
For simplicity in remembering the rules, it is easier to think in terms of the differential operator rather than the gradient (which is just \(Df(x)^{T}\)).
Linear functionals
\(D(Ax) = A\); \(\gradient (Ax) = A^{T}\), \(\gradient (b^{T}x) = b\): from the \(Df(x)\) rules.
Quadratic functionals
\(\gradient x^{T}Ax = (A^{T} + A )x\):
Proof
Expand \((x+\del x_{i})^{T}A(x+\del x_{i})\) and collect the first order terms. Alternate proof, via the product rule: \(D(x^{T}Ax) = (Ax)^{T}D(x) + x^{T}D(Ax) = x^{T}A^{T} + x^{T}A = x^{T}(A^{T} + A)\).
If \(A = A^{T}\): \(D(x^{T}Ax) = x^{T}(2A)\).
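A numerical sketch of this identity, assuming NumPy, with a random \(A\) and \(x\) as an illustration; the gradient is computed by finite differences and compared with \((A^{T}+A)x\):

```python
import numpy as np

# Check gradient of x^T A x against (A^T + A) x via finite differences.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
f = lambda x: x @ A @ x

t = 1e-6
grad_fd = np.array([(f(x + t * e) - f(x)) / t for e in np.eye(4)])
print(np.allclose(grad_fd, (A.T + A) @ x, atol=1e-4))  # True
```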
Higher order differential functions
Definition
Linear map from V
Take the differential function \(D(f): V \to L(V, F)\). \(L(V, F)\) is itself a vector space, and the space of continuous linear maps \(L(V, L(V, F))\) is well defined. So, we can consider the differential function of \(D(f)\): it is \(D^2(f): V \to L(V, L(V, F))\).
Similarly, kth order differential function \(D^{k}(f)(x)\) can be defined in general.
Differential operators, of which \(D^{k}f(x)\) are special cases, for general functions between vector spaces are described elsewhere.
Directional higher order differential fn
With \(u\) fixed, \(D_u(f)(x) = D(f)(x)[u]\) can be viewed as a functional: \(D_u(f):V \to F\). One can consider the differential function of \(D_u(f)\). Applying the definition, it will be \(D(D_u(f)):V \to L(V, F)\) such that \(D(D_u(f))(x)\) is specified by $$D(D_u(f))(x)[v] = \lim_{\change t_v \to 0} \frac{D_u(f)(x + \change t_v v) - D_u(f)(x)}{\change t_v} = \lim_{\change t_u, \change t_v \to 0} \frac{f(x + \change t_v v + \change t_u u) - f(x + \change t_u u) - f(x + \change t_v v) + f(x)}{\change t_u \change t_v} = \frac{\partial^{2}}{\partial t_u \partial t_v}\Big|_{t_u = t_v = 0} f(x + t_u u + t_v v).$$
Multi-linear map from \htext{\(V^k\)}{V-k}
Note that, as defined here, \(D^2(f)(x)[u]\) is a continuous linear functional, which, when provided another argument \(v\), yields the scalar \(D^2(f)(x)[u][v]\).
So, using an isomorphism, it is convenient to view \(D^2(f)(x): V^{2} \to F\).
Hence, \(D^2(f): V \to L^{2}(V, F)\), where \(L^{k}(V, F)\) is the space spanned by k-linear maps \(g:V^{k} \to F\). So, \(D^2(f)\) maps each point \(x\) to a bilinear map.
Similarly, kth order differential functions can be defined in general.
Properties
Symmetry
\(D^{k}f(x)\) is symmetric, except in pathological cases which can be eliminated by a good definition. This may be seen from the form of \(D^{2}f(x)[u, v]\) described earlier, the mixed difference quotient \(\lim_{t_u, t_v \to 0} \frac{f(x + t_u u + t_v v) - f(x + t_u u) - f(x + t_v v) + f(x)}{t_u t_v}\), which is symmetric in \(u\) and \(v\).
Wrt basis vectors
The notation \(D^{2}f(x)[e_i][e_j] = D_{ij}f(x)\) is used.
Tensor representation
\(D^{2}f(x)[u][v] = \sum_{i, j} u_i v_j D_{ij}f(x)\).
Proof
By the distributive property of multilinear functions. This can also be proved by applying the chain rule, the directional linearity of the differential function and the linearity of the differential operator.
Similarly \(D^{k}f(x)\) can be completely specified using kth order derivatives along the basis vectors.
2nd order case
In the 2nd order case, this is aka the Hessian matrix. \(H_{i,j} = D_{i}D_{j}f(x)\): symmetric (in non-pathological cases, as above). Aka \(\gradient^{2} f(x) = \frac{\partial^{2} f(x)}{\partial x \partial x^{T}} = D \gradient f(x)\), using the notation for derivatives of general vector to vector functions.
This matrix is important in tests for convexity at a critical point.
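A numerical sketch, assuming NumPy, computing the Hessian of the illustrative quadratic \(f(x) = x^{T}Ax\) by the mixed difference quotient described above, and checking that it equals the symmetric matrix \(A + A^{T}\):

```python
import numpy as np

# Hessian of f(x) = x^T A x via second-order mixed differences; equals A + A^T.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
f = lambda x: x @ A @ x
x = rng.standard_normal(3)

t = 1e-4
n = len(x)
H = np.zeros((n, n))
for i, ei in enumerate(np.eye(n)):
    for j, ej in enumerate(np.eye(n)):
        H[i, j] = (f(x + t * ei + t * ej) - f(x + t * ei)
                   - f(x + t * ej) + f(x)) / t ** 2
print(np.allclose(H, A + A.T, atol=1e-2))  # True; note H is symmetric
```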
Polynomial approximation
See the 1-D case in complex analysis ref.
Restrict \(f\) to a line: \(g(t) = f(a + tv)\). The polynomial approximation of this function leads us to: \(f(a+v) = f(a) + \sum_{k = 1}^{n-1}\frac{1}{k!}D^{k}f(a)[v]^{k} + \frac{D^{n}f(c)[v]^{n}}{n!}\) for some \(c \in hull(a, a+v)\), ie in the line segment.
\(D^{k}f(a)[v]^{k}\) is often written using the product of k vectors with a k-th order tensor.
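A minimal numerical sketch of the 2nd order case of this approximation, assuming NumPy; the function, its gradient and its Hessian below are an illustrative hand-computed example:

```python
import numpy as np

# Second order polynomial approximation:
# f(a+v) ~= f(a) + gradient f(a)^T v + (1/2) v^T H(a) v, for f(x) = exp(x_1) sin(x_2).
f = lambda x: np.exp(x[0]) * np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]) * np.sin(x[1]),
                           np.exp(x[0]) * np.cos(x[1])])
hess = lambda x: np.array([[np.exp(x[0]) * np.sin(x[1]), np.exp(x[0]) * np.cos(x[1])],
                           [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

a = np.array([0.2, 0.7])
v = np.array([0.01, -0.02])
approx = f(a) + grad(a) @ v + 0.5 * v @ hess(a) @ v
print(f(a + v), approx)  # agree to third order in ||v||
```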
Polynomial approximation series
Aka Taylor series. Similarly, in the limit one gets \(f(x) = \sum_{b} \frac{D_{b} f(a)}{b!}(x-a)^{b}\), summing over all multi-indices \(b\). Here we have used the multi-index notation described below.
Multi-index notation
Take \(b \in Z_+^{n},\ x \in V\). Then, \(b! \dfn \prod_i b_i!\), \(D_{b} \dfn D_{1}^{b_1} \dots D_{n}^{b_n}\), \(x^{b} \dfn \prod_i x_i^{b_i}\).
Connection with extreme values
See optimization ref.
Derivative matrix
Motivation using directional derivatives
For every component functional \(f_i(x)\), we have \(D(f_i)(x)[v] = \dprod{\gradient f_i(x), v}\). So, each \(D(f_i)(x)\) is represented by the row vector \(\gradient f_i(x)^{T}\), as in the functional case.
Arrangement as rows
So, due to the definition of the differential function of vector valued functions, \(D(f)(x)[v] = Jv\), where \(J_{i, :} = D(f_i)(x)\). So, \(D(f)(x)\) is completely specified by \(J\), which may remind one of the fact that every linear operator can be represented by a matrix vector product.
This is aka Jacobian matrix. Notation: \(J_f(x) = Df(x) = \partder{(y_{1} \dots)}{(x_{1} \dots)}\): \(J_{i,j} = \partder{y_{i}}{x_{j}}\).
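A short numerical sketch, assuming NumPy, for an illustrative \(f: R^{2} \to R^{3}\); the Jacobian is assembled column by column from finite differences along the basis vectors and compared with the hand-computed matrix of partials:

```python
import numpy as np

# Jacobian J with J[i, j] = d f_i / d x_j, via finite differences.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
x = np.array([1.0, 2.0])

t = 1e-6
J = np.column_stack([(f(x + t * e) - f(x)) / t for e in np.eye(2)])
J_analytic = np.array([[x[1], x[0]],
                       [np.cos(x[0]), 0.0],
                       [0.0, 2 * x[1]]])
print(np.allclose(J, J_analytic, atol=1e-4))  # True
```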
Note about dimensions
As explained in the case of derivatives of functionals, representations are secondary to the correctness of their values, and can be altered as necessary for convenience of expression. One must however pay attention to them to be consistent with other entities in the same algebraic expression.
Differential operator
Linearity follows from linearity of functional derivatives.
Row-valued functions
Sometimes, one encounters a function whose component functionals are arranged as a row vector \((f(x))^{T}\), rather than as a column vector \(f(x)\). Though the actual derivative is the same, for the sake of consistency (eg: when one wants to apply the product rule to \((x^{T}A)x\) and consider \(D(x^{T}A)\)), one can simply compute \([D_x(f(x))]^{T}\).
Product of functions
From the scalar functional derivative product rule: \(D_x (f(x)^{T}g(x)) = g(x)^{T} D_x f(x) + f(x)^{T} D_x g(x)\), a row vector; equivalently, \(\gradient (f(x)^{T}g(x)) = (D_x f(x))^{T} g(x) + (D_x g(x))^{T} f(x)\). Note that the latter is a column vector.
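A numerical sketch of this product rule, assuming NumPy, with illustrative \(f, g: R^{2} \to R^{2}\) and hand-computed Jacobians; the gradient of \(f(x)^{T}g(x)\) from finite differences is compared with \((D_x f)^{T}g + (D_x g)^{T}f\):

```python
import numpy as np

f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
g = lambda x: np.array([np.sin(x[1]), x[0] + x[1]])
h = lambda x: f(x) @ g(x)  # the scalar functional f(x)^T g(x)

x = np.array([0.7, 1.3])
t = 1e-6
grad_fd = np.array([(h(x + t * e) - h(x)) / t for e in np.eye(2)])

Jf = np.array([[2 * x[0], 0.0], [x[1], x[0]]])    # D_x f(x)
Jg = np.array([[0.0, np.cos(x[1])], [1.0, 1.0]])  # D_x g(x)
grad_analytic = Jf.T @ g(x) + Jg.T @ f(x)
print(np.allclose(grad_fd, grad_analytic, atol=1e-4))  # True
```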
Composition of functions: chain rule
Directional differential functions
Take \(h(x) = g(f(x))\). Then \(Dh(x)[v] = D(g)(f(x))[D(f)(x)[v]]\).
Proof
We want \(Dh(x)\) such that \(g(f(x+tv)) = g(f(x)) + tDh(x)[v] + o(t)\) as \(t \to 0\). We get the result using the corresponding expansions for small \(t\): \(g(f(x+tv)) \approx g(f(x) + tD(f)(x)[v]) \approx g(f(x)) + tD(g)(f(x))[D(f)(x)[v]]\).
In matrix representation
In terms of derivative matrices, this is a matrix product: \(Dh(x)[v] = J_g(f(x)) J_f(x) v\), ie \(J_h(x) = J_g(f(x)) J_f(x)\)! Note that order matters: the Jacobian of the outer function (evaluated at \(f(x)\)) comes first, then that of the inner function.
+++(Observe how the dimensions match perfectly: for functional (function) compositions!)+++
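A numerical sketch of the chain rule in matrix form, assuming NumPy, with illustrative maps \(f, g: R^{2} \to R^{2}\); all Jacobians are approximated by finite differences:

```python
import numpy as np

f = lambda x: np.array([x[0] + x[1] ** 2, np.sin(x[0])])
g = lambda y: np.array([y[0] * y[1], np.exp(y[1])])
h = lambda x: g(f(x))  # composition

def jac_fd(fn, x, t=1e-6):
    # Finite-difference Jacobian, one column per basis direction.
    return np.column_stack([(fn(x + t * e) - fn(x)) / t for e in np.eye(len(x))])

x = np.array([0.4, 0.9])
# J_h(x) = J_g(f(x)) J_f(x): note the order of the factors.
print(np.allclose(jac_fd(h, x), jac_fd(g, f(x)) @ jac_fd(f, x), atol=1e-4))  # True
```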
Linear and constant functions
\(D(Ax)[v] = Av\), and \(D(Ax) = A\): from the affine approximation definition of a derivative. \(D(k) = 0\) for a constant \(k\).
Non-triviality of inversion
Consider \(f(x) = Mx\).
If \(J\) is square and \(M\) is invertible: \(J_{M^{-1}} = \partder{(x_{1} \dots)}{(y_{1} \dots)} = J_{M}^{-1}\): from the inverse function theorem \why. So, in general, \(\partder{y_{j}}{x_{i}} = J_{j, i} \neq 1/\partder{x_{i}}{y_{j}} = 1/J^{-1}_{i, j}\), unlike the 1-D equation \(\frac{dx}{dy} = \frac{1}{dy/dx}\).
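A small sketch of this, assuming NumPy, with a random invertible \(M\) as an illustration; the Jacobian of the inverse map is the matrix inverse of the forward Jacobian, while the entrywise reciprocals do not agree:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))  # assume M is invertible (true almost surely here)
J_forward = M                    # Jacobian of x -> Mx
J_inverse = np.linalg.inv(M)     # Jacobian of y -> M^{-1} y
print(np.allclose(J_inverse, np.linalg.inv(J_forward)))        # True
print(np.allclose(1 / J_forward, np.linalg.inv(J_forward).T))  # False in general
```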