Differential function

Definition

Fixed direction differential fn

Aka directional derivative.

Fixing the direction $v$, $D_v(f)$ can be taken to map $x$ to $D_v(f)(x)$. So, $D_v(f): V \to F$ is a restricted version of the differential function $D(f)$.

$df(x; h) = D_h(f)(x) = \lim_{\Delta t \to 0} \frac{f(x + \Delta t\, h) - f(x)}{\Delta t} = \frac{d}{dt}\Big|_{t=0} f(x + th)$. Aka Gateaux differential.

Alternate notation: $\nabla_h f(x)$: not the gradient vector, but its application in a certain direction.
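
A minimal numerical sketch of this definition, using an assumed example functional $f(x) = x^T x$ (the function and values here are illustrative, not part of the notes):

```python
# Sketch: check the Gateaux differential definition
# df(x; h) = lim_{t->0} (f(x + t h) - f(x)) / t
# for the assumed example f(x) = x^T x, where df(x; h) = 2 x^T h.
import numpy as np

def gateaux(f, x, h, t=1e-6):
    # One-sided difference quotient, straight from the definition.
    return (f(x + t * h) - f(x)) / t

f = lambda x: np.dot(x, x)
x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.0])
print(gateaux(f, x, h))   # ~ 4.0
print(2 * np.dot(x, h))   # exact value: 2 x^T h = 4.0
```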

Affine approximation view

This definition of the directional derivative is equivalent to defining $D_h(f)$ as the function such that the following holds as $t \to 0$: $f(x + th) = f(x) + t D_h(f)(x) + o(t)$.

R to R case

In this special case, there is just one direction: 1.

Directional differentiability

If, at $x$, the directional derivative exists in all directions, $f$ is said to be Gateaux differentiable at $x$.

The differential of $f$ at the point $x$ in the direction $v$ is a function of two variables: $x, v$. We regard $D(f)(x): V \to F$, such that $D(f)(x)[v] = \frac{d}{dt}\Big|_{t=0} f(x + tv)$ is the directional derivative of $f$ at $x$ along $v$.

So, $D(f): V \to L(V, F)$, where $L(V, F)$ is the space of continuous linear functionals $l: V \to F$. The fact that $D(f)(x)$ is a linear functional follows from the affine approximation view of the directional derivative.

But this is unsatisfactory, as directional differentiability does not imply continuity. \why

Continuous differentiability

If, at $x$, $\exists a$ such that $\forall c$: $f(x + c) - f(x) - a^T c = o(\|c\|)$, then $f$ is differentiable at $x$; and the derivative is $Df(x)[c] := a^T c$, which maps $V \to F$. (A measure of goodness of affine approximation!) The view $D(f): V \to L(V, F)$ still holds.

Aka Frechet derivative, total derivative.
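
A sketch of this condition, checking numerically that the remainder is $o(\|c\|)$ for the assumed example $f(x) = x^T x$ with candidate derivative $a = 2x$:

```python
# Sketch: the Frechet condition |f(x+c) - f(x) - a^T c| = o(||c||).
# Shrinking c should drive the remainder/||c|| ratio to zero.
import numpy as np

f = lambda x: np.dot(x, x)           # assumed example functional
x = np.array([1.0, -2.0])
a = 2 * x                            # candidate derivative at x
rng = np.random.default_rng(0)
d = rng.standard_normal(2)
for s in [1e-1, 1e-2, 1e-3, 1e-4]:
    c = s * d
    rem = abs(f(x + c) - f(x) - a @ c)
    print(s, rem / np.linalg.norm(c))   # ratio -> 0, linearly in s
```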

Connection to directional differentiability

In non-pathological cases, both notions of differentiability are equivalent. For continuous (Frechet) differentiability, directional differentiability follows from the definition. For directionally differentiable functions, the converse can be seen by applying the polynomial approximation theorem to $g: \mathbb{R} \to \mathbb{R}$, $g(t) = f(x + th)$: $f(x + th) \approx f(x) + t D_h(f)(x)$ as $t \to 0$.

Matrix functionals

A similar definition works for differential functions of functionals over the vector space of matrices. Eg: see $\operatorname{tr}(f(X))$ in the linear algebra ref.

Linearity

The differential operator $D: f \mapsto D(f)$ is linear: so $D(f + g) = D(f) + D(g)$. This follows from the affine approximation view of the differential function.

Note that this is separate from linearity in the direction argument (directional linearity).

Connection to partial derivatives

We suppose that linearity is established (simple in case of Frechet derivatives).

From linearity, $D(f)(x)[v] = \sum_i v_i D(f)(x)[e_i]$. This can be written as a vector product: $D(f)(x)\,v$, with $D(f)(x)$ regarded as the row vector $(D(f)(x)[e_1], \dots, D(f)(x)[e_n])$. When written as a column vector, it is denoted by $\nabla f(x)$, in which case $D_v f(x) = (\nabla f(x))^T v$.
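
A sketch of this connection, assuming the example functional $f(x) = \sin(x_1) + x_1 x_2$: the gradient is assembled from partials along the $e_i$, and the directional derivative matches $(\nabla f(x))^T v$:

```python
# Sketch: build grad f(x) from D(f)(x)[e_i] (central differences) and
# confirm D(f)(x)[v] = (grad f(x))^T v for the assumed example f below.
import numpy as np

def partials(f, x, t=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = 1.0
        g[i] = (f(x + t * e) - f(x - t * e)) / (2 * t)
    return g

f = lambda x: np.sin(x[0]) + x[0] * x[1]
x = np.array([0.5, 1.5])
v = np.array([2.0, -1.0])
grad = partials(f, x)
dv = (f(x + 1e-6 * v) - f(x - 1e-6 * v)) / 2e-6   # directional derivative
print(dv, grad @ v)                                # both ~ same value
```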

Notation

$\nabla f(x) := \frac{df(x)}{dx} := \left(\frac{\partial f(x)}{\partial x_1}, \dots, \frac{\partial f(x)}{\partial x_n}\right)$.

Note about representation

Note that, as explained there, ‘gradients’ are defined wrt vectors, without differentiating between their representation as row or column vectors. Such representations are secondary to the correctness of their values, and can be altered as necessary for convenience of expression.

D(f) as a Vector field

Hence, the derivative operator $D(f)$ can be viewed as a vector field, such that $D(f)(x) = \nabla f(x)$, a vector. However, often, following the convention used for vector to vector functions, $D(f)(x)$ is denoted by the row vector $\nabla f(x)^T$.

C1 smoothness

$f \in C^1$ if $\frac{\partial f}{\partial x_i}$ exists and is continuous for every $i$. Similarly, $C^n$, and even $C^\infty$, smoothness are defined.

Differentiability vs smoothness

The gradient’s existence does not guarantee differentiability. A sufficient condition: the partial derivatives exist in an open ball around $c$ and are continuous at $c$.

In contour graph

Perpendicular to contours

$\nabla f$ is a $d$ dimensional vector. It is always $\perp$ to every tangent to the contour of $f$ in $d$ dimensional space: else one could move a short distance along the contour and increase the value of $f$; or take $x$ and $x + \epsilon$ on the contour and take the Taylor expansion $f(x + \epsilon) \approx f(x) + \epsilon^T \nabla f(x)$; thence get $\epsilon^T \nabla f(x) = 0$.
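
A numerical illustration with the assumed example $f(x) = x_1^2 + 2x_2^2$, whose gradient is $(2x_1, 4x_2)$ and whose contour $f = 1$ can be parametrized explicitly:

```python
# Sketch: on the contour f = 1 of f(x) = x1^2 + 2 x2^2 (assumed example),
# the gradient is orthogonal to the contour's tangent direction.
import numpy as np

theta = 0.7
# Parametrize the contour f = 1: x(theta) = (cos t, sin t / sqrt(2)).
x = np.array([np.cos(theta), np.sin(theta) / np.sqrt(2)])
tangent = np.array([-np.sin(theta), np.cos(theta) / np.sqrt(2)])  # dx/dtheta
grad = np.array([2 * x[0], 4 * x[1]])
print(grad @ tangent)   # ~ 0: gradient is perpendicular to the contour
```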

Sublevel sets and gradient direction

Consider the level sets $f(x) = 0, f(x) = 0.1, f(x) = 0.2$. $\nabla f(x)$ will be oriented towards increasing $f(x)$, that is, away from the interior of the sublevel set $\{x : f(x) \leq 0\}$. So, it points outwards if the sublevel set is convex.

In the plot

Take the plot $(x, f(x))$. Then $\nabla f(x)$, if it exists, is sufficient to specify the tangent hyperplane to the plot at $x$: see the subsection on tangent hyperplanes.

Subgradients at convex points

An extension of the gradient to non-differentiable functionals $f(x)$. See the convex functional section.

Differential operator

Its general properties, including linearity, product rule and the chain rule, are considered under vector functions.

Derivatives of important functionals

For simplicity in remembering the rules, it is easier to think in terms of the differential operator, rather than the gradient (which is just $Df(x)^T$).

Linear functionals

$D(Ax) = A$: $\nabla(Ax) = A^T$, $\nabla(b^T x) = b$, from the $Df(x)$ rules.

Quadratic functionals

$\nabla(x^T A x) = (A^T + A)x$:

Proof

\pf{Expand $(x + \delta)^T A (x + \delta)$ and collect the terms linear in $\delta$.} Alternate \pf{By the product rule, $D(x^T A x) = x^T D(Ax) + (Ax)^T D(x) = x^T A + x^T A^T = x^T(A + A^T)$.}

If $A = A^T$: $D(x^T A x) = 2x^T A$.
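
A quick numerical check of $\nabla(x^T A x) = (A^T + A)x$ against central differences, for an arbitrary (assumed) random $A$:

```python
# Sketch: verify grad(x^T A x) = (A^T + A) x numerically.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
f = lambda x: x @ A @ x

t = 1e-6
num = np.array([(f(x + t * e) - f(x - t * e)) / (2 * t) for e in np.eye(3)])
print(np.allclose(num, (A.T + A) @ x, atol=1e-5))   # True
```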

Higher order differential functions

Definition

Linear map from V

Take the differential function $D(f): V \to L(V, F)$. $L(V, F)$ is itself a vector space, and the space of continuous linear maps $L(V, L(V, F))$ is well defined. So, we can consider the differential function of $D(f)$. It is $D^2(f): V \to L(V, L(V, F))$.

Similarly, the $k$th order differential function $D^k(f)$ can be defined in general.

Differential operators, of which $D^k f(x)$ are special cases, for general functions between vector spaces are described elsewhere.

Directional higher order differential fn

With $u$ fixed, $D_u(f)(x) = D(f)(x)[u]$ can be viewed as a functional: $D_u(f): V \to F$. One can consider the differential function of $D_u(f)$. Applying the definition, it will be $D(D_u(f)): V \to L(V, F)$ such that $D(D_u(f))(x)$ is specified by $D(D_u(f))(x)[v] = \lim_{\Delta t_v \to 0} \frac{D_u(f)(x + \Delta t_v v) - D_u(f)(x)}{\Delta t_v} = \lim_{\Delta t_u, \Delta t_v \to 0} \frac{f(x + \Delta t_v v + \Delta t_u u) - f(x + \Delta t_u u) - f(x + \Delta t_v v) + f(x)}{\Delta t_u \Delta t_v} = \frac{\partial^2}{\partial t_u \partial t_v}\Big|_{t_u, t_v = 0} f(x + t_u u + t_v v)$.
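
A sketch of the mixed difference quotient above, for the assumed example $f(x) = x_1^2 x_2$, compared against the exact bilinear value $u^T H v$:

```python
# Sketch: second directional derivative via the mixed difference quotient,
# compared with u^T H v using the exact Hessian of f(x) = x1^2 x2.
import numpy as np

f = lambda x: x[0] ** 2 * x[1]
x = np.array([1.0, 2.0])
u = np.array([1.0, 0.5])
v = np.array([-1.0, 2.0])

t = 1e-4
mixed = (f(x + t*u + t*v) - f(x + t*u) - f(x + t*v) + f(x)) / t**2
H = np.array([[2 * x[1], 2 * x[0]],    # exact Hessian of x1^2 x2
              [2 * x[0], 0.0]])
print(mixed, u @ H @ v)                # both ~ -1.0
```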

Multi-linear map from $V^k$

Note that, as defined here, $D^2(f)(x)[u]$ is a continuous linear functional, which, when provided another argument, as in $D^2(f)(x)[u][v]$, maps to a scalar.

So, using an isomorphism, it is convenient to view $D^2(f)(x): V^2 \to F$.

Hence, $D^2(f): V \to L^2(V, F)$, where $L^k(V, F)$ is the space spanned by $k$-linear maps $g: V^k \to F$. So, $D^2(f)$ maps each point $x$ to a bilinear map.

Similarly, $k$th order differential functions can be defined in general.

Properties

Symmetry

$D^k f(x)$ is symmetric, except in pathological cases which can be eliminated by a good definition. This may follow by looking at the form of $D^2 f(x)[u, v]$ described earlier: $\frac{\partial^2}{\partial t_1 \partial t_2}\Big|_{t_1, t_2 = 0} f(x + t_1 u + t_2 v)$, which is symmetric in $u, v$ whenever the mixed partials are continuous (Schwarz).

Wrt basis vectors

The notation $D^2 f(x)[e_i][e_j] = D_{ij} f(x)$ is used.

Tensor representation

$D^2 f(x)[u][v] = \sum_{i,j} u_i v_j D^2_{i,j} f(x)$.

Proof

By the distributive property of multilinear functions. This can also be proved by applying the chain rule, the directional linearity of the differential function and the linearity of the differential operator.

Similarly, $D^k f(x)$ can be completely specified using $k$th order derivatives along the basis vectors.

2nd order case

In the 2nd order case, this is aka the Hessian matrix. $H_{i,j} = D_i D_j f(x)$: always symmetric (in non-pathological cases). Aka $\nabla^2 f(x) = \frac{\partial^2 f(x)}{\partial x\, \partial x^T} = D(\nabla f)(x)$, using the notation for derivatives of general vector to vector functions.

This matrix is important in tests for convexity at a critical point.
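
A sketch assembling the Hessian via the mixed difference quotient from earlier and comparing it to the exact (symmetric) Hessian of an assumed example; note that this finite-difference formula is symmetric in $i, j$ by construction:

```python
# Sketch: H[i, j] ~ D_i D_j f(x), checked against the exact Hessian
# of the assumed example f(x) = exp(x1) sin(x2).
import numpy as np

def hessian(f, x, t=1e-4):
    n = len(x)
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + t*I[i] + t*I[j]) - f(x + t*I[i])
                       - f(x + t*I[j]) + f(x)) / t**2
    return H

f = lambda x: np.exp(x[0]) * np.sin(x[1])
x = np.array([0.3, 1.1])
ex = np.exp(x[0])
H_exact = np.array([[ex * np.sin(x[1]),  ex * np.cos(x[1])],
                    [ex * np.cos(x[1]), -ex * np.sin(x[1])]])
print(np.allclose(hessian(f, x), H_exact, atol=1e-3))   # True
```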

Polynomial approximation

See the 1-D case in complex analysis ref.

Restrict $f$ to a line: $g(t) = f(a + t(x - a))$. The polynomial approximation of this function leads us to: $f(a + v) = f(a) + \sum_{k=1}^{n-1} \frac{1}{k!} D^k f(a)[v]^k + \frac{D^n f(c)[v]^n}{n!}$ for some $c \in \operatorname{chull}(a, a + v)$, ie in the line segment.

$D^k f(a)[v]^k$ is often written using the product of $k$ vectors with a $k$th order tensor.
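
A sketch of the second order case for the assumed example $f(x) = \log(1 + x^T x)$, using its exact gradient and Hessian; the remainder should shrink like $\|v\|^3$:

```python
# Sketch: f(a + v) ~ f(a) + grad^T v + (1/2) v^T H v; the error is O(||v||^3).
import numpy as np

f = lambda x: np.log1p(x @ x)
a = np.array([0.5, -0.2])
r = a @ a
g = 2 * a / (1 + r)                                      # exact gradient
H = (2 * np.eye(2) * (1 + r) - 4 * np.outer(a, a)) / (1 + r) ** 2  # exact Hessian
d = np.array([1.0, 2.0])
for s in [1e-1, 1e-2, 1e-3]:
    v = s * d
    approx = f(a) + g @ v + 0.5 * v @ H @ v
    print(s, abs(f(a + v) - approx))                     # shrinks like s^3
```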

Polynomial approximation series

Aka Taylor series. Similarly, in the limit we get: $f(x) = \sum_{|b| \geq 0} \frac{D^b f(a)}{b!}(x - a)^b$. Here we have used the multi-index notation described below.

Multi-index notation

Take $b \in \mathbb{Z}_+^n$, $x \in V$. Then, $|b| := \sum_i b_i$, $b! := \prod_i b_i!$, $D^b := D_1^{b_1} \cdots D_n^{b_n}$, $x^b = \prod_i x_i^{b_i}$.

Connection with extreme values

See optimization ref.

Derivative matrix

Motivation using directional derivatives

For every component functional $f_i(x)$, we have $D(f_i)(x)[v] = \langle \nabla f_i(x), v \rangle$. So, this single functional $D(f_i)(x)$ is the row vector from the functional case.

Arrangement as rows

So, due to the definition of the differential function of vector valued functions, $D(f)(x)[v] = Jv$, where $J_{i,:} = D(f_i)(x)$. So, $D(f)(x)$ is completely specified by $J$, which may remind one of the fact that every linear operator can be represented by a matrix-vector product.

This is aka the Jacobian matrix. Notation: $J_f(x) = Df(x) = \frac{\partial(y_1, \dots, y_m)}{\partial(x_1, \dots, x_n)}$: $J_{i,j} = \frac{\partial y_i}{\partial x_j}$.
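
A sketch assembling $J$ column by column from central differences, for an assumed example $f: \mathbb{R}^2 \to \mathbb{R}^2$:

```python
# Sketch: J_{i,j} = dy_i/dx_j; column j of J holds df/dx_j.
import numpy as np

def jacobian(f, x, t=1e-6):
    cols = [(f(x + t * e) - f(x - t * e)) / (2 * t) for e in np.eye(len(x))]
    return np.column_stack(cols)

f = lambda x: np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])
x = np.array([1.0, 2.0])
J_exact = np.array([[x[1], x[0]],
                    [np.cos(x[0]), 2 * x[1]]])
print(np.allclose(jacobian(f, x), J_exact, atol=1e-5))   # True
```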

Note about dimensions

As explained in the case of derivatives of functionals, representations are secondary to the correctness of their values, and can be altered as necessary for convenience of expression. One must however pay attention to them to be consistent with other entities in the same algebraic expression.

Differential operator

Linearity follows from linearity of functional derivatives.

Row-valued functions

Sometimes, one encounters a function whose component functionals are arranged as a row vector $(f(x))^T$, rather than as a column vector $f(x)$. Though the actual derivative is the same, for the sake of consistency (eg: when one wants to apply the product rule to $(x^T A)x$ and consider $D(x^T A)$), one can simply compute $[D_x(f(x))]^T$.

Product of functions

From the scalar functional derivative product rule: $\nabla_x(f(x)^T g(x)) = (D_x f(x))^T g(x) + (D_x g(x))^T f(x)$. Note that this results in a column vector.
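
A numerical check of this rule with the assumed linear examples $f(x) = Ax$, $g(x) = Bx$, for which it gives $\nabla(f^T g) = A^T(Bx) + B^T(Ax)$:

```python
# Sketch: product rule for f(x) = Ax, g(x) = Bx, so that
# f(x)^T g(x) = x^T A^T B x and grad = A^T B x + B^T A x.
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.standard_normal((2, 3, 3))
x = rng.standard_normal(3)
h = lambda x: (A @ x) @ (B @ x)      # f(x)^T g(x)

t = 1e-6
num = np.array([(h(x + t*e) - h(x - t*e)) / (2*t) for e in np.eye(3)])
print(np.allclose(num, A.T @ (B @ x) + B.T @ (A @ x), atol=1e-5))  # True
```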

Composition of functions: chain rule

Directional differential functions

Take $h(x) = g(f(x))$. Then $Dh(x)[v] = D(g)(f(x))[D(f)(x)[v]]$.

Proof

We want $Dh(x)$ such that, as $t \to 0$, $g(f(x + tv)) = g(f(x)) + t\, Dh(x)[v] + o(t)$. We get the result by applying the same affine approximation twice for small $t$: $g(f(x + tv)) \approx g(f(x) + t D(f)(x)[v]) \approx g(f(x)) + t\, D(g)(f(x))[D(f)(x)[v]]$.

In matrix representation

In terms of derivative matrices, this is a matrix product: $D(g)(f(x))[D(f)(x)[v]] = J_g(f(x))\, J_f(x)\, v$! Note that the order matters: the Jacobian of the outer function comes first, then that of the inner function.

(Observe how the dimensions match perfectly for function compositions!)
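
A numerical check that $J_{g \circ f}(x) = J_g(f(x))\, J_f(x)$, with assumed example maps $f, g: \mathbb{R}^2 \to \mathbb{R}^2$:

```python
# Sketch: the chain rule as a matrix product of Jacobians.
import numpy as np

def jacobian(F, x, t=1e-6):
    return np.column_stack([(F(x + t*e) - F(x - t*e)) / (2*t)
                            for e in np.eye(len(x))])

f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
g = lambda y: np.array([np.sin(y[0]), y[0] * y[1]])
x = np.array([0.7, 1.3])

lhs = jacobian(lambda x: g(f(x)), x)
rhs = jacobian(g, f(x)) @ jacobian(f, x)   # outer Jacobian first, then inner
print(np.allclose(lhs, rhs, atol=1e-4))    # True
```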

Linear and constant functions

$D(Ax)[v] = Av$, and $D(Ax) = A$: from the affine approximation definition of the derivative. $D(k) = 0$ for a constant $k$.

Non-triviality of inversion

Consider $f(x) = Mx$.

If $J$ is square and $M$ is invertible: $J_{f^{-1}} = \frac{\partial(x_1, \dots)}{\partial(y_1, \dots)} = J_f^{-1} = M^{-1}$: from the inverse function thm \why. So, in general, $\frac{\partial x_i}{\partial y_j} = (J^{-1})_{i,j} \neq 1 / J_{j,i}$, unlike the 1-D eqn $\frac{dx}{dy} = \frac{1}{dy/dx}$.
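
A sketch of this point with an assumed invertible $M$: the Jacobian of the inverse map is $M^{-1}$, which differs from the entry-wise reciprocals of $J^T$:

```python
# Sketch: J of f(x) = Mx is M; J of f^{-1}(y) = M^{-1} y is M^{-1}.
# Entry-wise, dx_i/dy_j = (J^{-1})_{i,j}, NOT 1 / J_{j,i}.
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
J = M
print(np.linalg.inv(J))   # Jacobian of the inverse map
print(1 / J.T)            # entry-wise reciprocal of J^T: different!
```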