Matrix Calculus

Contents of Calculus Section

Notation
Differentials of Linear, Quadratic and Cubic Products
Differentials of Inverses, Trace and Determinant
Hessian matrices

Notation

j is the square root of -1
X^R and X^I are the real and imaginary parts of X = X^R + jX^I
X^C is the complex conjugate of X
X: denotes the long column vector formed by concatenating the columns of X
A ¤ B = KRON(A,B), the Kronecker product
A • B the Hadamard or elementwise product
matrices and vectors A, B, C do not depend on X

Derivatives

In the main part of this page we express results in terms of differentials rather than derivatives for two reasons: they avoid notational disagreements and they cope easily with the complex case. In most cases, however, the differentials have been written in the form dY: = dY/dX dX: so that the corresponding derivative may be easily extracted.

Derivatives with respect to a real matrix

If X is p#q and Y is m#n, then dY: = dY/dX dX: where the derivative dY/dX is a large mn#pq matrix. If X and/or Y are column vectors or scalars, then the vectorization operator : has no effect and may be omitted. dY/dX is also called the Jacobian Matrix of Y: with respect to X: and det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).
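As an illustration of the mn#pq layout, the sketch below builds dY/dX column by column for the linear map Y = AXB and compares it with B^T ¤ A, which is what the identity d(A^TXB): = (B ¤ A)^T dX: from the linear-functions list gives when A^T is replaced by A. The helper names (`matmul`, `vec`, `kron`) and the example matrices are assumptions made for this sketch only; plain Python lists are used so that nothing beyond the standard library is needed.

```python
def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def vec(M):
    """The ':' operator: stack the columns of M into one long list."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

def kron(P, Q):
    """Kronecker product KRON(P, Q)."""
    return [[P[i][j] * Q[k][l]
             for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]

A = [[1, 2], [3, 4]]             # m#p = 2#2
X = [[1, -1, 2], [0, 5, 1]]      # p#q = 2#3
B = [[1, 0], [2, 1], [0, 3]]     # q#n = 3#2

def Y(X):
    return matmul(matmul(A, X), B)   # m#n = 2#2

p, q = len(X), len(X[0])
base = vec(Y(X))
columns = []
for c in range(p * q):               # c runs over the entries of X:
    i, j = c % p, c // p             # column-major position of entry c
    Xp = [row[:] for row in X]
    Xp[i][j] += 1                    # Y is linear in X, so a unit step is exact
    columns.append([u - v for u, v in zip(vec(Y(Xp)), base)])

# Row r of the Jacobian corresponds to entry r of Y:, column c to entry c of X:
jacobian = [[col[r] for col in columns] for r in range(len(base))]
B_T = [list(row) for row in zip(*B)]
expected = kron(B_T, A)

print(len(jacobian), len(jacobian[0]))   # mn and pq
print(jacobian == expected)
```

Because the map is linear and the entries are integers, the comparison is exact rather than approximate.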

Although they do not generalise so well, other authors use alternative notations for the cases when X and Y are both vectors or when one is a scalar. In particular:

dy/dx is sometimes transposed from the above definition or else is sometimes written dy/dx^T to emphasise the correspondence between the columns of the derivative and those of x^T.
dY/dx and dy/dX are often written as matrices rather than, as here, a column vector and row vector respectively. The matrix form may be converted to the form used here by appending : or :^T respectively.

Derivatives with respect to a complex matrix

If X is complex then dY: = dY/dX dX: holds iff Y(X) is an analytic function, which implies in particular that Y(X) does not depend on X^C or X^H.

Even for non-analytic functions we can write uniquely dY: = dY/dX dX: + dY/dX^C dX^C: provided that Y is analytic with respect to X and X^C individually (or equivalently with respect to X^R and X^I individually). dY/dX is the Generalized Complex Derivative and dY/dX^C is the Complex Conjugate Derivative [R.3, R.8]. We have the following relationships:

dY: = dY/dX dX: + dY/dX^C dX^C:
dY/dX = ½ (dY/dX^R - j dY/dX^I)
dY/dX^C = (dY^C/dX)^C = ½ (dY/dX^R + j dY/dX^I)
- If Y(X) is real for all complex X, then dY/dX^C= (dY/dX)^C
Cauchy Riemann equations: The following are equivalent:
- Y(X) is an analytic function of X
- dY/dX^C = 0 for all X
- dY/dX^R + j dY/dX^I = 0 for all X
dY/dX^R = dY/dX + dY/dX^C
dY/dX^I = j (dY/dX - dY/dX^C)
Chain rule: If Z is a function of Y which is itself a function of X, then dZ/dX = dZ/dY dY/dX. This is the same as for real derivatives.
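The relationships above can be checked numerically in the scalar case. The sketch below estimates dY/dX^R and dY/dX^I by central differences and combines them as ½(dY/dX^R - j dY/dX^I) and ½(dY/dX^R + j dY/dX^I). The function names, the step size and the two test functions (the analytic z² and the non-analytic |z|²) are choices made for this sketch.

```python
def partials(f, z, h=1e-6):
    """Central-difference estimates of df/dz^R and df/dz^I."""
    d_re = (f(z + h) - f(z - h)) / (2 * h)
    d_im = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return d_re, d_im

def complex_derivatives(f, z):
    """Return (df/dz, df/dz^C) built from the real and imaginary partials."""
    d_re, d_im = partials(f, z)
    return 0.5 * (d_re - 1j * d_im), 0.5 * (d_re + 1j * d_im)

z0 = 1.5 - 0.5j

# Analytic function: the conjugate derivative should vanish (Cauchy Riemann).
d_an, dc_an = complex_derivatives(lambda z: z * z, z0)

# Real-valued, non-analytic function |z|^2 = z z^C.
d_abs, dc_abs = complex_derivatives(lambda z: abs(z) ** 2, z0)

print(abs(d_an - 2 * z0), abs(dc_an))                       # both near zero
print(abs(d_abs - z0.conjugate()), abs(dc_abs - z0))        # both near zero
```

For |z|² the two derivatives come out as z^C and z, which also illustrates the rule dY/dX^C = (dY/dX)^C for real-valued Y.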

Complex Gradient Vector

If f(x) is a real function of a complex vector then df/dx^C = (df/dx)^C and we can define grad(f(x)) = 2 (df/dx)^H = (df/dx^R + j df/dx^I)^T as the Complex Gradient Vector [R.8]. It is a column vector of the same size as x (df/dx itself is a row vector) and has the following properties:

grad(f(x)) is zero at an extreme value of f.
grad(f(x)) points in the direction of steepest slope of f(x)
The magnitude of the steepest slope is equal to |grad(f(x))|. Specifically, if g(x) = grad(f(x)), then lim_a->0 a^-1( f(x+ag(x)) - f(x) ) = | g(x) |²
grad(f(x)) is normal to the surface f(x) = constant which means that it can be used for gradient ascent/descent algorithms.
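A minimal sketch of gradient descent with the complex gradient, taking each entry of grad(f(x)) as the partial derivative with respect to the real part plus j times the partial derivative with respect to the imaginary part of the matching entry of x. The cost f(x) = (x-c)^H(x-c), the target vector `c`, the step size 0.25 and the finite-difference step are all assumptions chosen for illustration; for this cost the gradient is 2(x-c).

```python
c = [1 + 2j, -3j]                       # hypothetical target vector

def f(x):
    """Real-valued cost f(x) = (x-c)^H (x-c)."""
    return sum(abs(xi - ci) ** 2 for xi, ci in zip(x, c))

def complex_gradient(f, x, h=1e-6):
    """Entry k is df/dx_k^R + j df/dx_k^I, estimated by central differences."""
    grad = []
    for k in range(len(x)):
        def shifted(delta, k=k):
            y = list(x)
            y[k] += delta
            return f(y)
        d_re = (shifted(h) - shifted(-h)) / (2 * h)
        d_im = (shifted(1j * h) - shifted(-1j * h)) / (2 * h)
        grad.append(d_re + 1j * d_im)
    return grad

x = [0j, 0j]
g0 = complex_gradient(f, x)             # analytically 2(x - c) at the start point

for _ in range(50):
    g = complex_gradient(f, x)
    x = [xi - 0.25 * gi for xi, gi in zip(x, g)]   # step along -grad

print(g0)
print(x)                                # should be close to c
```

With this step size each iteration halves the distance to c, so the iterate settles at the minimiser, where the gradient is zero.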

Basic Properties

We may write the following differentials unambiguously without parentheses:
- Transpose: dY^T=d(Y^T)=(dY)^T
- Hermitian Transpose: dY^H=d(Y^H)=(dY)^H
- Conjugate: dY^C=d(Y^C)=(dY)^C
Linearity: d(Y+Z)=dY+dZ
Chain Rule: If Z is a function of Y which is itself a function of X, then for both the normal and the generalized complex derivative: dZ: = dZ/dY dY: = dZ/dY dY/dX dX:
Product Rule: d(YZ) =Y dZ + dY Z
- d(YZ): = (I ¤ Y) dZ: + (Z^T ¤ I) dY: = ((I ¤ Y) dZ/dX + (Z^T ¤ I) dY/dX ) dX:
Hadamard Product: d(Y • Z) =Y • dZ + dY • Z
Kronecker Product: d(Y ¤ Z) =Y ¤ dZ + dY ¤ Z

Differentials of Linear Functions

d(Ax) = d(x^TA^T): = A dx
- d(x^Ta) = d(a^Tx) = a^T dx
d(x^HA): = A^T dx^C
d(A^TXB): = (A^T dX B): = (B ¤ A)^T dX:
- d(a^TXb) = (b ¤ a)^T dX: = (ab^T):^T dX:
  - d(a^TXa) = d(a^TX^Ta) = (a ¤ a)^T dX: = (aa^T):^T dX:
- d(a^TX^Tb) = (a ¤ b)^T dX: = (ba^T):^T dX:

Differentials of Quadratic Products

d((Ax+b)^TC(Dx+e)) = ((Ax+b)^TCD + (Dx+e)^TC^TA) dx
- d(x^TCx) = x^T(C+C^T)dx = [C=C^T] 2x^TCdx
  - d(x^Tx) = 2x^Tdx
- d((Ax+b)^T (Dx+e)) = ( (Ax+b)^TD + (Dx+e)^TA)dx
  - d((Ax+b)^T (Ax+b)) = 2(Ax+b)^TAdx
- d((Ax+b)^TC(Ax+b)) = [C=C^T] 2(Ax+b)^TC^TA dx
d((Ax+b)^HC(Dx+e)) = (Ax+b)^HCD dx + (Dx+e)^TC^TA^C dx^C
- d(x^HCx) = x^HC dx + x^TC^T dx^C = [C=C^H] 2(x^HC dx)^R
- d(x^Hx) = 2(x^H dx)^R
d(a^TX^TXb) = (X(ab^T + ba^T)):^T dX:
- d(a^TX^TXa) = 2(Xaa^T ):^T dX:
d(a^TX^TCXb) = (C^TXab^T + CXba^T):^T dX:
- d(a^TX^TCXa) = ((C + C^T)Xaa^T ):^T dX: = [C=C^T] 2(CXaa^T):^T dX:
d((Xa+b)^TC(Xa+b)) = ((C+C^T)(Xa+b)a^T ):^T dX:
d(X²): = (XdX + dX X): = (I ¤ X + X^T ¤ I) dX:
d(X^TCX): = (X^TCdX): + (d(X^T) CX): = (I ¤ X^TC) dX: + (X^TC^T ¤ I) dX^T:
d(X^HCX): = (X^HCdX): + (d(X^H) CX): = (I ¤ X^HC) dX: + (X^TC^T ¤ I) dX^H:
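The quadratic results are easy to sanity-check by finite differences. The sketch below does this for d(x^TCx) = x^T(C+C^T)dx with a deliberately non-symmetric C; the matrix, the vector and the helper names are assumptions for this example. A central difference with unit step is exact for a quadratic, so integer arithmetic suffices.

```python
C = [[2, 1, 0], [-1, 3, 4], [5, 0, 1]]   # deliberately not symmetric
x = [1, -2, 3]
n = len(x)

def f(x):
    """Scalar quadratic form x^T C x."""
    return sum(x[i] * C[i][j] * x[j] for i in range(n) for j in range(n))

def unit(k):
    """k-th unit vector."""
    return [1 if i == k else 0 for i in range(n)]

# Central difference with step 1 along each coordinate; exact for a quadratic.
numeric = [(f([a + e for a, e in zip(x, unit(k))])
            - f([a - e for a, e in zip(x, unit(k))])) // 2
           for k in range(n)]

# Row vector x^T (C + C^T): the coefficient of dx in the listed formula.
analytic = [sum(x[i] * (C[i][j] + C[j][i]) for i in range(n)) for j in range(n)]

print(numeric)
print(analytic)
```

Using only C instead of C+C^T would give a different row vector here, which is why the [C=C^T] shortcut 2x^TC dx carries its symmetry condition.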

Differentials of Cubic Products

d(xx^TAx) = (xx^T(A+A^T)+x^TAxI )dx

Differentials of Inverses

d(X^-1) = -X^-1dX X^-1 [2.1]
- d(X^-1): = -(X^-T ¤ X^-1) dX:
d(a^TX^-1b) = -(X^-Tab^TX^-T ):^T dX: [2.6]
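A small numerical check of d(X^-1) = -X^-1 dX X^-1 for a 2#2 matrix, sketched under the assumption of an arbitrary non-singular X and perturbation direction dX. The helpers `matmul` and `inv2` and the step `t` are illustrative names introduced here.

```python
def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def inv2(M):
    """Inverse of a non-singular 2#2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[4.0, 1.0], [2.0, 3.0]]
dX = [[0.3, -0.2], [0.1, 0.5]]     # arbitrary perturbation direction
t = 1e-5

def shifted(s):
    """X + s dX."""
    return [[X[i][j] + s * dX[i][j] for j in range(2)] for i in range(2)]

# Directional derivative of the inverse by central differences.
plus, minus = inv2(shifted(t)), inv2(shifted(-t))
numeric = [[(plus[i][j] - minus[i][j]) / (2 * t) for j in range(2)]
           for i in range(2)]

# The listed differential: -X^-1 dX X^-1.
Xi = inv2(X)
analytic = [[-v for v in row] for row in matmul(matmul(Xi, dX), Xi)]

max_error = max(abs(numeric[i][j] - analytic[i][j])
                for i in range(2) for j in range(2))
print(max_error)                   # small compared with the entries themselves
```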

Differentials of Trace

Note: matrix dimensions must result in an n#n argument for tr().

d(tr(Y))=tr(dY)
d(tr(X)) = d(tr(X^T)) = I:^T dX: [2.4]
d(tr(X^k)) = k((X^(k-1))^T):^T dX:
d(tr(AX^k)) = (SUM_r=0:k-1 (X^r A X^(k-r-1))^T):^T dX:
d(tr(AX^-1B)) = -(X^-1BAX^-1)^T:^T dX:= -(X^-TA^TB^TX^-T):^T dX: [2.5]
- d(tr(AX^-1)) =d(tr(X^-1A)) = -(X^-TA^TX^-T ):^T dX:
d(tr(A^TXB^T)) = d(tr(BX^TA)) = (AB):^T dX: [2.4]
- d(tr(XA^T)) = d(tr(A^TX)) =d(tr(X^TA)) = d(tr(AX^T)) = A:^T dX:
d(tr(AXBX^TC)) = (A^TC^TXB^T + CAXB):^T dX:
- d(tr(XAX^T)) = d(tr(AX^TX)) = d(tr(X^TXA)) =( X(A+A^T)):^T dX:
- d(tr(X^TAX)) = d(tr(AXX^T)) = d(tr(XX^TA)) = ((A+A^T)X):^T dX:
d(tr(AXBX)) = (A^TX^TB^T + B^TX^TA^T ):^T dX:
d(tr((AXb+c)(AXb+c)^T)) = 2(A^T(AXb+c)b^T):^T dX:
d(tr((X^TCX)^-1A)) = [C:symmetric] d(tr(A (X^TCX)^-1)) = -((CX(X^TCX)^-1)(A+A^T)(X^TCX)^-1):^T dX:
d(tr((X^TCX)^-1(X^TBX))) = [B,C:symmetric] d(tr( (X^TBX)(X^TCX)^-1)) = 2(BX(X^TCX)^-1-(CX(X^TCX)^-1)X^TBX(X^TCX)^-1 ):^T dX:
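Each trace result has the form G:^T dX:, so the matrix G holds the partial derivatives of the scalar with respect to the matching entries of X. The sketch below checks this reading for d(tr(XAX^T)) = (X(A+A^T)):^T dX: with a non-square X. The matrices and helper names are assumptions for the example; since the trace is quadratic in X, a unit-step central difference is exact in integers.

```python
def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 0], [0, 1, 3], [4, 0, 2]]    # 3#3, not symmetric
X = [[1, 0, 2], [-1, 3, 1]]              # 2#3, so XAX^T is 2#2

def f(X):
    """Scalar tr(X A X^T)."""
    return trace(matmul(matmul(X, A), transpose(X)))

def bump(i, j, s):
    """Copy of X with s added to entry (i, j)."""
    Xs = [row[:] for row in X]
    Xs[i][j] += s
    return Xs

# Entry (i, j) of the numeric gradient is d f / d X[i][j].
numeric = [[(f(bump(i, j, 1)) - f(bump(i, j, -1))) // 2 for j in range(3)]
           for i in range(2)]

A_plus_AT = [[A[i][j] + A[j][i] for j in range(3)] for i in range(3)]
analytic = matmul(X, A_plus_AT)          # X (A + A^T)

print(numeric)
print(analytic)
```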

Differentials of Determinant

Note: matrix dimensions must result in an n#n argument for det(). Some of the expressions below involve inverses: these forms apply only if the quantity being inverted is square and non-singular; alternative forms involving the adjoint, ADJ(), do not have the non-singular requirement.

d(det(X)) = d(det(X^T)) = ADJ(X^T):^T dX: = det(X) (X^-T):^T dX: [2.7]
d(det(A^TXB)) = d(det(B^TX^TA)) = (A ADJ(A^TXB)^TB^T):^T dX: = [A,B: nonsingular] det(A^TXB) × (X^-T):^T dX: [2.8]
d(ln(det(A^TXB))) = [A,B: nonsingular] (X^-T):^T dX: [2.9]
- d(ln(det(X))) = (X^-T):^T dX:
d(det(X^k)) = d(det(X)^k) = k × det(X^k) × (X^-T):^T dX: [2.10]
d(ln(det(X^k))) = k × (X^-T):^T dX:
d(det(X^TCX)) = [C=C^T] 2det(X^TCX)×(CX(X^TCX)^-1):^T dX: [2.11]
- = [C=C^T, CX: nonsingular] 2det(X^TCX)×(X^-T):^T dX:
d(ln(det(X^TCX))) = [C=C^T] 2(CX(X^TCX)^-1):^T dX:
- = [C=C^T, CX: nonsingular] 2(X^-T):^T dX:
d(det(X^HCX)) = det(X^HCX) × ((C^TX^C (X^TC^TX^C)^-1):^T dX: + (CX(X^HCX)^-1):^T dX^C:) [2.12]
d(ln(det(X^HCX))) = (C^TX^C (X^TC^TX^C)^-1):^TdX: + (CX(X^HCX)^-1):^T dX^C: [2.13]
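The note at the head of this section distinguishes the adjoint forms from the inverse forms. The sketch below illustrates both for 2#2 matrices: d(det(X)) = ADJ(X^T):^T dX: is checked on a singular matrix, where the inverse form is unavailable, and d(ln(det(X))) = (X^-T):^T dX: is checked on a non-singular matrix with positive determinant. The matrices, helper names and step size are assumptions for this example.

```python
import math

def det2(M):
    (a, b), (c, d) = M
    return a * d - b * c

def adj2(M):
    """Adjoint (adjugate) of a 2#2 matrix."""
    (a, b), (c, d) = M
    return [[d, -b], [-c, a]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def bump(M, i, j, s):
    """Copy of M with s added to entry (i, j)."""
    Ms = [row[:] for row in M]
    Ms[i][j] += s
    return Ms

# Singular X: det is affine in any single entry, so a unit step is exact.
S = [[1, 2], [3, 6]]
numeric_det = [[det2(bump(S, i, j, 1)) - det2(S) for j in range(2)]
               for i in range(2)]
analytic_det = adj2(transpose(S))          # ADJ(X^T)

# Non-singular X with det > 0: central differences of ln(det(X)).
X = [[3.0, 1.0], [2.0, 4.0]]
h = 1e-6
numeric_ln = [[(math.log(det2(bump(X, i, j, h)))
                - math.log(det2(bump(X, i, j, -h)))) / (2 * h)
               for j in range(2)] for i in range(2)]
d = det2(X)
inv_T = [[v / d for v in row] for row in adj2(transpose(X))]   # X^-T

max_error = max(abs(numeric_ln[i][j] - inv_T[i][j])
                for i in range(2) for j in range(2))
print(numeric_det, analytic_det)
print(max_error)
```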

Jacobian

dY/dX is called the Jacobian Matrix of Y: with respect to X: and J_X(Y)=det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).

J_X(X_[n#n]^-1) = (-1)ⁿ det(X)^(-2n)
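This result can be recovered from the differential of the inverse, d(X^-1): = -(X^-T ¤ X^-1) dX:. The derivation below is a sketch that assumes two standard determinant facts not stated on this page: det(cM) = c^N det(M) for an N#N matrix M, and det(P ¤ Q) = det(P)^n det(Q)^n for n#n matrices P and Q. The last step uses the fact that n² and n have the same parity.

```latex
\begin{aligned}
\frac{d\,(X^{-1}){:}}{d\,X{:}} &= -\left(X^{-T} \otimes X^{-1}\right)
  \qquad (n^2 \times n^2) \\
J_X(X^{-1}) &= \det\!\left(-\,X^{-T} \otimes X^{-1}\right)
  = (-1)^{n^2}\,\det(X^{-T})^{n}\,\det(X^{-1})^{n} \\
&= (-1)^{n^2}\,\det(X)^{-2n}
  = (-1)^{n}\,\det(X)^{-2n}
\end{aligned}
```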

Hessian matrix

If f is a real function of x then the Hermitian matrix H_x f= d/dx (df/dx)^H is the Hessian matrix of f(x). A value of x for which grad f(x) = 0 corresponds to a minimum, maximum or saddle point according to whether H_x f is positive definite, negative definite or indefinite.

H_x (a^Tx) = 0
H_x (Ax+b)^TC(Dx+e) = A^TCD + D^TC^TA
- H_x (x^TCx) = C+C^T
  - H_x (x^Tx) = 2I
- H_x (Ax+b)^T (Dx+e) = A^TD + D^TA
  - H_x (Ax+b)^T (Ax+b) = 2A^TA
- H_x (Ax+b)^TC(Ax+b) = [C=C^T] 2A^TCA
H_x (x^HCx) = [C=C^H] 2C
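For real x the Hessian entries are the second partial derivatives of f, so the listed results can be checked with second differences. The sketch below does this for H_x (Ax+b)^TC(Dx+e) = A^TCD + D^TC^TA. All matrices, vectors and helper names are assumptions for the example; the function is quadratic, so a unit-step mixed difference is exact in integers.

```python
def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(row) for row in zip(*M)]

A, b = [[1, 2], [0, 1]], [1, -1]
C = [[2, 1], [0, 3]]
D, e = [[1, 0], [2, 1]], [0, 2]

def affine(M, v, x):
    """M x + v for list-based M, v, x."""
    return [sum(M[i][k] * x[k] for k in range(len(x))) + v[i]
            for i in range(len(v))]

def f(x):
    """Scalar (Ax+b)^T C (Dx+e)."""
    u, w = affine(A, b, x), affine(D, e, x)
    return sum(u[i] * C[i][j] * w[j] for i in range(2) for j in range(2))

def step(x, i, si, j, sj):
    """Copy of x with si added to entry i and sj added to entry j."""
    y = list(x)
    y[i] += si
    y[j] += sj
    return y

x0 = [1, -1]
# Mixed second difference; equals 4 * H[i][j] exactly for a quadratic f.
numeric = [[(f(step(x0, i, 1, j, 1)) - f(step(x0, i, 1, j, -1))
             - f(step(x0, i, -1, j, 1)) + f(step(x0, i, -1, j, -1))) // 4
            for j in range(2)] for i in range(2)]

ACD = matmul(matmul(transpose(A), C), D)
DCA = matmul(matmul(transpose(D), transpose(C)), A)
analytic = [[ACD[i][j] + DCA[i][j] for j in range(2)] for i in range(2)]

print(numeric)
print(analytic)
```

The result does not depend on the point x0, as expected for a quadratic function.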

This page is part of The Matrix Reference Manual. Copyright © 1998-2005 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".
Updated: $Id: calculus.html,v 1.14 2005/08/17 10:42:09 dmb Exp $