# Math 31BH: Lecture 3

Let $f \colon \mathbf{V} \to \mathbf{W}$ be a function from one Euclidean space to another. In making this declaration, we did not make the scalar products on the source and target spaces explicit. This omission is commonly made for the sake of convenience, and by abuse of notation we just denote all scalar products $\langle \cdot,\cdot \rangle$, since the vector space on which a given scalar product is defined can always be deduced from context. The resulting notational ambiguity is ultimately less confusing than requiring distinct symbols for all scalar products in play at any given time.

As discussed in Lecture 2, our goal is to do calculus with functions which take vector inputs and produce vector outputs. Since $f(\mathbf{v})$ is a function of only one variable, the vector $\mathbf{v}$, let us explain why this subject is also known as multivariable calculus. Let $\mathcal{B}=\{\mathbf{b}_1,\dots,\mathbf{b}_n\}$ be an orthonormal basis of $\mathbf{V}$, so that every $\mathbf{v} \in \mathbf{V}$ is given by

$\mathbf{v} = \langle \mathbf{b}_1,\mathbf{v}\rangle \mathbf{b}_1 + \dots + \mathbf{b}_n,\mathbf{v}\rangle \mathbf{b}_n.$

If $\mathbf{v}$ is viewed as a variable (rather than a particular vector), then its coordinates

$x_1:=\langle \mathbf{b}_1,\mathbf{v}\rangle,\dots,x_n:=\langle \mathbf{b}_m,\mathbf{v}\rangle$

relative to the orthonormal basis $\mathcal{B}$ also become variables. In other words, associated to the function ${f} \colon \mathbf{V} \to \mathbf{W}$ is another function $f_\mathcal{B} \colon \mathbb{R}^n \to \mathbf{W}$ defined by

$f_\mathcal{B}(x_1,\dots,x_n) := f(x_1\mathbf{b}_1 + \dots + x_n\mathbf{b}_n),$

which is a function of the $n$ scalar variables $x_1,\dots,x_n$, where $n =\dim \mathbf{V}$. The objects $f(\mathbf{v})$ and $f_\mathcal{B}(x_1,\dots,x_n)$ contain exactly the same information, even though the former is a function of a single vector variable $\mathbf{v}$ and the latter is a function of $n$ scalar variables $x_1,\dots,x_n$. Notice that the construction of $f_\mathcal{B}$ from $f$ makes no mention of the scalar product on $\mathbf{W}$, nor does it reference a basis in $\mathbf{W}$.

Now suppose we choose an orthonormal basis $\mathcal{C} = \{\mathbf{c}_1,\dots,\mathbf{c}_m\}$ of $\mathbf{W}$. Then, for any $\mathbf{v} \in \mathbf{V}$, we have

$f(\mathbf{v}) = \langle \mathbf{c}_1,f(\mathbf{v})\rangle \mathbf{c}_1 + \dots + \langle \mathbf{c}_m,f(\mathbf{v})\rangle \mathbf{c}_m,$

and since we are thinking of $\mathbf{v} \in \mathbf{V}$ as a variable, it is natural to think of the coordinates of $f(\mathbf{v})$ in the basis $\mathcal{C}$ as functions of $\mathbf{v}$,

$f_{\mathcal{C}1}(\mathbf{v}):=\langle \mathbf{c}_1,f(\mathbf{v})\rangle,\ \dots,\ f_{\mathcal{C}m}(\mathbf{v}):=\langle \mathbf{c}_m,\mathbf{v}\rangle$.

The functions $f_{\mathcal{C}1},\dots,f_{\mathcal{C}m} \colon \mathbf{V} \to \mathbb{R}$ so defined are called the component functions of $f$ relative to the basis $\mathcal{C}$, and they contain exactly the same information as the function $f$ itself. In particular, we have the following.

Proposition 1: Given a function $f \colon \mathbf{V} \to \mathbf{W}$ and vectors $\mathbf{p} \in \mathbf{V}$ and $\mathbf{q} \in \mathbf{W},$ we have

$\lim\limits_{\mathbf{v} \to \mathbf{p}} f(\mathbf{v}) = \mathbf{q}$

if and only if

$\lim\limits_{\mathbf{v} \to \mathbf{p}} f_{\mathcal{C}1}(\mathbf{v}) = q_1, \dots, \lim\limits_{\mathbf{v} \to \mathbf{p}} f_{\mathcal{C}m}(\mathbf{v})=q_m,$

where $q_1 = \langle \mathbf{c}_1,\mathbf{q}\rangle, \dots,q_m = \langle \mathbf{c}_m,\mathbf{q}\rangle$ are the components of the vector $\mathbf{q} \in \mathbf{W}$ relative to the orthonormal basis $\mathcal{C}=\{\mathbf{c}_1,\dots,\mathbf{c}_m\}$ of $\mathbf{W}$. In particular, the vector-valued function $f$ is continuous at $\mathbf{p} \in \mathbf{V}$ if and only if the scalar-valued functions $f_{\mathcal{C}1},\dots,f_{\mathcal{C}m}$ are continuous at $\mathbf{p}$.

Proof: Since the basis $\mathcal{C}$ is fixed, let us write $f_1,\dots,f_m$ as an abbreviation for $f_{\mathcal{C}1},\dots,f_{\mathcal{C}m}.$ We have

$f(\mathbf{v})-\mathbf{q} = (f_1(\mathbf{v})-q_1)\mathbf{c}_1+\dots+(f_m(\mathbf{v})-q_m)\mathbf{c}_m,$

hence

$\|f(\mathbf{v})-\mathbf{q}\|^2= (f_1(\mathbf{v})-q_1)^2+\dots+(f_m(\mathbf{v})-q_m)^2.$

Suppose first that $f(\mathbf{v}) \to \mathbf{q}$ as $\mathbf{v}\to \mathbf{p}$. Then, for any $\varepsilon>0$, there exists $\delta >0$ such that $\|\mathbf{v}-\mathbf{p}\| < \delta$ implies

$\|f(\mathbf{v})-\mathbf{q}\|^2=\sum_{i=1}^m (f_i(\mathbf{v})-q_i)^2 < \varepsilon,$

where we are assuming $\varepsilon <1$ so that $\varepsilon^2<\varepsilon.$ Since each term in the sum of squares is bounded by the total sum, this gives

$(f_i(\mathbf{v})-q_i)^2 < \varepsilon$

for each $i=1,\dots,m,$ which shows that $f_i(\mathbf{v})\to q_i$ as $\mathbf{v}\to\mathbf{p}.$

Conversely, suppose that for each $1 \leq i \leq m$ we have $f_i(\mathbf{v})\to q_i$ as $\mathbf{v}\to\mathbf{p},$ and let $\varepsilon > 0$ be given. Then, for each $1 \leq i \leq m$ there is a $\delta_i>0$ such that $\|\mathbf{v}-\mathbf{p}\|<\delta_i$ implies

$(f_i(\mathbf{v}-q_i)^2 < \frac{\varepsilon}{m}.$

Setting $\delta=\min(\delta_1,\dots,\delta_m)$, we thus have that $\|\mathbf{v}-\mathbf{p}\|<\delta$ implies

$\|f(\mathbf{v})-\mathbf{q}\|^2

which shows that $f(\mathbf{v})\to \mathbf{q}$ as $\mathbf{v} \to \mathbf{p}.$

Q.E.D.

Technicalities aside, the structure of the above proof is very simple: the argument is that if a sum of nonnegative numbers is small, then each term in the sum must be small, and conversely the sum of a finite number of small numbers is still small.

Now let us consider the above paragraphs simultaneously, meaning that we have chosen orthonormal bases $\mathcal{B} \subset \mathbf{V}$ and $\mathcal{C} \subset \mathbf{W}$. Then each coordinate function $f_{\mathcal{C}i}$ gives rise to an associated function $f_{\mathcal{BC}i} \colon \mathbf{R}^n \to \mathbb{R}$ defined by

$f_{\mathcal{BC}i}(x_1,\dots,x_n) :=\langle \mathbf{c}_i,f_\mathcal{B}(x_1,\dots,x_n)\rangle.$

In particular, upon choosing orthonormal bases $\mathcal{B} \subset \mathbf{V}$ and $\mathcal{C} \subset \mathbf{W}$, every function $f \colon \mathbf{V} \to \mathbf{W}$ gives rise to an associated function $f_{\mathcal{BC}} \colon \mathbb{R}^n \to \mathbb{R}^m$ defined by

$f_{\mathcal{BC}}(x_1,\dots,x_n) = (f_{\mathcal{BC}1}(x_1,\dots,x_n),\dots,f_{\mathcal{BC}m}(x_1,\dots,x_n)).$

Example 1: Let $f \colon \mathbf{V} \to \mathbf{V}$ be the function defined by $f(\mathbf{v})=\mathbf{p}+\mathbf{v}$, where $\mathbf{p} \in \mathbf{V}$ is a specified vector. This function is called “translation by $\mathbf{p}.$” Choose an orthonormal basis $\mathcal{B} \subset \mathbf{V}$, and suppose that the coordinate vector of $\mathbf{p}$ relative to this basis is $(p_1,\dots,p_n) \in \mathbb{R}^n$. Then, the function $f_{\mathbb{B}} \colon \mathbb{R}^n \to \mathbb{R}^n$ is given by

$f_{\mathcal{B}}(x_1,\dots,x_n) = (p_1,\dots,p_n) + (x_1,\dots,x_n) = (p_1+x_1,\dots,p_n+x_n),$

and

$f_{\mathcal{BB}i}(x_1,\dots,x_n) = p_i+x_i, \quad i =1,\dots,n.$

The above discussion shows that the perspective of vector calculus, in which we consider functions of a single vector variable $\mathbf{v}$, is equivalent to the perspective of multivariable calculus, in which we consider functions of multiple scalar variables $x_1,\dots,x_n$. From this perspective, one might wonder about the prospect of a “multivector” calculus in which we consider functions $f(\mathbf{v}_1,\dots,\mathbf{v}_n)$ of multiple vector variables, where it may even be that each vector variable $\mathbf{v}_i$ ranges over its own Euclidean space $\mathbf{V}_i$. In fact, this is already included in vector calculus, because such $n$-tuples of vectors are themselves single vectors in an enlarged Euclidean space.

Definition 1: Given Euclidean spaces $(\mathbf{V}_1, \langle \cdot,\cdot \rangle_1,)\dots,(\mathbf{V}_n,\langle \cdot,\cdot\rangle_n),$ their direct product is the Euclidean space consisting of the Cartesian product

$\mathbf{V}_1 \times \dots \times \mathbf{V}_n = \{(\mathbf{v}_1,\dots,\mathbf{v}_n) \colon \mathbf{v}_i \in \mathbf{V}_i$

with vector addition and scalar multiplication defined component-wise, i.e.

$(\mathbf{v}_1,\dots,\mathbf{v}_n) + (\mathbf{w}_1,\dots,\mathbf{w}_n) := (\mathbf{v}_1+\mathbf{w}_1,\dots,\mathbf{v}_n+\mathbf{w}_n)$

and

$t(\mathbf{v}_1,\dots,\mathbf{v}_n) := (t\mathbf{v}_1,\dots,t\mathbf{v}_n),$

and scalar product defined by

$\langle (\mathbf{v}_1,\dots,\mathbf{v}_n),(\mathbf{w}_1,\dots,\mathbf{w}_n)\rangle = \langle \mathbf{v}_1,\mathbf{w}_1 \rangle_1 + \dots + \langle \mathbf{v}_n,\mathbf{w}_n \rangle_n.$

It is good to be comfortable with both perspectives; the former is better for conceptual understanding, while the latter is useful for visualization and calculation.

Thus the calculus of functions of two vector variables, $f(\mathbf{v}_1,\mathbf{v}_2)$, is just the calculus of functions on the direct product Euclidean space $\mathbf{V}_1 \times \mathbf{V}_2$. Equivalently, the calculus of functions $f((x_1,\dots,x_{n_1}),(y_1,\dots,y_{n_2}))$ is the same thing as the calculus of functions $f(x_1,\dots,x_{n_1},y_1,\dots,y_{n_2}).$ There is in fact a further generalization of vector calculus called tensor calculus, which is very useful in physics and engineering (particularly in the theory of relativity and in materials science), but that is beyond the scope of this course.

Example 2: It may be tempting to throw away the more abstract perspective entirely, and in the previous lectures I have been arguing against doing this by holding up the example of the function

$f \colon \mathrm{Sym} \mathbf{V} \longrightarrow \mathbb{R}^n$

which sends each symmetric operator $S$ on $\mathbf{V}$ to the list $(s_1,\dots,s_n)$ of its eigenvalues arranged in weakly decreasing order. Conceptually, the function which sends a symmetric operator to its eigenvalues is very natural, and something you can hold in your mind quite easily. However, it is not easy to work concretely with this function by choosing coordinates. On Problem Set 1, we showed how to the choice of a basis $\mathcal{B} \subset \mathbf{V}$ leads to a corresponding basis $\mathcal{S} \subset \mathrm{Sym}\mathbf{V}$, and in particular that if $\dim \mathbf{V}=n$ then $\dim \mathrm{Sym}\mathbf{V} = {n+1 \choose 2}$. So, according to our discussion above we have an associated function

$f_\mathcal{S} \colon \mathbb{R}^{n+1 \choose 2} \longrightarrow \mathbb{R}^n.$

Moreover, if we choose the standard basis $\mathcal{E} \subset \mathbb{R}^n$, then we have component functions

$f_{\mathcal{SC}i}(S) \colon \mathbb{R}^{n+1 \choose 2} \longrightarrow \mathbb{R}$

which send a symmetric operator $S$ to its $i$th largest eigenvalue,

$f_{\mathcal{SC}i}(S)=s_i$,

and writing down a formula for these functions in terms of the coordinates of $S$ relative to $\mathcal{S}$ amounts to writing down a formula for the eigenvalues of a symmetric matrix in terms of its entries, and doing this for $n \geq 5$ is in a sense impossible. You’ll work out the $n=2$ case on PSet 2. Again, this all points to the need to be able to do approximations a la calculus, a question which we return to now.

We now come back to our general discussion of functions $f \colon \mathbf{V} \to \mathbf{W}$ from one Euclidean space to another. In linear algebra, we consider only the case where $f$ is linear, and in that context what matters are associated vector spaces like the kernel and image of $f$. However, to study more general (i.e. non-linear functions) between Euclidean spaces we need a more general vocabulary that includes a larger variety of special subsets of Euclidean space.

Defintion 1: Given a vector $\mathbf{v} \in \mathbf{V}$ and a number $\delta \in \mathbb{R}$, the open ball of radius $\delta$ centered at $\mathbf{v}$ is the subset of $\mathbf{V}$ defined by

$B_\delta(\mathbf{v}) =\{\mathbf{h} \in \mathbf{V} \colon \|\mathbf{v}-\mathbf{h}\| < \delta\}.$

This is the set of vectors $\mathbf{h}$ whose distance to $\mathbf{v}$ is strictly less than $\delta$. Observe that $B_\delta(\mathbf{v})$ is the empty set unless $\delta>0$.

In terms of open balls, the continuity of a function $f \colon \mathbf{V} \to \mathbf{W}$ at a point $\mathbf{v} \in \mathbf{V}$ may be formulated as follows: $f$ is continuous at $\mathbf{v}$ is and only if it has the property that, for any given $\varepsilon > 0$, there is a corresponding $\delta > 0$ such that the image of $B_\delta(\mathbf{v})$ under $f$ is contained in $B_\varepsilon(f(\mathbf{v}))$.

Definition 2: A subset $S$ of a Euclidean space $\mathbf{V}$ is said to be open if for any $\mathbf{v} \in \mathbb{V}$ there exists $\delta>0$ such that $B_\delta(\mathbf{v}) \subseteq S$. A subset $T$ of $\mathbf{V}$ is said to be closed if its complement $\mathbf{V}\backslash T$ is open.

There is a characterization of continuous functions in terms of open and closed sets.

Theorem 1: A function $f \colon \mathbf{V} \to \mathbf{W}$ is continuous if and only if the preimage $f^{-1}(S)$ of any open set $S \subseteq \mathbf{W}$ is open. Equivalently, $f$ is continuous if and only if the preimage $f^{-1}(T)$ of any closed set $T \subseteq \mathbf{W}$ is closed.

We won’t use Theorem 1 much, so we shall skip the proof – you will see this result again in a real analysis course.

We can also characterize continuity of a function as continuity of its components.

Theorem 2: A function $f \colon \mathbf{V} \to \mathbf{W}$ is continuous at $\mathbf{v} \in \mathbf{V}$ if and only if its component functions $f_{\mathcal{C}1},\dots,f_{\mathcal{C}m} \colon \mathbf{V} \to \mathbb{R}$ relative to an arbitrary orthonormal basis $\mathcal{C} \subset \mathbf{W}$ are continuous.

Definition 3: A set $B \subseteq \mathbf{V}$ is said to be bounded if there exists $\delta > 0$ such that $B \subseteq B_\delta(\mathbf{0}_\mathbf{V}),$ i.e. if $\|\mathbf{v}\| < \delta$ for all $\mathbf{v} \in B$.

Definition 4: A set $K \subseteq \mathbf{V}$ is said to be compact if it is closed and bounded.

Theorem (Extreme Value Theorem): If $K \subset \mathbf{V}$ is compact, every continuous function $f \colon K \to \mathbb{R}$ attains a maximum and a minimum: there are points $\mathbf{p},\mathbf{q} \in K$ such that

$f(\mathbf{p}) \geq f(\mathbf{v}) \text{ and } f(\mathbf{q}) \leq f(\mathbf{v}) \quad \forall \mathbf{v} \in \mathbf{V}$.

The point $\mathbf{p}$ is said to be a maximizer of $f$, and $\mathbf{q}$ is said to be a minimizer of $f$.

There is a particular situation in which we can say more about maximizers and minimizers. Recall that a linear combination of vectors $\mathbf{v}_1,\dots,\mathbf{v}_r \in \mathbf{V}$ is an expression of the form

$t_1\mathbf{v}_1 + \dots + t_r\mathbf{v}_r,$

where $t_1,\dots,t_r \in \mathbb{R}$ are arbitrary scalars, and that the linear span of $\mathbf{v}_1,\dots,\mathbf{v}_n$ is the subset of $\mathbf{V}$ consisting of all linear combinations of the these vectors,

$\mathrm{Span}(\mathbf{v}_1,\dots,\mathbf{v}_n) = \{\sum_{i=1}^r t_i\mathbf{v}_i \colon t_i \in \mathbb{R}\}.$

There is constrained version of this in which we consider only linear combinations whose scalar coefficients are nonnegative and sum to $1$. These special linear combinations of are called convex combinations, and the set

$\mathrm{Conv}(\mathbf{v}_1,\dots,\mathbf{v}_n) = \{\sum_{i=1}^r t_i\mathbf{v}_i \colon t_i \geq 0, \sum_{i=1}^r t_i = 1\}$

of all convex combinations of $\mathbf{v}_1,\dots,\mathbf{v}_n.$ For example, the convex hull of two vectors may be visualized as the line segment whose endpoints are $\mathbf{v}_1$ and $\mathbf{v}_2$, while the convex hull of three vectors may be visualized as the triangular region whose vertices are $\mathbf{v}_1,\mathbf{v}_2,\mathbf{v}_3$.

Theorem (Convex Optimization Theorem): For any finite set of vectors $S=\{\mathbf{v}_1,\dots,\mathbf{v}_r\} \subset \mathbf{V}$, every linear function $f \colon \mathrm{Conv}(S) \to \mathbb{R}$ has a maximizer and a minimizer in $S$.

Here is a very interesting example of a convex hull. Let $\mathbf{V}$ be a Euclidean space, and let $\mathbf{B}=\{\mathbf{b}_1,\dots,\mathbf{b}_n\}$ be an orthonormal basis in $\mathbf{V}$. Recall that a permutation is a bijective function

$\pi \colon \{1,\dots,n\} \longrightarrow \{1,\dots,n\}.$

If we write the table of values of such a function as a $2 \times n$ matrix,

$\pi = \begin{pmatrix} 1 & 2 & \dots & n \\ \pi(1) & \pi(2) & \dots & \pi(n) \end{pmatrix},$

then the bottom row of the matrix consists of the numbers $1,2,\dots,n$ arranged in some order. For example, in the case $n=3$, the permutations consist of the following $6$ matrices:

$\begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix}$

$\begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix}$

$\begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 3 \end{pmatrix}$

$\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \end{pmatrix}$

$\begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix}$

$\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix}$

For each permutation $\pi$, the associated permutation operator on $\mathbf{V}$ is defined by its action on the basis $\mathcal{B}$, which is given by

$P_\pi(\mathbf{e}_j) = e_{\pi(j)}, \quad 1 \leq j \leq n.$

For example, in the case $n=3$, the matrices of these operators relative to the basis $\mathcal{B}$ are, in the same order, as follows:

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$

$\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$

$\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$

$\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}$

Since these matrices have exactly one $1$ in every row and column, with all other entries $0$, they obviously have the property that each row and column sums to $1$. There are many more such matrices, however, and the following theorem characterizes them as the convex hull of the permutation matrices.

Theorem (Birkhoff-von Neumann theorem): The convex hull of the permutation operators consists of all operators on $\mathbf{V}$ whose matrices relative to the basis $\mathcal{B}$ have nonnegative entries, and whose rows and columns sum to $1$.

We now have everything we need to prove that the eigenvalue map is continuous. In fact, we prove the following stronger result.

Theorem (Hoffman-Wielandt inequality): Let $f \colon \mathrm{Sym}\mathbf{V} \to \mathbb{R}^n$ be the function which sends each symmetric operator on $\mathbf{V}$ to its eigenvalues listed in weakly decreasing order. Then, we have

$\|f(S)-f(T)\| \leq \|S-T\| \quad \forall S,T \in \mathrm{Sym}\mathbf{V}.$

Proof: Let us make sure we understand which norms we are using. On the LHS of the inequality, we have the norm on $\mathbb{R}^n$ corresponding to the standard inner product, so that if

$f(S)=(s_1,\dots,s_n) \quad\text{ and }\quad f(T)=(t_1,\dots,t_n),$

then

$\|f(S)-f(T)\|^2 = (s_1-t_1)^2 + \dots + (s_n-t_n)^2,$

or equivalently

$\|f(S)-f(T)\|^2 = (s_1^2+\dots+s_n)^2 - 2(s_1t_1 + \dots + s_nt_n) + (t_1^2 + \dots + t_n^2).$

On the right hand side, we have the Frobenius norm for operators,

$\|S-T\|^2 = \mathrm{Tr}(S-T)^2 = \mathrm{Tr} S^2 - 2\mathrm{Tr} ST + \mathrm{Tr} T^2.$

Observing that

$\mathrm{Tr}S^2 = s_1^2+\dots+s_n^2 \quad\text{ and }\quad \mathrm{Tr}T^2 = t_1^2+\dots+t_n^2,$

we thus have that proving $\|S-T\|^2 - \|f(S)-f(T)\| \geq 0$, which is our objective, is equivalent to proving

$\mathrm{Tr}ST \leq s_1t_1 + \dots + s_nt_n,$

which we do now.

Since $S$ is a symmetric operator on $\mathbf{V}$, by the Spectral Theorem there exists and orthonormal basis $\mathcal{S}=\{\mathbf{s}_1,\dots,\mathbf{s}_n\}$ of $\mathbf{V}$ such that

$S\mathbf{s}_j = s_j \mathbf{s}_j \quad j = 1,\dots,n.$

From Lecture 1, we have the formula

$\mathrm{Tr} ST = \sum\limits_{i,j=1}^n \langle \mathbf{s}_i,S\mathbf{s}_j\rangle \langle \mathbf{s}_j,T\mathbf{s}_i \rangle,$

so that

$\mathrm{Tr} ST = \sum\limits_{j=1}^n s_j \langle \mathbf{s}_j,T\mathbf{s}_j \rangle.$

Invoking the Spectral Theorem again, there is an orthonormal basis $\mathcal{T}=\{\mathbf{t}_1,\dots,\mathbf{t}_n\}$ such that

$T\mathbf{t}_j = t_j\mathbf{t}_j, \quad j=1,\dots,n.$

We then have that

$\langle \mathbf{s}_j,T\mathbf{s}_j \rangle = \left\langle \sum_{i=1}^n \langle \mathbf{t}_i,\mathbf{s}_j\rangle \mathbf{t}_i ,\sum_{k=1}^n \langle \mathbf{t}_k,\mathbf{s}_j\rangle T\mathbf{t}_k\right\rangle = \sum\limits_{i=1}^n t_i\langle \mathbf{t}_i,\mathbf{s}_j\rangle^2,$

and plugging this into the above we finally have

$\mathrm{Tr} ST = \sum\limits_{i,j=1}^n s_it_j \langle \mathbf{t}_i,\mathbf{s}_j\rangle^2.$

Now observe that the matrix

$D = \begin{bmatrix} {} & \vdots & {} \\ \dots & \langle \mathbf{t}_i,\mathbf{s}_j\rangle^2 & \dots \\ {} & \vdots & {} \end{bmatrix}$

has nonnegative entries, and also each row and column of $D$ sums to $1$ (why?). Thus, if we define a function on the convex hull $K$ of the $n \times n$ permutation matrices by

$g(M) = \sum\limits_{i,j=1}^n s_it_j \langle m_{ij},$

then this function is linear and we have $\mathrm{Tr} ST=g(D)$. It now follows from the convex optimization theorem together with the Birkhoff-von Neumann theorem that

$\mathrm{Tr} ST = g(D) \leq \max_\pi g(P_\pi),$

where the maximum is over all permutations $\pi$. Evaluating the right hand side, we get that

$\mathrm{Tr} ST \leq \max_\pi\sum_{i,j=1}^n s_it_{\pi(i)} = \sum\limits_{i=1}^n s_it_i,$

where the final equality follows from the fact that $s_1 \geq \dots \geq s_n$ and $t_1 \geq \dots \geq t_n.$

Q.E.D.

Corollary 1: The eigenvalue function $f \colon \mathrm{Sym} \mathbf{V} \to \mathbb{R}^n$ is continuous.

Proof: A function $h \colon X \to Y$ from one metric space to another is said to be a “contraction” if

$\mathrm{d}_Y(h(x_1),h(x_2)) \leq \mathrm{d}_X(x_1,x_2) \quad \forall x_1,x_2 \in X.$

Thus, a contraction is a function which brings points closer together, or more precisely doesn’t spread them farther apart. It is immediate to check that the definition of continuity holds for any contraction (in fact, we can choose $\delta = \varepsilon$ when checking continuity). The Hoffman-Wielandt inequality says that the eigenvalue mapping is a contraction: the distance between the eigenvalue vectors $f(S),f(T)$ of symmetric operators $S,T$ is at most Frobenius distance between $S$ and $T.$

Q.E.D.