Math 202C: Lecture 21 -The sensitivity conjecture

Preliminary definitions and Notations

First up, a list of preliminary notations and definitions of terms that recur throughout this blog post.

  • Q^n denotes the n-dimensional hypercube graph whose vertex set consists of all strings in \{-1,1\}^n and two vertices are adjacent iff the corresponding strings differ exactly in one-coordinate.
  • For an undirected graph G, the maximum degree of a vertex in G is denoted by \Delta(G) and \lambda_1(G) denotes the largest eigenvalue of the adjacency matrix of G.
  • For x \in \{-1,1\}^n and a subset S of indices from [n] = \{1, 2, \cdots, n\}, the string obtained from x by flippling all entries of x corresponding to the indices in S is denoted by x^S.
  • Any function f : \{-1,1\}^n \rightarrow \{-1,1\} is called a boolean function.

Definition (Local sensitivity and sensitivity). Given a boolean function f : \{-1,1\}^n \rightarrow \{-1,1\}, the local sensitivity on the input x, denoted by s(f,x), is defined as the number of indices i, such that f(x) \neq f(x^{\{i\}}), and the sensitivity s(f) of f is \max_x s(f,x).

Definition (Local block sensitivity and sensitivity). Given a boolean function f : \{-1,1\}^n \rightarrow \{-1,1\}, the local block sensitivity on the input x, denoted by bs(f,x) is defined as the maximum number of disjoint blocks B_1, \cdots B_k of [n] such that for each B_i, f(x) \neq f(x^{B_i}), and the block sensitivity bs(f) of f is \max_x bs(f,x).

The sensitivity conjecture and the tale of three theorems

(Sensitivity conjecture) There exists an absolute constant C >0, such that for every boolean function f, bs(f) \leq s(f)^C.

Sensitivity and block sensitivity are examples of complexity measures for boolean functions. Two complexity measures A and B are said to be polynomially related to each other if there exist polynomials p(t), q(t) such that for any Boolean function f, we have A(f) \leq p(B(f)) and B(f) \leq q(A(f)). The block sensitivity is known to be polynomially related to other complexity measures of boolean functions such as the degree (to be defined later). However, the question (a.k.a the sensitivity conjecture) of whether the block sensitivity is polynomially related to the sensitivity was posed as an open problem by Nisan and Szagedy in 1989 (Note that bs(f) \geq s(f) by a direct consequence of their definitons. The open problem was showing whether bs(f) \leq s(f)^C for some C >0). A few years later in 1992, Gotsman and Linial [2] showed the equivalence of the sensitivity conjecture to another open problem on the hypercube. Finally in 2019, Hao Huang resolved the sensitivity conjecture by leveraging on the equivalence shown by Gotsman and Linial.

Cue harmonic analysis on finite abelian groups

In lecture post 7 by Prof. Novak, we saw that \mathrm{B}(n) is isomorphic to the group \mathbb{Z}_2^n, i.e. bitstrings of length n (Recall that B(n) denotes the subgroup of \mathrm{S}(2n) generated by the n disjoint transpositions \tau_1=(1\ 2),\ \tau_2=(3\ 4),\ \dots, \tau_n = (2n-1\ 2n)). Thus, a boolean function f:  \{-1,1\}^n \rightarrow \{-1,1\} , can be identified with a member of the Hilbert space \mathbb{C}^{ \mathrm{B}(n) } and the rich tools of harmonic analysis on B(n) discussed in lecture posts 7-9 are applicable to boolean functions. Note that the hypercube Q^n is the subgraph of the Caley graph of S(2n) induced by the subgroup B(n). Thus, the sensitivity conjecture can be rephrased in the language of induced subgraphs of the hypercube Q^n and this is exactly the essence of the work done by Gotsman and Linial in 1992.

Before we dive into the details, a few (post)-preliminary notations and remarks.

  • Given an induced subgraph H of Q^n, let Q^n- H denote the subgraph of Q^n induced on the vertex set V(Q) \backslash V(H).
  • For S \subseteq [n] := \{1,2, \cdots, n\}, the function \chi_S : \{-1,1\}^n \rightarrow \{+1,-1\} is defined as \chi_S(x) = \Pi_{i \in S}x_i, with the convention that for all x, \: \: \chi_{\phi}(x) = 1 where \phi denotes the empty set.
  • In lecture post 7 (and also in the qual exam) we saw that the functions \{\chi_S\}_{S \subseteq [n]} (which exactly correspond to the characters of B(n) since \mathbb{Z}_2^n is isomorphic to \mathrm{B}(n)) form an orthogonal basis for \mathbb{C}^{ \mathbb{Z}_2^n } under the inner product \langle f,g \rangle = \sum_{x \in \{-1,1\}^n} f(x)g(x).
  • Thus, given a boolean function f, by Fourier expansion, f can be written as f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\chi_S(x), where \hat{f} denotes the Fourier coefficients.
  • The average of a boolean function f on Q^n is denoted as \mathbf{E}(f) : = \frac{1}{2^n}\sum_{x \in Q^n}f(x). Note that since \chi_{\phi}(x) = 1 for all x, we have the identity \hat{f}(\phi) = \mathbf{E}(f) which is quite useful when proving theorem 1 (see below).

Definition. Given a boolean function f, the degree of f is defined as deg(f) := \max _{S \subseteq [n]}\{|S| \: | \: \hat{f}(S) \neq 0\}.

Theorem 1 [Gotsman and Linial]. The following two statements are equivalent for any monotone strictly increasing function h: \mathbb{N} \rightarrow \mathbb{R}.
a. For any induced subgraph H of Q^n with |V(H)| \neq 2^{n-1}, we have \max \{ \Delta(H), \Delta(Q^n - H)\} \geq h(n).
b. For any boolean function f, we have that s(f) \geq h(deg(f)).

Gotsman and Linial show that proving the equivalence of statements a and b above is equivalent to proving the equivalence of yet another two statements and then proceed to prove this second equivalence.

Proof of Theorem 1. First, we show the equivalence of statement a to following statement a’:

a’. For any boolean function g, \mathbf{E(g)} \neq 0 implies there exists an input x such that n - s(g,x) \geq h(n).

a’ implies a : Given an induced subgraph H of Q^n, it can be associated with a corresponding boolean function g defined such that g(x) = 1 iff x \in V(H). By a direct consequence of the definitions of adjacency in Q^n and the local sensitivity of g, \: \: \text{deg}_H(x) = n - s(g,x) for x \in V(H) and \text{deg}_{Q^n -H}(x) = n - s(g,x) for x \in V(Q) \backslash V(H) (Note that here \text{deg}_H(x) denotes the degree of the vertex x in the subgraph H and is not to be confused with the degree of a boolean function that was defined earlier). Thus, \max \{ \Delta(H), \Delta(Q^n - H)\}  =  \max_x  \{n - s(g,x)\}. Let \mathbf{E}(g) denote the average value of g on Q^n, i.e. \mathbf{E}(g) = \frac{1}{2^n}\left[ \sum_{x \in V(H)}g(x) + \sum_{x  \in V(Q)\backslash V(H)}g(x) \right]. Note that V(H) \neq 2^{n-1} implies \mathbf{E}(g) \neq 0. Thus, statement a’ implies \max \{ \Delta(H), \Delta(Q^n - H)\}  =   \max_x  \{n - s(g,x)\} \geq h(n).

a implies a’ : Given a boolean function g, define H to be the subgraph induced by the set of vertices \{x \in Q^n \: | \: g(x) = 1\}. By an identical arument as in the previous paragraph, \max \{ \Delta(H), \Delta(Q^n - H)\}  =  \max_x  \{n - s(g,x)\}. Also, \mathbf{E}(G) \neq 0 implies V(H) \neq 2^{n-1}. Thus, statement a implies \max \{ \Delta(H), \Delta(Q^n - H)\}  =  \max_x  \{n - s(g,x)\} \geq \sqrt{n} thereby showing the equivalent of statements a and a’.

Now, we show the equivalence of statement b to statement b’:

b’. For any boolean function f, s(f) < h(n) implies deg(f) < n.

Assuming b and that the premise for b’ is true, then h(deg(f)) < s(f) < h(n) and since h is assumed to be monotonic strictly increasing this shows that deg(f) < n and thus b implies b’. Now, towards showing the other implication let f denote a boolean function with degree d \leq n. Assume without loss of generality that for S =\{1, \cdots, d\}, \: \: \hat{f}(S) \neq 0 . Now, consider the function g : \{-1,1\}^d \rightarrow \{-1,1\} defined as g(x_1, x_2, \cdots x_d) := f(x_1, x_2, \cdots, x_d, -1, -1, \cdots -1). By construction s(f) \geq s(g) and deg(g) = d, i.e g has full degree as a boolean function on Q^d. Now, the contrapositive of b’ says that deg(g) = d implies s(g) \geq h(d). Thus, s(f) \geq  s(g) \geq h(d) = h(deg(f)) and this finshes showing the equivalence of b and b’.

Thus, proving theorem 1 has been reduced to the task of proving the equivalence of statements a’ and b’. Now, given a boolean function f, define another boolean function g(x) = f(x)\chi_{[n]}(x) where \chi_{[n]}(x) = \Pi_{i=1}^nx_i is the parity function of x. Note that \hat{g}(S) =  \frac{1}{2^n}\sum_{x \in Q^n}f(x) \chi_{[n]}(x)   \chi_S(s). Thus,

\hat{g}(S)  = \frac{1}{2^n}\sum_{x \in Q^n}f(x)   (\Pi_{i=1}^nx_i)   (\Pi_{i \in S}x_i) = \frac{1}{2^n}\sum_{x \in Q^n}f(x) \Pi_{i \notin S}x_i = \hat{f}([n] \backslash S)

Thus, for all S \subseteq [n], \hat{g}(S) = \hat{f}([n]\backslash S). Also note that for all x \in Q^n, s(g,x) = n- s(f,x) because every bit flip which causes a change in value of f(x) does not cause a change in function value of g(x) and vice-versa. Thus, \mathbf{E}(g) = \hat{g}(\phi) = \hat{f}([n]) where \phi denotes the empty set.

a’ implies b’: Towards a contradiction assume that s(f) < h(n) and deg(f) = n . This by definition implies \hat{f}([n]) \neq 0 and thus \mathbf{E}(g) \neq 0. By a’, there exist a x such that s(g,x) \leq n - h(n) and since s(g,x) = n- s(f,x) this implies there exists a x such that s(f,x) \geq h(n) which contradicts the premise that s(f) < h(n).

b’ implies a’: Again towards a contradiction assume that given g such that \mathbf{E}(g) \neq 0 we have that for all x, \: s(g,x) > n- h(n). Since s(g,x) = n- s(f,x) this implies that s(f) < h(n) which by 2′ implies deg(f) <n. Note that deg(f) < n is equivalent to 0 = \hat{f}([n]) and thus \hat{f}([n]) = \hat{g}(\phi) = \mathbf{E}(g) = 0 which contradicts the premise that \mathbf{E}(g) \neq 0 and this completes the proof of theorem 1.

Theorem 2 [Huang] . For every integer n \geq 1, let H be an arbitrary (2^{n-1}+1)-vertex induced subgraph of Q^n, then

\Delta(H) \geq \sqrt{n}

The proof of theorem 2 is going to be the main focus for the remainder of this blogpost but before that we discuss two corollaries which show why theorem 2 resolves the sensitivity conjecture.

Corollary 1. For every boolean function f, s(f) \geq \sqrt{deg(f)}.

Proof of Corollary 1. For any induced subgraph H of Q^n with |V(H)| \neq 2^{n-1}, since the total number of vertices in Q^n is 2^n, either H contains atleast 2^{n-1}+1 vertices or Q^n - H does. So without loss of generality, assume H contains atleast 2^{n-1}+1 vertices (If not, just replace H with Q^n -H in the remainder of the proof). Now, theorem 2 (along with the fact that the maximum degree \Delta(H) is non-decreasing in the number of vertices in H) implies \max \{ \Delta(H), \Delta(Q^n - H)\} \geq \sqrt{n}. Let h(n) := \sqrt{n} and note that it is a monotone strictly increasing function on \mathbb{N}. Now, theorem 1 implies s(f) \geq \sqrt{deg(f)} and this completes the proof.

Theorem 3 [Tal]. For every boolean function f, bs(f) \leq deg(f)^2.

It should be noted that theorem 3 is an imporvement on the work by Nisan and Szegedy in 1992 wherein they showed that bs(f) \leq 2deg(f)^2.

Corollary 2. For every boolean function f, bs(f) \leq s(f)^4.

Note that corollary 2 follows immediately from theorem 3 and corollary 1. Thus, corollary 2 confirms the sensitivity conjecture by showing that the block sensitivity and the sensitivity of a boolean function are indeed polynomially related.

Cue matrix analysis – “A proof that came straight from THE Book”

In this section, we will focus on Hao Huang’s proof of theorem 2. The core chunk of the proof invovles a creative construction of an iterative sequence of matrices whose n^{th} instance is closely related to the adjacency matrix of Q^n and an application of Cauchy’s interlace theorem which is stated as follows,

(Cauchy’s interlace theorem) Let A be a symmetric n \times n, B be a m \times m principal submatrix of A for some m<n. If the eigenvalues of A are \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n, and the eigenvalues of B are \mu_1 \geq \mu_2 \geq \cdots \geq \mu_n, then for all 1 \leq i \leq m,

\lambda_i \geq \mu_i \geq \lambda_{i+n-m}

Recall that we saw in Math 202A that Cauchy’s interlace theorem can be proven via a direct application of the Courant-Fischer theorem.

Now, we move on to the iterative matrix sequence that turns out to be a bite-sized kryptonite for the sensitivity conjecture. Consider the sequence of symmetric square matrices defined iteratively as follows,

B_1 := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad B_n :=\begin{pmatrix} B_{n-1} & I \\ I & -B_{n-1} \end{pmatrix}

The first exercise is to study the spectrum of B_n and the second exercise is to compute an upper bound for the largest eigenvalue \lambda_1 of a specific type of matrix. These two exercises then set us up nicely for a clever application of Cauchy’s interlace theorem to prove theorem 2.

Exercise 1.
a)
Show that B_n^2 = nI where I denotes the 2^n \times 2^n identity matrix.
b) Show that the trace of B_n is 0.
c) Conclude that the eigenvalues of B_n are \sqrt{n} with multiplicity 2^{n-1} and -\sqrt{n} with multiplicity 2^{n-1} .

Exercise 2. Suppose H is a m-vertex undirected graph and B is a symmetric matrix whose entries are in \{-1,0,1\}, and whose rows and columns are indexed by the vertex set of H. Also assume B_{u,v} = 0 whenever u and v are non-adjacent vertices in H. Then show that

\Delta(H) \geq \lambda_1 := \lambda_1(B)

The defintion of this iterative sequence of matrices \{B_j\}_1^n is not without precedent. Note that the adjacency matrix of Q^n can also be defined in a similarly iterative manner. Recall that a vertex of Q^n is a string in \{-1,1\}^n. Given a (n-1)-length string of -1s and 1s it naturally extends to exactly two strings in \{-1,1\}^n and these two n-strings correspond to the only two vertices of Q^n whose corresponding strings have their first n-1 entries to be the same as the original n-1 string (note that these two vetices of Q^n are adjacent). Exapnding on this observation, the hypercube Q^n can be thought of as having two Q^{n-1} sub-hypercubes with a perfect matching connecting these two subcubes. In otherwords, consider the following sequence of iteratively defined matrices,

A_1 := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad A_n :=\begin{pmatrix} A_{n-1} & I \\ I & A_{n-1} \end{pmatrix}

Then A_n represents the adjacency matrix of Q^n and in particular, note that A_n is equal to the matrix obtained by changing every (-1) entry of B_n to 1.

Now that we have all the necessary ingredients, let’s discuss the proof of Theorem 2.

(Restated) Theorem 2 [Huang] . For every integer n \geq 1, let H be an arbitrary (2^{n-1}+1)-vertex induced subgraph of Q^n, then

\Delta(H) \geq \sqrt{n}

Proof of Theorem 2 [Haung].

Let B_n be the sequence of matrices that we defined above and let H be a (2^{n-1} + 1)-vertex induced subgraph of Q^n. Note that H naturally induces a principal submatrix B_H of B_n (written with respect to an appropriate permutation of the standard basis). As discussed in the preceeding paragraph, by changing every (-1) entry of B_n to 1, we get the adjacency matrix of Q^n. So, in particular, B_H and H satisfy the conditions of the statement in Exercise 2 and thus,

\Delta(H) \geq \lambda_1(B_H)

Also, since B_H is a (2^{n-1} + 1) \times (2^{n-1} + 1) principal submatrix of the 2^n \times 2^n matrix B_n, by Cauchy’s interlace theorem,

\lambda_1(B_H) \geq \lambda_{2^{n-1}}(B_n)

From Exercise 1, we know that the eigenvalues of B_n are \sqrt{n} with multiplicity 2^{n-1} and -\sqrt{n} with multiplicity 2^{n-1}. Thus, \lambda_{2^{n-1}}(B_n) = \sqrt{n} and thus,

\Delta(H) \geq \lambda_1(B_H) \geq \lambda_{2^{n-1}}(B_n) = \sqrt{n}

and this completes the proof.

Closing Remarks

I will end this blog post for our Math 202C class with a quote from Scott Aaronson’s blog post about Hao Huang’s proof of the sensitivity conjecture.

Paul Erdös famously spoke of a book, maintained by God, in which was written the simplest, most beautiful proof of each theorem. The highest compliment Erdös could give a proof was that it “came straight from the book.” In this case, I find it hard to imagine that even God knows how to prove the Sensitivity Conjecture in any simpler way than this.

-Scott Aaronson

The funny thing is, a few days after Prof. Aaronson’s blogpost, his former student Shalev Ben-David posted in the comments section, a simplication of Huang’s proof which does not use Cauchy’s interlacing theorem.

Anyways, talking about the sensitivity conjecture is indeed quite a fitting end to the Math 202 series since both, Gotsman & Linial’s work and Hao Huang’s work on this 30 year old ex-open problem utilize almost exclusively, concepts which we have studied over the past three quarters.

Thanks for reading my post and have a great summer everyone!

References
  • [1] Huang, H. (2019). Induced subgraphs of hypercubes and a proof of the sensitivity conjecture. Annals of Mathematics190(3), 949-955.
  • [2] N. Nisan, M. Szegedy, On the degree of Boolean functions as real polynomials, Comput. Complexity, 4 (1992), 462–467.
  • [3] C. Gotsman, N. Linial, The equivalence of two problems on the cube, J. Combin.Theory Ser. A, 61 (1) (1992), 142–146.
  • [4] F. Chung, Z. F¨uredi, R. Graham, P. Seymour, On induced subgraphs of the cube, J. Comb. Theory, Ser. A, 49 (1) (1988), 180–187.
  • [5] A. Tal, Properties and applications of boolean function composition, Proceedings of the 4th conference on Innovations in Theoretical Computer Science, ITCS ’13, 441–454.
  • [6] Karthikeyan, Rohan & Sinha, Siddharth & Patil, Vallabh. (2020). On the resolution of the sensitivity conjecture. Bulletin of the American Mathematical Society. 57. 1. 10.1090/bull/1697.
  • [7] https://www.scottaaronson.com/blog/?p=4229
  • [8] Lecture notes for Math 202A by Prof. Philip Gill.
  • [9] D. Kroes, J. Naranjo, J. Nie, J. O’ Neill, N. Seiger, S. Spiro, E. Zhu, Linear Algebra methods in Combinatorics, ABACUS notes Fall 2019 UCSD.

Math 202C: Lecture 20 – Uncertainty Principle

Author: Zhaolong Han

In this lecture, we will introduce the Fourier transform on finite Abelian group and a discrete version of “uncertainty principle” based on the chapter 2 of “Discrete Harmonic Analysis” by Ceccherini-Sillberstein[1]. In quantum mechanics, there is a well-known uncertainty principle saying that the more precisely the position of some particle is determined, the less precisely its momentum can be predicted from initial conditions, and vice versa.
For finite Abelian group, there is a theorem that assimilates this uncertainty principle, so people refer to this group-theoretic version as “uncertainty principle” too. However, our focus today will be Tao’s uncertainty principle for cyclic group of prime order[2], which is a refinement of the discrete uncertainty principle due to the special structure of the group.

First, let us introduce some notations and redefine the Discrete Fourier transform as Andrew did in Lecture 18. In fact, these materials already appeared in Lecture 8, where we studied the concrete example of the the rank k hypercube group B(k).

Let X be a finite set and denote by L(X)=\{f:X\to \mathbb{C}\} the vector space of all complex -valued functions defined on X. Then \mathrm{dim}\ L(X)=|X|, where |\cdot| denotes cardinality.

For x\in X we denote by \delta_x the Dirac function centered at x, that is , the element \delta_x\in L(X) defined by \delta_x(y)=1 if y=x and \delta_x(y)=0 if y\neq x for all y\in X.

The set \{\delta_x:x\in X\} is a natural basis for L(X) and if f\in L(X) then f=\sum_{x\in X}f(x)\delta_x.

The space L(X) is endowed with the scalar product defined by setting

\langle f_1,f_2\rangle=\sum_{x\in X}f_1(x)\overline{f_2(x)}

for f_1,f_2\in L(X), and we denote by ||f||=\sqrt{\langle f,f\rangle} the norm of f\in L(X). Note that {\delta_x:x\in X} is an orthonormal basis of L(X) with respect to \langle\cdot\rangle.

Let A be a finite Abelian group. A character of A is a map \chi:A\to\mathbb{T} such that

\chi(x+y)=\chi(x)\chi(y)

for all x,y\in A. The set \widehat{A} of all characters of A is an Abelian group with respect to the product (\chi,\psi)\mapsto \chi\cdot\psi defined by (\chi\cdot\psi)(x)=\chi(x)\psi(x) for all x\in A. \widehat{A} is called the dual of A.

For A=\mathbb{Z}_n, where \mathbb{Z}_n is the cyclic group of order n, n\ge 2, n\in \mathbb{N}. Let \omega:=e^{\frac{2\pi i}{n}}. For x\in \mathbb{Z}_n, define \chi_x\in L(\mathbb{Z}_n) by the function z\mapsto \omega^{zx}.

Exercise 1: Check that for x\in \mathbb{Z}_n, \chi_x is indeed a character of \mathbb{Z}_n, and \widehat{\mathbb{Z}_n}=\{\chi_x:x\in \mathbb{Z}_n\} is isomorphic to \mathbb{Z}_n.

Now we define the Discrete Fourier transform. Let A be a finite Abelian group. The Fourier transform of a function f\in L(A) is the function \hat{f}\in L(\widehat{A}) defined by

\widehat{f}(\chi)=\sum_{y\in A}f(y)\overline{\chi(y)}

for all \chi\in \widehat{A}. Then \widehat{f}(\chi) is called the Fourier coefficient of f with respect to \chi. When A=\mathbb{Z}_n and f\in L(\mathbb{Z}_n), we call \frac{1}{n}\hat{f} the Discrete Fourier transform (DFT) of f. Specifically, for x\in\mathbb{Z}_n and \omega=e^{\frac{2\pi i}{n}}, the DFT of f is

\hat{f}(x)=\frac{1}{n}\sum_{y\in\mathbb{Z}_n}f(y)\omega^{-xy}.

The uncertainty principle is the following (see Theorem 2.5.4 of [1]):
Theorem (Uncertainty principle): Let f\in L(A) and suppose that f\neq 0. Then

|\mathrm{supp}(f)|\cdot|\mathrm{supp}\hat{f}|\ge |A|.

Now we come to the main part of this lecture: Tao’s uncertainty principle for cyclic group of prime order. This improves the inequality in the above discrete uncertainty principle when the finite Abelian group A is cyclic with prime order p. To prove the theorem, we present some specific tools developed in Tao’s paper [2].

The main theorem in [2] is the following:
Theorem (Tao’s uncertainty principle): Let p be a prime number and f\in L(\mathbb{Z}_p) non-zero. Then

|\mathrm{supp}(f)|+|\mathrm{supp}(\hat{f})|\ge p+1.

Conversely, if \emptyset\neq A,A'\subset \mathbb{Z}_p are two subsets such that |A|+|A'|=p+1, then there exists f\in L(\mathbb{Z}_p) such that \mathrm{supp}(f)=A and \mathrm{supp}(\hat{f})=A'.

To prove the theorem, we present some specific tools developed in Tao’s paper [2].

Recall that \mathbb{Z}[x] denotes the ring of polynomials with integer coefficients.

Proposition 1:
Let P(x_1,x_2,\cdots,x_n) be a polynomial in the variables x_1,x_2,\cdots,x_n with integer coefficients. Suppose that for some i\neq j,

P(x_1,x_2,\cdots,x_n)|_{x_i=x_j}\equiv 0.

Then there exists a polynomial Q(x_1,x_2,\cdots,x_n) with integer coefficients such that

P(x_1,x_2,\cdots,x_n)=(x_i-x_j)Q(x_1,x_2,\cdots,x_n).


Remark: The proof of this proposition can be omitted for the first reading. Instead, considering P(x_1,x_2)=x_1^2+x_1x_2-2x_2^2 and trying to find corresponding Q(x_1,x_2) by using this proposition would be helpful for understanding.

Proof:
For the sake of simplicity, suppose that i = 1 and j = 2 so that P(x_1,x_2,\cdots,x_n) \equiv 0. Let us denote by P_1(x_1,x_2,\cdots,x_n) (respectively P_2(x_1,x_2,\cdots,x_n)) the sum of the monomials of P(x_1,x_2,\cdots,x_n) with positive (respectively negative) coefficients so that

P(x_1,x_2,\cdots,x_n) = P_1(x_1,x_2,\cdots,x_n) + P_2(x_1,x_2,\cdots,x_n).

Note that

P_1(x_1,x_1,x_3,\cdots,x_n) =- P_2(x_1,x_1,x_3,\cdots,x_n),

since P(x_1 , x_1 ,x_3, \cdots , x_n ) \equiv 0. This implies that there exists a bijection between the monomials in P_1(x_1,x_1,x_3,\cdots,x_n) and those in P_2(x_1,x_1,x_3,\cdots,x_n). More precisely, let us fix m > 0 and k, k_3 , \cdots, k_n \ge 0; then the monomial mx_1^k x_3^{k_3}\cdots x_n^{k_n} appears in P_1(x_1,x_1,x_3,\cdots,x_n) if and only if -mx_1^k x_3^{k_3}\cdots x_n^{k_n} appears in P_2(x_1,x_1,x_3,\cdots,x_n). Suppose this is the case. Then we can find m_0 , m_1 , \cdots , m_k and n_0 , n_1 , \cdots, n_k non-negative integers such that the sum of the monomials of P(x_1,x_1,x_3,\cdots,x_n) whose variables x_i have degree k_i for i = 1, 2,\cdots , n and k_1 + k_2 = k is


\sum_{l=0}^k m_lx_1^{k-l}x_2^lx_3^{k_3}\cdots x_n^{k_n}-\sum_{l=0}^k n_lx_1^{k-l}x_2^lx_3^{k_3}\cdots x_n^{k_n}

and

m_0+m_1+\cdots+m_k=n_0+n_1+\cdots+n_k=m

but also such that

m_l\neq 0\implies n_l=0\quad n_l\neq 0\implies m_l\neq 0

(because otherwise there would be a cancellation). Thus, with every monomial x_1^{k-l}x_2^lx_3^{k_3}\cdots x_n^{k_n} such that m_l\neq 0 we can (arbitrarily but bijectively) associated a monomial x_1^{k-h}x_2^lx_3^{k_3}\cdots x_n^{k_n} with m_h\neq 0 and h\neq l. Now for h>l we have the identity


x_1^{k-l}x_2^l-x_1^{k-h}x_2^h=x_2^lx_1^{k-h}(x_1^{h-l}-x_2^{h-l})

=x_2^lx_1^{k-h}(x_1-x_2)(x_1^{h-l-1}+x_1^{h-l-2}x_2+\cdots+x_2^{h-l-1}).


Exchanging h with l we get the analogous identity for h<l. This shows that

\sum_{l=0}^k m_lx_1^{k-l}x_2^lx_3^{k_3}\cdots x_n^{k_n}-\sum_{l=0}^k n_lx_1^{k-l}x_2^lx_3^{k_3}\cdots x_n^{k_n}

is divisible by x_1-x_2.

Repeating the argument for each monomial mx_1^kx_3^{k_3}\cdots x_n^{k_n} with m>0 and k,k_3,\cdots,k_n\ge 0 appearing in P_1(x_1,\cdots,x_n), we deduce that P(x_1,\cdots,x_n) is divisible by x_1-x_2.

Lemma 1: Let p be a prime, n a positive integer, and P(x_1,x_2,\cdots,x_n) a polynomial with integer coefficients. Suppose that \omega_1,\cdots,\omega_n are (not necessarily distinct) pth roots of unity such that P(\omega_1,\cdots,\omega_n)=0. Then P(1,\cdots,1)=0 is divisible by p.
Proof:
Letting \omega=e^{\frac{2\pi i}{p}} we can find integers 0\le k_j\le p-1 such that \omega_j=\omega^{k_j} for j=1,2,\cdots,n.

Define the polynomials q(x),r(x)\in\mathbb{Z}[x] by settinig

P(x^{k_1},\cdots,x^{k_n})=(x^p-1)q(x)+r(x)

where \mathrm{deg}r<p. Then r(\omega)=0 and since \mathrm{deg}r<p we deduce that r(x) is a multiple of the minimal polynomial of \omega, that is, r(x)=m(1+x+\cdots+x^{p-1}) for some m\in\mathbb{Z}, where we used a fact in undergraduate algebra that the minimal polynomial of \omega is 1+x+\cdots+x^{p-1}. It follows that P(1,\cdots,1)=r(1)=mp.

Theorem 3(Chebotarev):
Let p be a prime and 1\le n\le p. Let \eta_1,\cdots,\eta_n (respectively \xi_1,\cdots,\xi_n) be the distinct elements in \{0,1,\cdots,p-1\}. Then the matrix

A=\left(e^{\frac{2\pi i\eta_h\xi_k}{p}}\right)_{h,k=1}^n

is nonsingular.

Proof: Set \omega_h=e^{\frac{2\pi i\eta_h}{p}} for h=1,2,\cdots,n. Note that the \omega_hs are distinct pth root of unity and A=\left(\omega_h^{\xi_k}\right)_{h,k=1}^n. Define the polynomial D(x_1,\cdots,x_n) with integer coefficients by setting

D(x_1,\cdots,x_n)=\mathrm{det}\left(x_h^{\xi_k}\right)_{h,k=1}^n.

As the determinant is an alternating form, we have D(x_1,\cdots,x_n)|{x_h=x_k}\equiv 0 whenever 1\le h\neq k\le n, so that, by recursively applying Proposition 1, we can find a polynomial Q(x_1,\cdots,x_n) with integer coefficients such that

D(x_1,\cdots,x_n)=Q(x_1,\cdots,x_n)\prod_{1\le h<k\le n}(x_k-x_h).

To prove the theorem, it is equivalent to show that Q(\omega_1,\cdots,\omega_n)\neq 0 (because \omega_hs are all distinct) so that, by Lemma 1 it suffices to show that p does not divide Q(1,\cdots,1). For this we need the next three lemmas to complete this proof. Let us first introduce some notations.

Given an n-tuple \mathbf{k}=(k_1,\cdots,k_n) of non-negative integers, we say that the (monomial) differential operator

L=\mathbf{L_k}=\left(x_1\frac{\partial}{\partial x_1}\right)^{k_1}\cdots\left(x_n\frac{\partial}{\partial x_n}\right)^{k_n}

is of type \mathbf{k} and order o(\mathbf{k})=k_1+\cdots+k_n.

Lemma 2:
Let L be a differential operator of type \mathbf{k} and F(x_1,\cdots,x_n) and G(x_1,\cdots,x_n) two polynomials. Then

L(FG)=\sum_{(\mathbf{i},\mathbf{j})}L_{\mathbf{i}}(F)\cdot L_{\mathbf{j}}(G)

where the sum runs over all pairs (\mathbf{i},\mathbf{j}) such that (componentwise) \mathbf{i}+\mathbf{j}=\mathbf{k} (and therefore o(\mathbf{i})+o(\mathbf{j})=o(\mathbf{k})).
Proof:
Left as as exercise.

Exercise 2: Prove Lemma 2. (Hint: use induction on the order of L.)

Lemma 3:
For 1\le j\le n and 1\le h\le j-1 we have

\left(x_j\frac{\partial}{\partial x_j}\right)^{h}(x_j-x_1)(x_j-x_2)\cdots (x_j-x_{j-1})=\sum_{t=1}^ha_{h,t}x_j^t\sum_{\mathbf{i}_t}\prod_{1\le i\le j-1,i\neq i_1,\cdots,i_t}(x_j-x_i)

where \sum_{\mathbf{i}_t} runs over all \mathbf{i}_t=(i_1,\cdots,i_t) with 1\le i_1<\cdots<i_t\le j-1 and the a_{h,t}=a_{h,t}(j)s are non-negative integers such that a_{h,h}=h!. In particular,

\left(x_j\frac{\partial}{\partial x_j}\right)^{j-1}(x_j-x_1)(x_j-x_2)\cdots (x_j-x_{j-1})=(j-1)!x_j^{j-1}+\text{terms containing at least one factor } (x_j-x_i)

with 1\le i<j.
Proof:
Proof by induction on h=1,\cdots,j-1. For the details see Lemma 2.6.15 of section 2.6 of [1].

Lemma 4:
Let L=L_{(0,1,\cdots,n-1)}, that is,

L=\left(x_1\frac{\partial}{\partial x_1}\right)^{0}\cdots\left(x_n\frac{\partial}{\partial x_n}\right)^{n-1}.

Then if D(x_1,\cdots,x_n) and Q(x_1,\cdots,x_n) are given by

D(x_1,\cdots,x_n)=Q(x_1,\cdots,x_n)\prod_{1\le h<k\le n}(x_k-x_h),

we have

[LD](1,\cdots,1)=\prod_{j=1}^n (j-1)!Q(1,\cdots,1).


Proof:
By Lemma 2 and 3, we have

[LD](1,\cdots,1)\prod_{j=1}^n(j-1)!x_j^{j-1}Q(x_1,\cdots,x_n)+\text{terms containing at least one factor }(x_j-x_i)

with 1\le i<j\le n. Here the first term is obtained by acting the whole L on \prod_{1\le h<k\le n}(x_k-x_h) (the second term of D). In particular, taking x_i=1 for i=1,\cdots,n we complete the proof.

Now we can finish the proof of Theorem 3 with these lemmas.


Proof of Theorem 3 (continued):
For L as in Lemma 4, we have (where S_n denotes the symmetric group of degree n)

[LD](x_1,\cdots,x_n)=L\sum_{\sigma\in S_n}\epsilon(\sigma)x_1^{\xi_{\sigma(1)}}\cdots x_n^{\xi_{\sigma(n)}}

=\sum_{\sigma\in S_n}\epsilon(\sigma)\xi^0_{\sigma(1)}x_1^{\xi_{\sigma(1)}}\cdots \xi_{\sigma(n)}^{n-1}x_n^{\xi_{\sigma(n)}}

since

\left(x_j\frac{\partial}{\partial x_j}\right)^{j-1}x_j^{\xi_{\sigma(j)}}=\xi_{\sigma(j)}^{j-1}x_j^{\xi_{\sigma(j)}}

for all j=1,2,\cdots,n. Thus,

[LD](1,\cdots,1)=\sum_{\sigma\in S_n}\epsilon(\sigma)\xi_{\sigma(1)}^0\cdots\xi_{\sigma(n)}^{n-1}=\prod_{1\le i<j\le n}(\xi_j-\xi_i)

is the Vandermonde determinant. Since \xi_i\neq \xi_j for i<j, we deduce that [LD](1,\cdots,1) is not divisible by p because n\le p. Thus, by Lemma 4, Q(1,\cdots,1) is not divisible by p either. By Lemma 1, this completes the proof of Theorem 3.

Given a non-empty subset A\subset \mathbb{Z}_p and a function f\in L(A), we denote by \overline{f} its extension \overline{f}:\mathbb{Z}_p\to \mathbb{C} defined by setting \overline{f}(z)=0 for all z\in \mathbb{Z}_p\backslash A. For simplicity, we regard the DFT as a map L(\mathbb{Z}_p)\to L(\mathbb{Z}_p). In other words, for f\in L(\mathbb{Z}_p) and x\in \mathbb{Z}_p,

\hat{f}(x)=\frac{1}{p}\sum_{y\in\mathbb{Z}_p}f(y)\omega^{-xy}.

Proposition 2:
Let p be a prime. Let A,B\subset\mathbb{Z}_p such that |A|=|B|. Then the linear map T=T{A,B}:L(A)\to L(B) defined by Tf=\widehat{\overline{f}}|B is invertible.

Proof:

Set A={\xi_1,\cdots,\xi_n} and B={\eta_1,\cdots,\eta_n} and consider the basis of L(A) (respectively, of L(B)) consisting of the Dirac functions \delta_{\xi_j} with j=1,\cdots,n (respectively, \delta_{\eta_k} with k=1,\cdots,n) and let \omega=e^{\frac{2\pi i}{p}}. Then we have

[T\delta_{\xi_k}](\eta_h)=\widehat{\delta_{\xi_k}}(\eta_h)=\sum_{x\in\mathbb{Z}p}\delta_{\xi_k}(x)\omega^{-x\eta_h}=\omega^{-\eta_h\xi_k}.

By virtue of Theorem 3, we have

\mathrm{det}\left([T\delta_{\xi_k}](\eta_h)\right)_{h,k=1}^n\neq 0,

showing that T is indeed invertible.

Finally, we can prove the main result, Theorem 2.

Proof of Theorem 2:
Suppose by contradiction that there exists f\in L(\mathbb{Z}_p) non-zero such that \mathrm{supp}(f)=A and \mathrm{supp}(\hat{f})=C with |A|+|C|\ge p+1. Then we can find a subset B\subset \mathbb{Z}_p such that |B|=|A| and C\cap B=\emptyset. We deduce that Tf=\widehat{\overline{f}}|_B is identicially zero. Since f\not\equiv 0, this contradicts the injectivity of T (c.f. Proposition 2).

Conversely, let \emptyset\neq A,A'\subset\mathbb{Z}p be two subsets such that |A|+|A'|=p+1. Let B\subset\mathbb{Z}_p such that |B|=|A| and B\cap A'={\xi}. Note that (\mathbb{Z}_p\backslash B)\cup{\xi}\supset A'. Taking cardinalities, |A'|=p+1-|A|=p+1-|B|=|(\mathbb{Z}_p\backslash B)\cup{\xi}|\ge |A'|, which yields

(\mathbb{Z}_p\backslash B)\cup{\xi}=A'.

Consider the map T=T_{A,B}:L(A)\to L(B). By Proposition 2, we can find g\in L(A) such that Tg=\delta_\xi|_B so that \widehat{\Bar{g}} vanishes on B\backslash{\xi} but \widehat{\Bar{g}}(\xi)\neq 0. Setting f=\Bar{g}\in L(\mathbb{Z}_p) we clearly have \mathrm{supp}(f)\subset A and \mathrm{supp}(\hat{f})\subset (\mathbb{Z}_p\backslash B)\cup{\xi}. By the first part of the theorem,

p+1\le |\mathrm{supp}(f)|+|\mathrm{supp}(\hat{f})|\le |A|+|\mathbb{Z}_p\backslash B|+1 =|A|+p-|B|+1=p+1

so that all the inequalities are in fact equalities, i.e., \mathrm{supp}(f)=A and \mathrm{supp}(\hat{f})= (\mathbb{Z}_p\backslash B)\cup{\xi}=A'. This completes the second part of the theorem.

An immediate consequence of Theorem 2 is that any sparse polynomial \sum_{j=0}^k c_jz^{n_j} with k+1 non-zero coefficients and 0<n_0<\cdots<n_k<p, when restricted to the pth roots of unity \{z:z^p=1\}, can have at most k zeroes. Indeed, such a polynomial is essentially the Fourier transform in \mathbb{Z}_p of a function whose support has cardinality k+1, and so the support of the polynomial must contain at least p-k pth roots of unity by Theorem 2, and the claim follows.

References:

Math 202C: Lecture 19

Author: Qihao Ye

Andrew Dennehy have already introduced the concept of the discrete Fourier transform (DFT) to us in Lecture 18, but I would like to retake the path with more details, because there are some other concepts (which help for fully understanding) I would like to talk about.

We mainly talk about how fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT) work, we sightly talk about the calculation error.

We will see how FFT and IFFT work from both the perspective of math and computer, along with their applications and some specific problems. Some problem may involve the number theoretic transform (NTT), which is recognized as the integer DFT over finite field.

We use Python3 for code examples, since we would not use version related feature, the specific version does not matter. (For instance, Python3.5.2 and Python3.8.1 are both ok.) We would call Python instead of Python3 in the following content. (If you do not install Python, there are many online interpreters that can run Python codes, for instance, Try It Online.) Recommend to view this page with computer for getting the best experience. Recommend to try the codes and solve some problems by yourself, which definitely would be challenging and interesting.

There are only 2 exercises in this lecture, other problems are interest based algorithm problems (no need to do as homework), which might be difficult to solve if you are not very familiar with the FFT and NTT in this form, but I will give hints to guide you.

Now let us start with an easy example.

Example 1: We consider two polynomials

Multiply them together (using distributive property), we would get

Definition 1 (coefficient representation): For a polynomial P(x) = a_{0} + a_{1} x + \cdots + a_{n} x^{n}, the list [a_{0}, a_{1}, \ldots, a_{n}] is its coefficient representation. We denote P[k] as the coefficient of x^{k} in P(x).

Use definition 1, we can write P_{1}(x) as [1, 2, 3, 4], P_{2}(x) as [2, 3, 4, 5] and P_{1}(x) \cdot P_{2}(x) = Q(x) as [2, 7, 16, 30, 34, 31, 20].

The naïve polynomial multiplication function is not hard to get:

def naive_poly_mul(P1, P2):
    Q = [0] * (len(P1) + len(P2) - 1)
    for i in range(len(P1)):
        for j in range(len(P2)):
            Q[i + j] += P1[i] * P2[j]
    return Q

In the general case, i.e.,

it is easy to see that the complexity of the naïve polynomial multiplication function is \mathcal{O}(m n). Note that \text{len}(P_{1}) = n + 1 (count from 0 to n). If we consider the specific condition m = n, then the complexity becomes \mathcal{O}(n^{2}).

Definition 2 (degree of polynomial): The degree of a polynomial is the highest of the degrees of the polynomial’s monomials (individual terms) with non-zero coefficients.

In Example 1, P_{1}, P_{2} are both 3-degree polynomials.

Definition 3 (value representation): Except representing a n-th degree polynomial with n + 1 coefficients, we can also represent a n-th degree polynomial with proper (see below Exercise) n + 1 points on the polynomial, which is called the value representation.

Exercise 1 : Prove that n + 1 points \{ (x_{i}, y_{i}) \}_{i = 1}^{n + 1} with distinct \{ x_{i} \} determine a unique polynomial of degree n. Hint: use the fact that a Vandermonde matrix is nonsingular without proof.

Back to Example 1, to get Q, we can first get \{ (x_{i}, P_{1}(x_{i}), P_{2}(x_{i})) \}_{i = 1}^{7} from P_{1}, P_{2} with distinct \{ x_{i} \}, then the 7 points \{ (x_{i}, P_{1}(x_{i}) P_{2}(x_{i})) \}_{i = 1}^{7} just represent Q.

When the degree of P_{1}, P_{2} are both n, we can see that the multiplication here just needs complexity \mathcal{O}(n).

However, at most time, we only need the coefficient representation, because it is more suitable to calculate the values in its domain. It is not hard to figure out we can do the multiplication in this way:

If we only consider the situation that we aim to multiply two n-th degree polynomials, the multiplication part only costs \mathcal{O}(n), so the bottleneck of the complexity lies in the algorithm changing the coefficient representation into the value representation and the algorithm changing it back.

The naïve way to get the value representation of a coefficient represented polynomial cost at least \mathcal{O}(n^{2}). (To get each value we need \mathcal{O}(n), and we totally need \mathcal{O}(n) values.)

The essential idea is selecting specific points to reduce the calculation cost.

The straight thought would be looking at the parity of a function. Because for any odd function P_{\text{odd}}, we have P_{\text{odd}}(-x) = - P_{\text{odd}}(x) while for any even function P_{\text{even}}, we have P_{\text{even}}(-x) = P_{\text{even}}(x).

Actually, we can divide any polynomial into one odd function plus one even function. Take a look back to Q(x) in Example 1, we can write

Notice that Q_{\text{odd}}(x^{2}) is actually an even function, but write in this form allows us to take x^{2} as the variable. It follows that

Note that Q_{\text{even}}(x^{2}) has degree 3 and Q_{\text{odd}}(x^{2}) has degree 2 while Q has degree 6. Once we get Q_{\text{even}}(x^{2}) and Q_{\text{odd}}(x^{2}), we would immediately get Q(x) and Q(-x).

What we only need to do is recursive process Q_{\text{even}}(x^{2}) and Q_{\text{odd}}(x^{2}). It would lead to an \mathcal{O}(n \log n) recursive algorithm.

However, we would encounter the problem that the symmetric property could not be maintained further (we can pair x and -x, but how to pair x^{2}).

As we already see in the previous Lecture, we can use the roots of unity to solve this problem. Denote \omega_{n} = \exp(2 \pi i / n). Note that \omega_{n}^{n} = \omega_{n}^{0} = 1.

To make it easier to understand, we now only consider n = 2^{k} for some positive integer k, we would evaluate polynomial at [\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n - 1}], then [\omega_{n}^{0}, \omega_{n}^{2}, \ldots, \omega_{n}^{n - 2}], next [\omega_{n}^{0}, \omega_{n}^{4}, \ldots, \omega_{n}^{n - 4}], etc. Just as the figure below.

Diagram 1

We pair each \omega_{n}^{j} with \omega_{n}^{-j} = \omega_{n}^{j + n / 2}. For any polynomial P, we can get [P(\omega_{n}^{0}), P(\omega_{n}^{1}), \ldots, P(\omega_{n}^{n - 1})] fast by evaluating [P_{\text{even}}(\omega_{n}^{0}), P_{\text{even}}(\omega_{n}^{2}), \ldots, P_{\text{even}}(\omega_{n}^{n - 2})] and [P_{\text{odd}}(\omega_{n}^{0}), P_{\text{odd}}(\omega_{n}^{2}), \ldots, P_{\text{odd}}(\omega_{n}^{n - 2})] recursively. Recall that we divide P(x) as P_{\text{even}}(x^{2}) + x P_{\text{odd}}(x^{2}). The corresponding formula is

This leads to the naive FFT code:

from math import pi
from math import sin
from math import cos

def get_omega(n):
    theta = 2 * pi / n
    omega = cos(theta) + sin(theta) * 1j
    return omega

def naive_FFT(P):
    # P is the coefficient representation
    # P = [a_{0}, a_{1}, ..., a_{n-1}]
    # current we assume n = 2^{k}
    n = len(P)
    half_n = n // 2
    if n == 1:
        return P # constant function
    omega = get_omega(n)
    P_even, P_odd = P[::2], P[1::2]
    V_even, V_odd = naive_FFT(P_even), naive_FFT(P_odd)
    V = [0] * n # the value representation
    for j in range(half_n):
        V[j] = V_even[j] + omega ** j * V_odd[j]
        V[j + half_n] = V_even[j] - omega ** j * V_odd[j]
    return V

Feed P_{1} in example 1 as input, we would get

Now we can apply Fourier transform to a coefficient representation to get the corresponding value representation, and the multiplication in the value representation form is easy to implement, what remains to solve is the inverse Fourier transform.

In the matrix-vector form for P(x) = a_{0} + a_{1} x + \cdots + a_{n - 1} x^{n - 1}, we have

Note that the character table of C_{n} is just as the form of the DFT matrix:

To get the inverse Fourier transform, we can just use the inverse of the matrix:

Exercise 2: Check that the inverse DFT matrix is the inverse of the DFT matrix.

The difference between DFT and FFT is just the way of calculating these matrix-vector forms, where DFT uses the direct matrix-vector multiplication way in \mathcal{O}(n^{2}) and FFT uses the tricky recursive way to achieve \mathcal{O}(n \log n).

The inverse DFT matrix leads to the naive IFFT code:

def naive_IFFT(V, is_outest_layer=False):
    # V is the value representation
    # w means omega_{n}
    # V = [P(w^{0}), P(w^{1}), ..., P(w^{n-1})]
    # current we assume n = 2^{k}
    n = len(V)
    half_n = n // 2
    if n == 1:
        return V # constant function
    omega = 1.0 / get_omega(n) # omega_{n}^{-1}
    V_even, V_odd = V[::2], V[1::2]
    P_even, P_odd = naive_IFFT(V_even), naive_IFFT(V_odd)
    P = [0] * n # the value representation
    for j in range(half_n):
        P[j] = P_even[j] + omega ** j * P_odd[j]
        P[j + half_n] = P_even[j] - omega ** j * P_odd[j]
    if is_outest_layer:
        for j in range(n):
            P[j] /= n
    return P

Use P_{1} in example 1, we would get

If we ignore the little error, this is just the coefficient representation of P_{1}.

The following materials might not be that clear and might not be easy to understand, but I will try my best. (Some material cannot be expanded too much, otherwise that would cost too much space and might confuse the main part.)

The lowest layer of Diagram 1 is just the bit-reversal permutation index and there is a neat code to generate:

def get_BRI(length):
    # Bit-Reversal Index
    n = 1
    k = -1
    while n < length:
        n <<= 1
        k += 1
    BRI = [0] * n
    for i in range(n):
        BRI[i] = (BRI[i >> 1] >> 1) | ((i & 1) << k)
    return BRI

It is more easy to see in a tabular (an example of 8-length BRI)

Use this we can implement the Cooley–Tukey FFT algorithm, which is the most common FFT algorithm. Further, with proper manner of coding, we can devise an in-place algorithm that overwrites its input with its output data using only \mathcal{O}(1) auxiliary storage, which is called the iterative radix-2 FFT algorithm. Moreover, since the form of FFT and IFFT are actually very similar, we can integrate them together.

def FFT(X, length, is_inverse=False):
    # X : input, either coefficient representation
    #            or value representation
    # length : how much values need to evaluate
    # is_inverse : indicate whether is FFT or IFFT
    inverse_mul = [1, -1][is_inverse]
    BRI = get_BRI(length)
    n = len(BRI)
    X += [0] * (n - len(X))
    for index in range(n):
        if index < BRI[index]: # only change once
            X[index], X[BRI[index]] = X[BRI[index]], X[index]
    bits = 1
    while bits < n:
        omega_base = cos(pi/bits) + inverse_mul * sin(pi/bits) * 1j
        j = 0
        while j < n:
            omega = 1
            for k in range(bits):
                even_part = X[j + k]
                odd_part = X[j + k + bits] * omega
                X[j + k] = even_part + odd_part
                X[j + k + bits] = even_part - odd_part
                omega *= omega_base
            j += bits << 1
        bits <<= 1
    if is_inverse:
        for index in range(length):
            X[index] = X[index].real / n
            # only the real part is needed
    return X[:length]

Note that we could ignore the return part, since X is already changed. This algorithm would extend the input length to its closest larger bit number (of form 2^{k}), but under most condition, we would take the length as 2^{k} before we use this algorithm (adding 0‘s).

Because we use the complex number to implement the FFT algorithm, we can see that the error is hard to eliminate. Even though the initial polynomial is integer based, apply FFT to it, then apply IFFT, we would get a decimal list with some calculation error.

If we do the FFT in the field \mathbb{Z} / P \mathbb{Z}, where P = Q \cdot 2^{k} + 1, denote g as a primitive root modulo P, then \{ 1, g^{Q}, g^{2 Q}, \ldots \} is a cyclic group of order 2^{k}, we can replace \omega_{n} with g to do the FFT. This method is called NTT, since only integers are involved, the errors are not possible to appear.

For arbitrary modulo m, the aiming NTT length n, we can take a set of distinct NTT modulo \{ p_{i} \}_{i = 1}^{r} satisfies

do NTT respectively on all p_{i}, then use the Chinese remainder theorem to combine them together getting the final result modulo m.

Note that during the NTT algorithm, the maximum intermediate value would not exceed n (m - 1)^{2}.

We may say the FFT algorithm solves the convolution in the form of

in time \mathcal{O}(n \log n).

Back in Example 1, we have

(Not mandatory) Problem 1: Give the formula of Stirling numbers of the second kind:

Use the NTT with some modulo of the form P = Q \cdot 2^{k} + 1 to calculate all S(n, k) for 0 \leq k \leq n in time complexity \mathcal{O}(n \log n).

(Not mandatory) Problem 2: PE 537. Hint: If denote A as the list of the value of \pi(x), i.e., A[n] = \sum_{x} [\pi(x) = n], then the convolution of A and A, named B, is the list of the value of \pi(x) + \pi(y), i.e., B[n] = \sum_{x, y} [\pi(x) + \pi(y) = n], then the convolution of B and B, named C, is the list of the value of \pi(x) + \pi(y) + \pi(z) + \pi(w), i.e., C[n] = \sum_{x, y, z, w} [\pi(x) + \pi(y) + \pi(z) + \pi(w) = n], etc. You can consider A, B, C as the generating function. You might need to learn how to sieve prime numbers and use fast exponentiation.

There are bunch of similar algorithms, for example, fast Walsh–Hadamard transform (FWHT) and fast wavelet transform (FWT). FWHT can solve the general convolution

in time complexity \mathcal{O}(n \log n), where \star is some binary operation, usually \star is bitwise OR, bitwise AND, and bitwise XOR.

Using FFT, we can do a lot of things for polynomials fast, for instance, for a polynomial P(x), we can find a polynomial Q(x), such that P(x) Q(x) \equiv 1 \mod{x^{n}}, this Q(x) is called the inverse of P(x) under modulo x^{n}. The basic idea is to find the inverse polynomial under modulo x^{n / 2}, then x^{n / 4}, etc. Because the inverse polynomial under modulo x^{1} is trivial, we can solve this recursively. Similar idea may be apply to the Newton’s method under modulo x^{n}, specifically, we can find the square root of a polynomial under modulo x^{n}.

(Not mandatory) Problem 3: PE 258. Hint: Consider the Cayley–Hamilton theorem (x^{2000} - x - 1 = 0) and use the polynomial inverse to do polynomial quotient on x^{10^{18}}. Or consider solving the homogeneous linear recurrence with constant coefficients by the Berlekamp–Massey algorithm, which involves polynomial multiplication. Note that 20092010 = 8590 \times 2339.

FFT is also used to transform the time domain to the frequency domain in the signal area, while IFFT is used to transform reverse.

In the artical [Machine Learning from a Continuous Viewpoint I] by Weinan E, the Fourier representation

can be considered as a two-layer neural network model with activation function \sigma.

In the above figure, \sigma(\boldsymbol{\omega}, \boldsymbol{x}) calculates each hidden layer value using the input layer \boldsymbol{x} with weight \boldsymbol{\omega}, \int_{\mathbb{R}^{d}} a(\boldsymbol{\omega}) \sigma(\boldsymbol{\omega}, \boldsymbol{x}) d \boldsymbol{\omega} sums all the hidden layer with weight a(\boldsymbol{\omega}). Note this is an integral formula, we work on a continuous condition, which means that the hidden layer has infinite width (the hidden layer is considered to have infinite nodes).

References

Math 202C: Lecture 18

Written by a countably infinite number of monkeys with an uncountably infinite number of typewriters, edited by Andrew Dennehy

In a bold twist on the standard formula, in this lecture post we will be discussing math. Specifically, we will be discussing the Cohn-Uman algorithm and how representation theory can provide us an algorithm to multiply two matrices together. We will then see how this perspective could potentially help us prove an important theorem in complexity theory, namely that there exists an algorithm which computes the product of two n\times n matrices in O(n^{2+\epsilon}) time for any \epsilon>0. Before we get to that, though, we will first cover an algorithm with many similarities to the one we will be discussing.

Consider the problem of multiplying two polynomials

p(x)=p_0+p_1x+\cdots +p_{n-1}x^{n-1}

q(x)=q_0+q_1x+\cdots +q_{n-1}x^{n-1}

The naïve approach would be the one that’s taught in high school, namely finding all the a_ib_j terms and using them to calculate the coefficients of p(x)q(x). However, we’re significantly cooler than high school students, so we’re going to use some significantly cooler math. First, we introduce the concept of the discrete Fourier transform. Let \omega_n=e^{i \frac{2\pi}{n}} be the n-th root of unity. The DFT of a vector v given by

\hat{v} := \begin{bmatrix}1 & 1 & 1 & \cdots & 1 \\  1 & \omega_n &  \omega_n^2 & \cdots &  \omega_n^{n-1} \\ \vdots & \vdots  & \vdots  & \ddots  & \vdots \\  1 & \omega_n^{n-1} &  \omega_n^{2(n-1)} & \cdots &  \omega_n^{(n-1)^2}  \end{bmatrix} v

Equivalently, if v_i is the i-th component of v,

\hat{v}_i := v_0+v_1  \omega_n^{k} + \cdots + v_n\omega_n^{nk}

So, if we apply this transformation to the vector of coefficients of p, we get that

\hat{p}_k= p_0+p_1  \omega_n^{k} + \cdots + p_n\omega_n^{nk} =p(\omega_n^k)

So, the problem of finding the Fourier coefficients of p is equivalent to the problem of evaluating p at 1,\omega_n,\omega_n^2,...,\omega_n^{n-1}. From now on, assume n is a power of 2, since if it isn’t we can just add a bunch of 0 coefficients until we have a power of 2. Define the following two polynomials:

p_0(x)=p_0+p_2x + \cdots + p_{n-2} x^{n/2-1}

p_1(x)=p_1+p_3x + \cdots + p_{n-1} x^{n/2-1}

This lets us decompose p as

p(x)=p_0(x^2)+xp_1(x^2)

So, instead of evaluating p at all those powers of \omega_n, we can instead evaluate p_0 and p_1 at 1,\omega_n^2, ..., \omega_n^{2(n-1)} and then combine the results. So, we have reduced the problem of plugging n numbers into a degree n polynomial into plugging n/2 numbers (specifically the n/2-th roots of unity) into two degree n/2 polynomials. Because we’ve assumed that n is a power of 2, so is n/2, meaning we can repeat the same algorithm for p_0 and p_1. We can apply this process repeatedly, which gives us a recursive algorithm with complexity O(n \log n) for calculating the coefficients. Similarly, one can show that we can invert the DFT by applying the same process to the polynomial

\hat p (x)=\hat{p}_0+\hat{p}_1 x + \cdots + \hat{p}_n x^{n}

with \omega_n replaced with \omega_n^{-1}=\omega_n^{n-1} and then dividing the result by n. The upshot of all of this work is that

\widehat{p q}_k = [pq](\omega_n^k) = p(\omega_n^k) q(\omega_n^k)=\hat p_k \hat q_k

So, we can calculate the Fourier coefficients of pq by calculating the coefficients of each polynomial independently and multiplying them together. Since this has linear complexity and both the Fourier and Inverse Fourier transforms have n\log n complexity, this method gives us an O(n\log n) complexity algorithm to multiply two degree n polynomials.

Notice that instead of using the roots of unity, we could have instead used a generator for C_n and worked over \mathbb C [C_n] with identical results. So, we can frame multiplication of polynomials as a special case of of multiplying elements in \mathbb C [C_n].

Next, we take a slight detour to discuss some group theoretic concepts. For any subset S of a group G (not required to be a subgroup), let the right quotient set Q(S) of S be defined to be

Q(S)=\left\{s_1s_2^{-1}:s_1,s_2\in S\right\}

Exercise 1: Assume that S\neq \emptyset and Q(S)=S. What does this imply about S? Show that the condition that this implies is actually equivalent to Q(S)=S, so the implication goes both ways.

Definition 1: A group G realizes \langle n_1,n_2,n_3 \rangle if there are subsets S_1, S_2, S_3\subseteq G such that |S_i|=n_i and for q_i\in Q(S_i), if

q_1q_2q_3=1

then q_1=q_2=q_3=1. This condition is called the triple product property. Furthermore, this is equivalent to the condition that if q_1q_2=q_3, then q_1=q_2=q_3=1 .

Although it isn’t entirely trivial, one can show that if G realizes \langle n_1,n_2,n_3 \rangle, then it realizes any permutation of those numbers.

Exercise 2: Prove that C_n\times C_m\times C_p realizes \langle n,m,p \rangle.

This is the “trivial realization” of \langle n,m,p \rangle. We now get our first taste of why these ideas could be useful for the problem of matrix multiplication:

Theorem 1: Let F be any field. If G realizes \langle n,m,p \rangle, then the number of field operations required to multiply an n\times m matrix by an m\times p matrix over F is at most the number of operations required to multiply two elements of F[G]

Proof: Let G realize \langle n,m,p \rangle through the subsets S,T,U. Then, for matrices A and B of sizes n\times m and m\times p respectively, we can index the rows of A by S, the columns of A and the rows of B by T, and the columns of B by U. Then, consider the product

\left(\sum\limits_{s\in S,t\in T} A_{st}s ^{-1}  t\right) \left(\sum \limits _{t'\in T, u\in U} B_{t'u} t'^{-1}u\right)

By the triple product property, we have that

(s^{-1}  t)(t'^{-1} u)=s'^{-1} u'\iff (s' s^{-1}) (tt'^{-1} )(uu'^{-1})=1

which is true iff s=s', t=t', and u=u'. So, the coefficient of s^{-1} u in the product is

\sum\limits_{t\in T}A_{st}B_{tu}=(AB)_{su}

So, you can just read off the elements of AB from these coefficients. \square

The process we just described, of converting matrix multiplication to multiplication in F[G], is called the Cohn Uman algorithm. Also, we can now see the connection to the polynomial algorithm we described earlier; we’ve converted a problem of multiplying matrices to a problem of multiplying elements of \mathbb C[G], which is what we did with polynomials before using \mathbb C[C_n].

We can now define a measurement of the “quality” of embedding that a given group G can provide.

Definition 2: The pseudo-exponent \alpha(g) of a non-trivial finite group G is the minimum of

\frac{3\log |G|}{\log nmp}

over all n,m,p (not all 1) such that G realizes \langle n,m,p \rangle.

Lemma 1: For any group G, 2< \alpha(G)\leq 3. Further, if G is abelian, then \alpha(G)=3

Proof: \alpha(G) is clearly at most 3, since we can just take S_1=S_2=\{1\} and S_3=G, meaning G realizes \langle 1,1,|G|\rangle.

To show our lower bound, suppose G realizes \langle n_1,n_2,n_3\rangle (not all equal to 1) through the subsets S_1,S_2,S_3. Then, the triple product tells us that the map (x,y)\mapsto x^{-1}y is injective on S_1\times S_2 and the image only intersects Q(S_3) in the identity. So, |G|\geq n_1n_2 with equality only occurring if n_3=1. The same argument shows this is true for n_2n_3 and n_3n_1, so |G|^3>(n_1n_2n_3)^2. Therefore,

\frac{3 \log |G|}{\log nmp}> \frac{3 \log (nmp)^{2/3}}{\log nmp}=\frac{3 \cdot 2}{3}=2

So, \alpha(G)>2. Finally, if G is abelian, then the product map from S_1\times S_2 \times S_3 to G must be injective (because the triple product property implies the kernel of the map is trivial), so |G|\geq n_1n_2n_3, meaning \alpha(G)\geq 3. \square

Next, we want to relate \alpha(G) to \omega, the exponent of complexity for matrix multiplication, i.e. the smallest number such that we can perform matrix multiplication in O(n^{\omega +\epsilon}) time for any \epsilon>0. To do this, we first recall from 202B that

\mathbb{C}[G]\cong \mathrm{End} \mathbb{C}[G]\cong \mathcal{M}_{d_1\times d_1} \oplus \cdots \oplus \mathcal{M}_{d_k\times d_k}

where the d_i‘s are the dimensions of the irreducible representations of G. Then the most important theorem we will discuss is as follows:

Theorem 2: Suppose that G is a group with character degrees d_1,...,d_k. Then,

|G|^{\omega/\alpha(G)}\leq \sum\limits_{i=1}^k d_i^{\omega}

Intuitively, the idea of this theorem is that the problem of multiplying matrices of size |G|^{1/\alpha(G)} reduces to multiplication of two elements in \mathbb{C}[G] together, which itself reduces to a problem of multiplying a collection of matrices of size d_i\times d_i, each of which should take approximately d_i^\omega operations. So, multiplying two matrices of size |G|^{1/\alpha(G)} has complexity of order at most \sum_{i=1}^k d_i^{\omega}, which means that this sum is an upper bound for |G|^{\omega/\alpha(G)}. Actually proving the theorem is more difficult, so we won’t show it here, but this intuition is the most important part.

Importantly, notice that if \alpha(G) were equal to 2, then this would imply that \omega =2 (because \sum_i d_i^2 =|G|). However, \alpha(G)>2, so we need to control the character degrees of G to get non-trivial bounds on \omega. Define \gamma(G) to be the number such that |G|^{1/\gamma} is the maximum character degree of G. An important corollary of Theorem 2 is:

Corollary 1: Let G be a finite group. If \alpha(G)<\gamma(G), then

\omega\leq \alpha(G)\left(\frac{\gamma(G)-2}{\gamma(G)-\alpha(G)} \right)

Proof: Let \{d_i\} be the character degrees of G. Then, by theorem 4.1,

|G|^{\omega/\alpha(G)}\leq \sum_i d_i^{\omega-2}d_i^2\leq |G|^{(\omega-2)/\gamma(G)}\sum_i d_i^2\leq |G|^{1+(\omega-2)/\gamma(G)}

This implies \omega\left(\frac{1}{\alpha(G)}-\frac{1}{\gamma(G)}\right)\leq 1-\frac{2}{\gamma(G)}. Dividing by \frac{1}{\alpha(G)}-\frac{1}{\gamma(G)} (which is positive by assumption) gives us our result. \square

A special case of this corollary is given by

Corollary 2: If there exists a sequence G_1,G_2,... of finite groups such that \alpha(G_i)=2+o(1) as i\rightarrow \infty and \alpha(G_i)-2=o(\gamma(G_i)-2), then the exponent of matrix multiplication is 2.

While these are weaker versions of our original theorem, they only require knowledge of \gamma(G), rather than all of the character degrees. Finally, we end with an open problem in representation theory whose positive answer could lead to a solution to our original problem:

Open Problem: For integers n,m,p not all 1, does there exist a finite group that realizes \langle n,m,p\rangle and has character degrees d_1,...,d_k such that

nmp>\sum_{i}d_i^3

While this is not a necessary condition, a positive answer to this could lead to a proof that \omega=2 when combined with our theorem. This is because one limitation of our theorem is that unless we can rule out the case that |G|^{3/\alpha}\leq \sum_{i}d_i^3 in theorem 4.1, that theorem is satisfied trivially. So, for us to be able to show that a group gives us a non-trivial improvement, we have to show that for any n,m,p\in \mathbb N, there exists a group G realizing \langle n,m,p \rangle such that |G|^{3/\alpha}> \sum\limits_{i}d_i^3, which the open problem implies.

While it is an open problem whether these groups exist in general, there are a few concrete examples where such a group exists, such as the one that follows:

Let H=C_2\times C_2 \times C_2 and let G=H^2 \rtimes C_2, where the group action of C_2 on H^2 is given by switching the two factors. If we let z be the generator of C_2, we can write the elements of G as (a,b)z^i with a,b\in H and i\in \{0,1\}. Let H_1,H_2,H_3 be the three copies of C_n in H, viewed as subgroups, and let H_4:=H_1 for notation. Then, we can define S_1,S_2,S_3 as

S_i=\{(a,b)z^j:a\in H_i\setminus \{1\},b\in H_{i+1},j\in \{0,1\}\}

One can show, although the proof is long, that these sets satisfy the triple product property. Furthermore, the character degrees of G are all at most 2, since H^2 is an abelian subgroup of index 2 in G and the maximum character degree of a group is at most the index of any abelian subgroup in the group. Since the sum of the squares of the character degrees is |G|, the sum of the cubes of the degrees is at most 2|G|=4n^6. However, |S_i|=2n(n-1), meaning that for n\geq 5,

|S_1| |S_2| |S_3| =8n^3(n-1)^3\geq 4n^6= 2|G| \geq \sum\limits_{i}d_i^3

Math202C: Lecture 17 – Expander Graphs

Author: Haixiao Wang

As demonstrated in [1], expander graphs are a sparse graph that mimic the structure properties of complete graphs, quantified by vertex, edge or spectral expansion. At the same time, expander constructions have spawned research in pure and applied mathematics, with several applications to complexity theory, design of robust computer networks, and the theory of error-correcting codes.

This lecture is organized as follows. We introduce two different definitions of expanders in the first section, and then explore the relationship between them in second part.

17.1. Two Definitions


We start our discussion in expansion with two definitions of expansion: combinatorial and spectral. In the rest of this post, we consider an undirected graph G = (V, E) with |V| = n, where self loops and multiple edges are allowed. Given two subsets S,T \subset V, the set of edges between S and T is denoted by E(S, T) = \{(u,v ): (u, v) \in E, u\in S, v\in T\}.

17.1.1. Combinatorial expansion


Intuitively, a graph G=(V,E) is an expander if subsets S \subseteq V expand outwards into V \setminus S. In other words, all small subsets will have large boundary, respectively. Follow this intuition, we introduce the definition of boundary, based on edges and vertices respectively, to quantify the corresponding expansion.

Definition 1(Boundary):

  • The Edge Boundary of a set S, denoted by \partial S, is the set of edges \partial S = E(S, V \setminus S), which is the set of edges between disjoint set S and V \setminus S.
  • The Vertex Boundary of a set S is a subset of vertices in V \setminus S adjacent to vertices in S, denoted as N(S).

Obviously, vertex boundary |N(S)| is the same as the number of neighboring vertices of vertex sets S. With those notations, we then define the expansion over the entire graph G via maximizing over subsets S.

Definition 2(Expander): For some \varepsilon \in [0, 1], a graph G = (V, E) with |V| = n is an

  • \varepsilon-edge-expander if \min\limits_{ S\subset V, |S| \leq \frac{n}{2} } \frac{|\partial S|}{|E(S, V)|} \geq \varepsilon.
  • \varepsilon-vertex-expander if \min\limits_{ S\subset V, |S| \leq \frac{n}{2} } \frac{|N(S)|}{|S|} \geq \varepsilon.

Remark 3: The reason for the restriction |S| \leq n/2 is that we are examining the relative size of every cut in the graph, i.e., the number of edges crossing between S and V \setminus S, and we examine each case only once.

Problem 1: Show that the complete graph K_{n} is

  • at least a 1/2-edge-expander.
  • a 1-vertex-expander. (We only consider \varepsilon \in [0, 1] here.)

From the definition above, the maximum \varepsilon, which makes G = (V, E) an \varepsilon-edge-expander, is an important criteria, leading to the definition of Cheeger constant.

Definition 4(Cheeger Constant): The Edge Expansion Ratio, or Cheeger constant, of graph G = (V, E) with |V| = n, denoted by h(G), is defined as

h(G) = \min\limits_{S\subset V, |S| \leq \frac{n}{2}}\frac{|\partial S|}{|E(S, V)|}

Remark 5: Actually, Cheeger constant can be defined alternatively as \min\limits_{S\subset V, |S| \leq n/2}\frac{|\partial S|}{|S|} = O(n). We didn’t follow this definition since we want to keep \varepsilon\in [0, 1].

17.1.2. Spectral Expansion

To simplify our discussion, we focus on regular graphs in the rest of this lecture. Let A = A(G)\in \mathbb{R}^{n \times n} denote the adjacency Matrix of a d-regular graph G = (V, E) with |V| = n and A being real symmetric, then A has n real eigenvalues, denoted by \lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_n, and corresponding orthonormal eigenvectors v_1, \dots, v_n with A v_i = \lambda_i v_i. The spectrum of A encodes a lot of information about the graph structure. Recall the following properties, which we have seen in previous lectures.

  • The largest eigenvalue is \lambda_1 = d and the corresponding eigenvector is v_1 = (1/\sqrt{n}, \dots, 1/\sqrt{n}).
  • The graph G is connected if and only if \lambda_1 > \lambda_2.
  • The graph is bipartite if and only if \lambda_1 = -\lambda_n.

The definition of spectral expansion and spectral gap then follows.

Definition 6(Spectral Expansion): A graph G = (V, E) with |V| = n is a \lambda-spectral-expander if \lambda(G):= \max \{ |\lambda_2|, |\lambda_n| \} \leq \lambda.

In other words, the second largest eigenvalue of A(G), in absolute value, is at most \lambda. This also gives rise to the definition of Ramanujan graph.

Definition 7(Ramanujan graph): A connected d-regular graph G is a Ramanujan graph if \lambda (G)\leq 2\sqrt{d-1}.

We now refer to the spectral gap.

Definition 8(Spectral Gap): The spectral gap of d-regular graph G is \lambda_{1} - \lambda(G).

Remark 9: Sometimes the spectral gap is defined as \lambda_1 - \lambda_2, which is often referred as one-sided.

Problem 2:

  • Show that the complete graph K_{n} has spectral gap n - 2. Actually, we can also show that the one-sided spectral gap is n.
  • Show that the complete bipartite graph K_{m, n} has spectral gap 0. (Hint: use the spectral property above.)

17.2. Relating Two Definitions

It is not so obvious that the combinatorial and spectral versions of expansion are related. In this section, we will introduce two relations between edge and spectral expansion: the Expander-Mixing Lemma and Cheeger’s inequality.

17.2.1. The Expander-Mixing Lemma

Lemma 10(Expander-Mixing Lemma[2]): Let G = (V, E) be a d-regular graph with n vertices and eigenvalues \lambda_1 \geq \dots \geq \lambda_n. Then for all S, T \subset V:

\Bigg| |E(S, T)| - \frac{d|S|\cdot |T|}{n}\Bigg| \leq \lambda(G) \sqrt{|S|\cdot |T|}.

As we can see, the left-hand side measures the deviation between two quantities:

  • |E(S, T)|: the number of edges between the two subsets S and T.
  • |S|\cdot |T|\cdot d/n: the expected number of edges between two subsets S and T, drawn from an Erdos–Renyi random graph G(n, d/n).

The Expander-Mixing lemma tell us a small \lambda(G) (a.k.a. large spectral gap) implies that this discrepancy is small, leading to the fact that the graph is nearly random in this sense, as discussed in [1]. We should mention that the converse of expander-mixing is true as well, stated as follows.

Lemma 11(Expander-Mixing Lemma Converse [4]): Let G = (V, E) be a d-regular graph with n vertices and eigenvalues \lambda_1 \geq \dots \geq \lambda_n satisfying:

\Bigg| |E(S, T)| - \frac{d|S|\cdot |T|}{n}\Bigg| \leq \rho \sqrt{|S|\cdot |T|}.

for any two disjoint subsets S,T\subset V and some positive \rho. Then \lambda(G) \leq O(\rho \cdot(1 + \log(d/\rho))).

Proof of Expander-Mixing Lemma: Recall that the adjacency matrix A has eigenvalues \lambda_1 \geq \dots \geq \lambda_n with corresponding orthonormal eigenvectors v_1, \dots, v_n. Let 1_S and 1_T be the characteristic vectors of subsets S and T(a.k.a., the v-th coordinate of 1_S is 1 if v\in S and 0 otherwise). We then decompose 1_S and 1_T in the orthonormal basis of eigenvectors v_1, \dots, v_n as

1_S = \sum\limits_{i=1}^n \alpha_i v_i, \quad 1_S = \sum\limits_{j=1}^n\beta_j v_j.

Observing that |E(S, T)| = (1_S)^T A 1_T, we expand out the right hand side and apply orthogonality, which yields

(1_S)^T A 1_T = \left (\sum\limits_{i=1}^n \alpha_i v_i \right)^T A \left (\sum\limits_{j=1}^n\beta_j v_j \right) = \sum\limits_{i=1}^n \lambda_i\alpha_i\beta_i = d\frac{|S|\cdot|T|}{n} + \sum\limits_{i=2}^n \lambda_i\alpha_i\beta_i .

where the last step follows from the facts \lambda_1 = d with corresponding v_1 = (1/\sqrt{n},\dots, 1/\sqrt{n}), \alpha_1 =<1_S, v_1> = |S|/\sqrt{n} and \beta_1 =<1_T, v_1> = |T|/\sqrt{n}. Shifting the first term on right hand side to the left, the desired result follows from Cauchy-Schwarz inequality:

\left| |E(S, T)| - d\frac{|S||T|}{n} \right | = \left|\sum\limits_{i=2}^n\lambda_i\alpha_i\beta_i\right| \leq \lambda(G) \sum\limits_{i=2}^n |\alpha_i \beta_i| \leq \lambda(G) ||\alpha||_2 ||\beta ||_2 = \lambda(G) ||1_S||_2  ||1_T||_2 = \lambda(G)\sqrt{|S||T|}.

Remark 12: If we normalize the adjacency matrix A = A(G) and obtain \bar{A} = (1/d) A, where each entry of A is divided by d, then the corresponding eigenvalues of \bar{A}, namely \bar{\lambda}_{1}\geq \dots \geq \bar{\lambda}_n, lie in the interval [-1, 1]. The Expander-Mixing Lemma is in the form of

\Bigg| |E(S, T)| - \frac{d|S|\cdot |T|}{n}\Bigg| \leq \bar{\lambda}(G)d \sqrt{|S|\cdot |T|}.

It is sometimes convenient to consider the normalized spectrum. For example in random graph theory, it would be convenient to look at the normalized spectrum when the expected degree d = O(\log n). However, we can focus on the spectrum of $\latex A = A(G)$ without normalization when the expected degree $\latex d = O(1)$.

Regular \lambda-spectral expanders with small \lambda(G)(a.k.a. large spectral gap) have some of significant properties, listed as follows.

Corollary 13: An independent set , in a graph G = (V, E) with |V| = n, is a set of vertices S, where no two vertices in S are adjacent, i.e. with |E(S, S)| = 0. An independent set S, in a d-regular \alpha-spectral expander G, has cardinality at most \alpha n, i.e., |S|\leq \alpha n, which follows immediately from Lemma 10(Expander-Mixing Lemma[2]).

Corollary 14: Given a graph G = (V, E) with |V| = n, the diameter of G is defined as \mathrm{diam}(G) = \max\limits_{u, v}d_{G}(u, v), where d_{G}(u, v) is the length of shortest path between u and v. If G is a d-regular \alpha-spectral expander G with \alpha < d/2, then we have

\Omega \Big( \frac{\log n}{\log d} \Big) \leq \mathrm{diam}(G) \leq O(\log n).

17.2.2. Cheeger’s Inequality

As we have seen from previous discussion, spectral expansion implies strong structural properties on the edge-structure of graphs. In this section, we introduce another famous result, Cheeger’s inequality, indicating that the Cheeger constant(Edge Expansion Ratio) of a graph G can be approximated by the its spectral gap \lambda_1 - \lambda(G).[6]

Theorem 15(Cheeger’s inequality, [3]): Let G = (V, E) be a d-regular graph with eigenvalues \lambda_1\geq \cdots \geq \lambda_n, then

\frac{d - \lambda_2}{2} \leq \min\limits_{S\subset V, |S|\leq n/2} \frac{|\partial S|}{|S|} \leq \sqrt{2d(d - \lambda_2)}.

In the normalized scenario \bar{A} = (1/d)A(G) with eigenvalues \bar{\lambda}_{1}, \cdots, \bar{\lambda}_n \in [-1, 1], we have

\frac{1 - \bar{\lambda}_2}{2} \leq h(G):=\min\limits_{S\subset V, |S|\leq n/2} \frac{|\partial S|}{|E(S, V)|} \leq \sqrt{2(1 - \bar{\lambda}_2)}.

Remark 16: The proof for the normalized scenario is different from the regular scenario. We should start from the desired quantity and finish the calculation step by step. However, the strategy are the same. We use the Rayleigh Quotient to obtain upper bound and lower bound.

In the context of n\gg d, the estimate of spectral gap was given by Nilli Alon.

Theorem 17([5]): For every d-regular graph G, we have

\lambda(G) \geq 2\sqrt{d - 1} - o_{n}(1)

The o_{n}(1) term is a quantity that tends to zero for every fixed d as n \to \infty.

[1]. Shlomo Horry, N. L. and Wigderson, A. (2006). Expander graphs andtheir applications.BULLETIN (New Series) OF THE AMERICAN MATHEMATICAL SOCIETY, 43(4).

[2]. Alon, N. and Chung, F. R. K. (1988). Explicit construction of linear sized tolerantnetworks.Discrete Mathematics, 42

[3]. Cheeger, J. (1970). A lower bound for the smallest eigenvalue of the laplacian.Problems inanalysis, 26.[Nilli, 1991] Nilli, A. (1991). On the secon

[4]. Yonatan Bilu, N. L. (2006). Lifts, discrepancy and nearly optimal spectral gap.Com-binatorica, 26

[5]. Nilli, A. (1991). On the second eigenvalue of a graph.Discrete Mathematics, 91(2):207–210.

[6]. Lovett, S. (2021). Lecture notes for expander graphs and high-dimensional expanders.

Lecture 16 – PageRank

Author: Jiyoung Choi

The PageRank is a search algorithm based on the linkage structure of the web, which Larry Page and Sergey Brin introduced. They built the search engine Google to test the utility of PageRank for search. (Origin of the name “Google”)

The goal of this lecture is to construct Google matrix and explore how to find PageRank vector. Let’s start by making an adjacency matrix based on web linkage.

Let u be a webpage, and let F_u be the set of pages u points to and B_u be the set of pages that point to u. Consider each webpage as vertices and link as directed edge. We can construct an adjacency matrix A_{uv} = 1 if there exist links from v to u, otherwise A_{uv}=0 where u, v are webpages. For example, if webpages have linkage in Figure 1, A is defined as follows.

Figure 1. A simple example consists of 5 webpages


A= \begin{bmatrix} 0 & 1 & 1 & 1 & 0\\ 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\end{bmatrix}

We can observe that the last column of A is a zero vector because v_5 does not have forward links. We call such a node a dangling node.

Let N_u = |F_u| be the number of links from u. Then column normalized adjacency matrix R of A is defined by R(u) = {A(u) \over N_u}. R(u) denotes column corresponding page u. In our example,


R= \begin{bmatrix} 0 & 1 & 1 \over 2& 1 \over 3& 0\\1 \over 2 & 0 & 0 & 1 \over 3& 0\\0 & 0 & 0 & 1\over 3 & 0\\1\over 2 & 0 & 0 & 0 & 0\\0 & 0 & 1 \over 2& 0 & 0\end{bmatrix}.


This matrix R is not directly used in the PageRank. To compute PageRank vector, we will use the power method, an iterative procedure, has convergence problem. The procedure can cycle, or the limit may be dependent on the starting vector. To avoid this problem, we need to have a left stochastic matrix.

Like in our example, it can be possible that R is not stochastic because of dangling nodes. Therefore, define stochastic matrix R' = R + {\mathbf{e}\mathbf{a}^T \over n}. where \mathbf{e} is a vector of all ones, n is the number of web pages, and \mathrm{a}(u)=1 if R(u) =\mathbf{0}, otherwise \mathrm{a}(u)=0.

When you do web surfing, though there are no forward links (dangling nodes), we may visit other new web pages. The {\mathbf{e}\mathbf{a}^T \over n} matrix add the situation with uniform probability for all web pages.

Problem 1: Find R' for our example.

Then by the Gershgorin circle theorem, the largest eigenvalue of R' is less than or equal to 1. Since det(S-I)=0, we know there is the largest eigenvalue \lambda_n=1. We want to find the dominant eigenvector of R', which is PageRank vector. For finding eigenvector, PageRank uses Power Method.

Some of you may already realize, finding PageRank vector is finding stationary distribution (See Lecture 5, Lecture 6) of random walks on the web graph. However, if the stationary distribution vector is not unique, we cannot get the exact PageRank vector. This problem can be solved by Perron–Frobenius theorem. If R' is irreducible, eigenvector corresponding eigenvalue 1 is unique by Perron-Frobenius.

Definition 1: A matrix A is primitive if it is non-negative and its kth power is positive for some natural number k.

Definition 2: An n \times n matrix A is a reducible matrix if for some permutation matrix P, such that P^TAP is block upper triangular. If a square matrix is not reducible, it is said to be an irreducible matrix.

For more knowledge related to primitive and irreducible, see [6].

Problem 2: Every primitive matrix is irreducible.

We can construct a primitive matrix G = mR' + (1-m)E where 0 \le m \le 1 and E = {\mathbf{e}\mathbf{e}^T \over n}. In reality, Google use m=0.85 ([1], [7]). Since G is primitive, so it is irreducible, we can get the unique PageRank vector by Power Method. We call this matrix G as Google matrix. Such m is called damping constant, this reflects a person who is randomly clicking on links will eventually stop clicking. In our example, G is


G= \begin{bmatrix}3 \over 100 & 22 \over 25 & 91 \over 200& 47 \over 150& 1 \over 5\\91 \over 200 & 3 \over 100 & 3 \over 100 & 47 \over 150& 1 \over 5\\3 \over 100 & 3 \over 100 & 3 \over 100 & 47\over 150 & 1 \over 5\\91 \over 200 & 3 \over 100 & 3 \over 100 & 3 \over 100 & 1 \over 5\\3 \over 100 & 3 \over 100 & 91 \over 200& 3 \over 100 & 1 \over5\end{bmatrix}.


when m=0.85.

Here, we call the unique largest eigenvalue \lambda of G as Perron root (for Google matrix \lambda = 1), and \mathbf{x} is Perron vector if G\mathbf{x}=\lambda \mathbf{x} and ||\mathbf{x}||_1 = 1. G has unique dominant eigenvalue implies that there is the unique PageRank vector. (If you want to check convergence, see [4], p. 179)

Finally, Power Method on G ([2]), recursively compute \mathbf{x}_k = G\mathbf{x}_{k-1}, finds the limit of \mathbf{x}_k, say \mathbf{x}. This \mathbf{x} is PageRank vector. That is, PageRank of page u is \mathbf{x}(u) where \mathbf{x} is Perron vector and ||\mathbf{x}_1||=1. Note that \mathbf{x} is a probability vector reflects the possibility of click each webpage.

References

[1] Amy Langville and Carl Meyer (2003). Fiddling with PageRank.

[2] Amy Langville, Carl Meyer (2004). The Use of the Linear Algebra by Web Search Engines. IMAGE Newsletter. 33.

[3] Amy Langville and Carl Meyer (2004). Deeper Inside PageRank. Internet Mathematics. 1. 10.1080/15427951.2004.10129091.

[4] Amy Langville and Carl Meyer (2006). Google’s PageRank and Beyond: The Science of Search Engine Rankings

[5] Larry Page, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web.

[6] Roger Horn and Charles Johnson (1985). Matrix Analysis.

[7] Taher Haveliwala and Sepandar Kamvar (2003). The Second Eigenvalue of the Google Matrix.

Lecture 15 – Spectral Theory of Hypergraphs

Definition 1. A hypergraph \Gamma is a pair \Gamma = (V,E) where V is a finite set and E\subset 2^V is a nonempty collection of subsets of V. \Gamma is called k-uniform if
E\subset{V\choose k}:=\{X\subset V:|X| = k\}. \Gamma is called a graph if it is 2-uniform.

Our goal for this lecture is to explore the rudiments of the spectral theory of k-uniform hypergraphs. The natural starting point, if only for inspiration, is the case of graphs. Letting \Gamma = (V,E) be a graph, and identifying V = \{1,2,\dotsc,n\}, we can begin by writing down the adjacency matrix A = (A_{ij}) of \Gamma defined by


A_{ij}=\begin{cases} 1&\text{ if }i\sim j\\ 0&\text{ otherwise} \end{cases}, \hspace{1cm}1\leq i,j\leq n.


where i\sim j\iff \{i,j\}\in E. In this case i,j are said to be adjacent.

The spectral theory of A is rich. For example, it made an appearance in our 202C class for the special case where \Gamma is the Cayley graph of a group. Our main focus however will be the matrix L = (L_{ij}), defined by

L_{ij}=\begin{cases} d_{i}&\text{ if }i=j\\ -1&\text{ if }i\sim j\\ 0&\text{ otherwise} \end{cases}.

where the degree d_i is the number of edges containing i\in V. If D is the diagonal matrix of degrees, then we may also write L = D-A. This matrix L is called the Laplacian or Kirchoff matrix of \Gamma. The terminology arises from the fact that as an operator on \mathbb{C}^V, L displays various properties canonically associated with the Laplacian on \mathbb{R}^n; e.g., mean value property, maximum principle, etc.. A very interesting paper of Wardetsky, Mathur, Kälberer, and Grinspun[6] explores these connections in detail, but they are outside of the scope of this lecture.

A few facts are worth collecting here- first, L is a symmetric, positive semidefinite matrix in M_n and hence admits nonnegative eigenvalues 0=\lambda_0\leq\lambda_1\leq\cdots\leq\lambda_{n-1}. L always has zero as an eigenvalue, for example, via the vector \mathbf{1} = (1,\dotsc,1)^T. But the multiplicity of said eigenvalue could be higher, depending on the number of connected components (Exercise 1). However when we assume \Gamma is connected (i.e. contains a single connected component), it follows that \lambda_1>0. In much of the literature, we are often interested in the interplay and connections between geometric and combinatorial properties of \Gamma– things like its size, connectedness, density, or sparseness, for example; and the first eigenvalue \lambda_1 (either of L or one of its many variants). For example…

Exercise 1.
(a) Show that the geometric multiplicity of the null eigenvalue of L is the number of connected components of \Gamma.
(b) Show \lambda_1\leq n, with equality if and only if \Gamma is the complete graph. Hint: if \Gamma is not complete, relate its Laplacian and that of its “complement” (suitably defined) to the Laplacian of the complete graph, and use eigenvalue interlacing.

For a more detailed introduction to spectral graph theory, see Fan Chung’s book[1, Ch. 1].

Our next goal is to try and take these notions and structures and construct extensions, or analogues, for the case where \Gamma is a k-uniform hypergraph. As we shall see, this is not so easy. Consider if you desire the following perspective from simplicial homology (those unfamiliar or otherwise uninterested, skip to Definition 2).

One approach would be to try and construct L not as a specific matrix, but as a descendant of an invariant of the topology of the underlying graph; which we could then hope admits a generalization to the case where the graph is instead a hypergraph. Specifically, the idea is to look at the homology of \Gamma and build a Laplacian from its boundary operator- as follows.

Let \Gamma be a graph and enumerate E = \{e_1,\dotsc,e_m\}. To each unordered pair e_j=\{v,w\}\in E pick an orientation (v,w) or (w,v) and let \widetilde{E} be the set of all such oriented edges. Let S = (S_{v,e}) \in M_{n\times m} be the matrix whose rows are indexed by v\in V and whose columns are indexed by the ordered edges e\in\widetilde{E}, defined by


S_{v,e} = \begin{cases} 1&\text{if }e = (v,\cdot)\\ -1&\text{ if }e=(\cdot,v)\\ 0&\text{ otherwise} \end{cases}.


It is left as an (unassigned) exercise to verify that L=SS^T; and in fact that the choice of orientations/signs in the preceding setup is completely arbitrary. Thinking of S as a homomorphism mapping elements of the module \mathbb{Z} E to the module \mathbb{Z} V by matrix multiplication, we recognize S as the boundary operator that is related to the homology groups of \Gamma as a simplicial complex (see [3, Section 2.1, page 105]) and thus we have constructed L as an invariant of the topology of \Gamma (up to choices in the orientations of the edges).

But as it turns out, the capricious choice of orientation we made earlier to define S is what makes graphs special. As soon as we allow 3-edges, or 1-edges for that matter, it is not at all obvious how to “orient” them, or how to distinguish between orientations (compare to the natural symbolic relationship (v,w)=-(w,v)). This, in fact, is a critically important distinction between graphs and hypergraphs. The answer is that we need more than linear operators, and so we look to tensors. The exposition that follows roughly follows two papers of Qi[4,5].

Definition 2. A tensor \mathcal{T} of order m and dimension n over a field \mathbb{F} is a multidimensional array

\mathcal{T} = (t_{i_1i_2\cdots i_m}),\hspace{1cm}1\leq i_1,\dotsc,i_m\leq n.

where each entry t_{i_1i_1\cdots i_m}\in\mathbb{F}. We call \cal{T} an (m,n)-tensor over \mathbb{F}.

We can think of an (m,n)-tensor as a hypercubic array of size n\times n\times\cdots\times n, m times). All of the tensors presented here are understood to be over \mathbb{C}. The case (2,n) recovers the usual n\times n square matrices. If x\in\mathbb{C}^n then we define the product \mathcal{T}x^{m-1}\in\mathbb{C}^n by

(\mathcal{T}x^{m-1})_i = \sum_{i_2,\dotsc,i_m=1}^n t_{ii_2\cdots i_m}x_{i_2}\cdots x_{i_m},\hspace{1cm}1\leq i\leq n.

(The expression \mathcal{T}x^{m-1}\in\mathbb{C}^n is conventional notation.) Also as a matter of notation, for x\in \mathbb{C}^n and k\geq 0, let x^{[k]}\in\mathbb{C}^n be given by (x^{[k]})_i = x_i^k. There are lots of analogous notions and properties for tensors as compared to usual matrices; for example, we have the following formulation of the spectrum of a tensor.

Definition 3. Let \cal{T} be an (m,n)-tensor. Then we say \lambda\in\mathbb{C} is an eigenvalue of \cal{T} if there exists a nonzero vector x\in\mathbb{C}^n, called an eigenvector of \mathcal{T}, which satisfies \mathcal{T} x^{m-1} = \lambda x^{[m-1]}.

Notice that no (k,n) tensor \mathcal{T} for k\geq 3 defines a true linear operator on \mathbb{C}^n, and hence we may not a priori speak of eigenspaces, kernels, etc. and their linear algebraic structure. (These notions do have meaning, but they require the tools of algebraic geometry to properly formulate, and are outside of our scope.) Nevertheless, we can construct tensors associated to k-uniform hypergraphs that will give rise to various eigenvalues. And if so, can we find relationships between the geometry of a hypergraph and its eigenvalues?

Definition 4. Let \Gamma=(V,E) be a k-uniform hypergraph on n vertices. The adjacency tensor of \Gamma is the (k,n)-tensor \mathcal{A} = (a_{i_1\cdots i_k}) defined as follows. For each edge e = \{i_1,\dotsc,i_k\}\in E,

a_{i_1i_2\cdots i_k} = \frac{1}{(k-1)!},

and likewise for any rearrangment (i_1',\dotsc,i_k') of the vertices in e. The values of \mathcal{A} are defined to be zero otherwise. The degree tensor of \Gamma is the (k,n)-tensor \mathcal{D}=(d_{i_1\cdots i_k}) defined by

d_{i\cdots i} = d_i,

where d_i is the number of edges containing i, and the values of \mathcal{D} are defined to be zero otherwise. The Laplacian tensor is the (k,n) tensor \mathcal{L}= (l_{i_1\cdots i_k}) defined by \mathcal{L} = \mathcal{D}-\mathcal{A}.

A brief remark- our choice to focus on k-uniform hypergraphs is merely to simplify the exposition. There are well-developed versions of Definition 4 for general hypergraphs[2], but they are harder to write down and explore in this setting. With these fundamental notions established, the goal for the remainder will be to write down and explore a few basic propositions regarding the eigenvalues and eigenvectors of the tensors of a hypergraph \Gamma.

Proposition 1. Let \Gamma be a k-uniform hypergraph, with k\geq 3, on n vertices and \mathcal{L} its Laplacian tensor.

  • Let \mathbf{e}_j\in\mathbb{C}^n denote the j-th standard basis vector. Then \mathbf{e}_j is an eigenvector of \mathcal{L} with eigenvalue d_j.
  • The vector \mathbf{1} occurs as an eigenvector of \mathcal{L} with eigenvalue 0.

Proof. This is an exercise in calculations with tensors. For the first claim let 1\leq j\leq n be fixed, and notice that for any m\geq 1, \mathbf{e}_j^{[m]} = \mathbf{e}_j and hence for each 1\leq i\leq n, it holds

Notice that a_{i_1\cdots i_k} = 0 if any of the i_k‘s are repeated, and that d_{ij\cdots j} = d_i if i=j and 0 otherwise. It follows that \mathcal{L}\mathbf{e}_j^{k-1} = d_j\mathbf{e}_j^{[k-1]}. For the second claim, again we have \mathbf{1}^{[m]} = \mathbf{1} for any m, and hence

where, with some abuse of notation in the second line, we emphasize that the sum contains all permutations of the k-1 terminal points in each edge e=\{i,i_2,\dotsc,i_k\}\in E. -QED.

Exercise 2. Let \Gamma be a k-uniform hypergraph and \mathcal{A} its adjacency tensor. Show that \mathbf{e}_j occurs as an eigenvector of \mathcal{A} with eigenvalue 0 for each j. Sanity check: why does this not imply that \mathcal{A}=0?

Our next proposition is a statement about the distribution of eigenvalues.

Proposition 2. Let \Gamma be a k-uniform hypergraph on n vertices, and \mathcal{L} its Laplacian tensor. Let \Delta denote the maximum degree of \Gamma. Then all of the eigenvalues of \mathcal{L} lie in the disk \{\lambda\in\mathbb{C}:|\lambda-\Delta|\leq\Delta\}.

Proof. ([4, Thm. 6]) Let \lambda\in\mathbb{C} be an eigenvalue for \mathcal{L}, and let x\in\mathbb{C}^n be a corresponding eigenvector. Assume we have

|x_i| = \max_{j=1,\dotsc,n}|x_j| \neq 0.

Then by definition, it holds \mathcal{L}x^{m-1} = \lambda x^{[m-1]}, which in the i