Hamming cube of primes

Given two nonnegative integers, m and n, we say that they are Hamming-adjacent if and only if their binary expansions differ in exactly one digit. For example, the numbers 42 and 58 are Hamming-adjacent because their binary expansions 101010 and 111010 differ in a single position.

If m and n are Hamming-adjacent, then their absolute difference |n − m| is necessarily a power of 2. The converse is not true, though; 24 and 32 have a difference of 8, but their binary expansions 011000 and 100000 differ in three positions.
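
In code, Hamming-adjacency is simply the condition that the XOR of the two numbers has exactly one set bit. Here is a two-line Python check (mine, purely for illustration) covering both examples above:

def hamming_adjacent(m, n):
    """True iff the binary expansions of m and n differ in exactly one digit."""
    x = m ^ n                             # the bits where m and n differ
    return x != 0 and x & (x - 1) == 0    # exactly one bit set, i.e. a power of 2

print(hamming_adjacent(42, 58))    # True:  101010 vs 111010
print(hamming_adjacent(24, 32))    # False: difference is 8, but three bits differ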

We can form a countably infinite graph G on the set of all nonnegative integers by connecting two vertices if and only if they’re Hamming-adjacent. G is a bipartite graph: two integers can only be Hamming-adjacent if one is odious and the other is evil.

G is the union of nested hypercube graphs: the first 2^d nonnegative integers form a hypercube graph of degree d. For example, here’s the subgraph induced by the first 32 naturals:

What if we instead take the induced subgraph on only those vertices which are prime? For the primes below 32, we get the following graph:


It’s a connected graph! It remains connected if we repeat for the primes below 64:

What about for the primes below 128? Now it has an isolated vertex, 127, and an isolated edge:

When we go further and look at the primes below 512, the vertex 127 is no longer isolated; it has been connected into the main component via the prime 383:

Similarly, the edge between 89 and 73 becomes assimilated into the main component once we increase our horizon to the primes below 1024.

This raises the question: does every prime eventually become connected to the main component? Or, equivalently, when we form the countably infinite induced subgraph H of G whose vertices are all of the primes, is H connected?

Isolated vertices

It turns out that the answer is no: the prime 2131099 is an isolated vertex in H with no neighbours whatsoever! (It is plausibly the smallest such example.)

How does one even prove that 2131099 is an isolated vertex? In other words, how do we show that every integer Hamming-adjacent to it is composite? Firstly, note that the Hamming-adjacent integers smaller than 2131099 form a finite set, obtained by subtracting one of the powers of 2 in its binary expansion:

33947, 2098331, 2130075, 2130971, 2131083, 2131091, 2131097, 2131098

These can all be verified to be composite. But what about the infinitely many Hamming-adjacent integers that are larger than it? These are necessarily of the form $2131099 + 2^k$ for some value of k. It transpires that every element in this set must be divisible by at least one of the primes {3, 5, 7, 13, 17, 241}, called the covering set. In particular, we have:

  • k = 1 (mod 2) implies divisible by 3;
  • k = 0 (mod 4) implies divisible by 5;
  • k = 1 (mod 3) implies divisible by 7;
  • k = 2 (mod 12) implies divisible by 13;
  • k = 2 (mod 8) implies divisible by 17;
  • k = 6 (mod 24) implies divisible by 241;

and together these cover all residue classes modulo 24.
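
Both halves of this argument are easy to check by machine. Here is a short verification in Python (assuming sympy is available for the primality tests), exploiting the fact that the multiplicative order of 2 modulo each covering prime divides 24:

from sympy import isprime

m = 2131099
cover = [3, 5, 7, 13, 17, 241]

assert isprime(m)

# smaller neighbours: clear each set bit in turn; all of them are composite
assert all(not isprime(m ^ (1 << k))
           for k in range(m.bit_length()) if (m >> k) & 1)

# larger neighbours: m + 2^k is divisible by a covering prime for every k.
# Since the order of 2 modulo each covering prime divides 24, the residue of
# m + 2^k depends only on k mod 24, so checking k = 0, ..., 23 suffices.
assert all(any((m + (1 << k)) % p == 0 for p in cover) for k in range(24))

print("all checks passed")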

We can go further and show that there are infinitely many such isolated vertices. In particular, every number of the following form:

n = 1885107369798300628981297352055 h + 3316923598096294713661

has a covering set of primes for all numbers of the form n \pm 2^k, as discovered by Christophe Clavier. Specifically, there are two covering sets of primes: one for the ‘+’ case and one for the ‘−’ case; the union of these two sets is just the set:

{3,5,7,11,13,17,19,31,37,41,73,97,109,151,241,331,673,1321}

of prime divisors of the linear coefficient.

So, we just need to show that there are infinitely many primes of this form that are not Hamming-adjacent to any of the primes in the above set. The latter condition is easy to enforce — for example, by insisting that n is congruent to 4095 mod 4096 (i.e. its binary expansion ends in twelve ‘1’s). Then we are left with an arithmetic progression whose offset and common difference are coprime, and a theorem of Dirichlet states that there are infinitely many primes in such an arithmetic progression.

A difficult conjecture

Given the existence of infinitely many isolated vertices, we postulate the following conjecture:

Conjecture: Other than the infinitely many isolated vertices, there is exactly one connected component in H.

What would a counterexample look like? The smallest such counterexample would be a pair of Hamming-adjacent primes p > q with no further prime neighbours. The usual method of covering sets would only work under the following set of conditions:

  1. p has a covering set P preventing any larger Hamming-adjacent primes;
  2. q has a covering set Q preventing any larger Hamming-adjacent primes, with the single exception of p, and this is only possible if p is itself a member of Q;
  3. The finitely many numbers smaller than and Hamming-adjacent to p all happen to be composite, with the single exception of q;
  4. The finitely many numbers smaller than and Hamming-adjacent to q all happen to be composite.

The second of these conditions is incredibly implausible, because the primes appearing in a covering set are usually much smaller than the dual Sierpinski numbers that arise from such a covering set. Here we are asking for the prime in the covering set to be larger!

Note that this conjecture is definitely unsolved, because it’s stronger than the conjecture that there are primes of every Hamming weight.


Determinacy

I’d like to take this opportunity to highly recommend Oscar Cunningham’s blog. One of the posts, entitled A Better Representation of Real Numbers, describes an elegant order-preserving* bijection between the nonnegative reals [0, ∞) and ‘Baire space‘, \mathbb{N}^{\mathbb{N}}, the space of infinite sequences of nonnegative integers.

*where the order on the Baire space is defined by lexicographical order.

Ultimately, it is built up recursively from just two ideas:

  1. There is an order-preserving homeomorphism between [0, 1) and [0, ∞), namely x ↦ x/(1 − x).
  2. [0, ∞) can be partitioned into countably many copies of [0, 1), stacked end-to-end, one for each natural number.
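
As a concrete sketch of this recursion (my reading of the construction, not code from the original post), the following Python snippet computes the first few terms of the sequence associated to a nonnegative rational, using exact arithmetic:

from fractions import Fraction

def baire_digits(x, terms=10):
    """First `terms` entries of the integer sequence associated to x >= 0:
    peel off floor(x), then send the fractional part back to [0, oo) via f/(1-f)."""
    x = Fraction(x)
    out = []
    for _ in range(terms):
        n = int(x)            # which copy of [0, 1) we are in
        out.append(n)
        f = x - n             # fractional part, in [0, 1)
        x = f / (1 - f)       # order-preserving homeomorphism [0, 1) -> [0, oo)
    return out

print(baire_digits(Fraction(1, 2)))    # [0, 1, 0, 0, ...]
print(baire_digits(Fraction(22, 7)))   # rationals give eventually-zero sequences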

Baire space and [0, ∞) both admit natural topologies (the product topology and the usual topology, respectively) and this order-preserving bijection is continuous in one direction: from Baire space to the nonnegative reals.

If we allow the first element of the integer sequence to be negative, then this gives an order-preserving bijection between the entirety of \mathbb{R} and the space \mathbb{Z} \times \mathbb{N}^{\mathbb{N}}. Rather similarly to the better known continued fraction representation, rationals are mapped to eventually-zero sequences and quadratic irrationals to eventually-periodic sequences.

Gale-Stewart games

Given a subset X of Baire space, there is an associated two-player game (the Gale-Stewart game G(X)) where players alternately exclaim nonnegative integers:

  • Player A plays a nonnegative integer z0;
  • Player B plays a nonnegative integer z1;
  • Player A plays a nonnegative integer z2;
  • Player B plays a nonnegative integer z3;

and so on, ad infinitum. Player A wins if the sequence (z0, z1, z2, z3, …) belongs to the set X; Player B wins otherwise.

A game G(X) and the corresponding set X are described as determined if one of the players has a winning strategy. To be precise, a strategy is a function from finite initial segments of play to the nonnegative integers, specifying which move one makes after seeing a particular sequence of moves.
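
As a toy illustration of this definition (the strategies below are arbitrary examples of mine, and of course only a finite prefix of a play can be printed), two strategies determine a play as follows:

def play(strategy_A, strategy_B, rounds):
    """First `rounds` moves; a strategy maps the finite history (a tuple) to a move."""
    history = []
    for i in range(rounds):
        mover = strategy_A if i % 2 == 0 else strategy_B
        history.append(mover(tuple(history)))
    return history

# toy strategies: A copies B's previous move (opening with 0), B plays the round number
A = lambda h: h[-1] if h else 0
B = lambda h: len(h)
print(play(A, B, 8))    # [0, 1, 1, 3, 3, 5, 5, 7]; the winner depends on whether
                        # the infinite continuation of this play lies in X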

It is somewhat unintuitive that non-determined games can exist: after all, it’s impossible for a game to end in a draw, so one may naively conclude that ‘under perfect play’ the game will be won by A or by B. However, ‘perfect play’ is not well defined for these infinite games, and it’s possible to have an undetermined game with the following properties:

  • for every strategy of player A, there exists a player B strategy that beats it;
  • for every strategy of player B, there exists a player A strategy that beats it;

and these are perfectly consistent statements. In ZFC, it’s possible to build an undetermined game using transfinite induction. The idea is to iterate over the tuples of the form (player, strategy, initial_segment_of_play) and to ‘foil’ that strategy by fixing the game outcome to be a loss for that player if that player follows the strategy and the other player plays some other strategy. This is possible because the other player can force the game to hit any of |\mathbb{R}| different outcomes, and thus far in the transfinite induction we’ve fixed strictly fewer than that many points in the Baire space, so there’s always a game outcome that we haven’t yet fixed and can freely set to be a win for the other player. (At the end of this transfinite induction, we likely only have a partially-specified game, but that’s fine; we just set the remaining outcomes arbitrarily, e.g. to be wins for Player B.)

Surprisingly, it’s consistent with ZF (sans choice) that every Gale-Stewart game is determined. This is called the axiom of determinacy. It’s false in ZFC, for the reasons stated above, but weaker versions of this axiom can be proved in ZFC. For instance:

Every open subset of Baire space is determined

and, more generally:

Every Borel subset of Baire space is determined

Consequences of large cardinals

Assuming certain large cardinal axioms, it is possible to show that various generalisations of Borel sets are also determined.

One example of a large cardinal axiom is the existence of a measurable cardinal, an uncountable cardinal S such that there exists a two-valued measure \mu : \mathbb{P}(S) \rightarrow \{0, 1\} satisfying:

  • μ(A) = 0 if and only if μ(S \ A) = 1;
  • if μ(A) = 0 and B is a subset of A, then μ(B) = 0;
  • singletons are small, i.e. μ({p}) = 0 for all points p in S;
  • the union of a collection of strictly fewer than |S| sets of measure 0 again has measure 0.

Without the ‘uncountable’ constraint, \aleph_0 would be measurable because this is equivalent to the definition of a non-principal ultrafilter.

If such a measurable cardinal exists, then it follows that every analytic set (continuous image of a Borel set) is determined, as is every coanalytic set (complement of an analytic set). A proof of this is here; I first encountered it in a 2015 graduate course on Descriptive Set Theory by Adrian Mathias.

Analytic and coanalytic sets together form the 1st level of the projective hierarchy; the 2nd level consists of projections of coanalytic sets and complements thereof; and so forth. (The Borel sets can be viewed as the 0th level of the projective hierarchy.)

If there are n Woodin cardinals with a measurable cardinal above them, then every set in the (n+1)th level of the projective hierarchy is determined. If there are infinitely many Woodin cardinals, then this holds for every n, and therefore every projective set is determined. This was proved in a paper by Martin and Steel.

Finally, if there’s a measurable cardinal above this infinite sequence of Woodin cardinals, then L(\mathbb{R}) (the constructible closure of the reals) satisfies the axiom of determinacy.

Consequences of determinacy

Determinacy has been formulated in terms of infinite games, but it is also of interest in real analysis because it implies various regularity properties about sets of reals. In particular, if a pointclass (family of sets) \mathcal{A} is closed under continuous preimages, such as those we have just considered, then every set in \mathcal{A} being determined also implies that the sets in \mathcal{A}:

  • are Lebesgue measurable;
  • have the property of Baire;
  • have the perfect set property.

For example, if the axiom of determinacy holds in L(\mathbb{R}), then all three of these regularity properties hold for every set in \mathbb{P}(\mathbb{R}) \cap L(\mathbb{R}).

In Strong Axioms of Infinity and the Search for V (definitely worth reading), W. Hugh Woodin makes a strong case in favour of projective determinacy: projective sets are precisely the relations that can be formulated in second-order arithmetic, and assuming projective determinacy allows one to settle most natural questions in second-order arithmetic.


One-way permutations

One-way permutations are fascinating functions that possess a paradoxical pair of properties. They are efficiently computable functions φ from [n] := {0, 1, …, n−1} to itself that are:

  • Mathematically invertible, meaning that φ is a bijection;
  • Cryptographically uninvertible, meaning that it’s computationally infeasible in general to compute x given φ(x).

We’ll discuss a couple of constructions of one-way permutations. The first construction is simpler to understand, but is less secure (for a given size of n) or less efficient (for a given security level).

Construction I: using modular arithmetic

Let p be a large Sophie Germain prime; that is to say, a prime such that q = 2p + 1 is also prime. Then the field \mathbb{F}_q has a cyclic multiplicative group of 2p elements, and we can find a generator g for this multiplicative group.

Letting n = q − 1 = 2p, we can define a one-way permutation from [n] to itself as follows:

φ(x) := g^x − 1

The function g^x is a bijection from [n] to the set of nonzero elements modulo q, and the offset of − 1 maps this back to the interval [n]. It is easy to compute in the forward direction, by repeated squaring, but is intractable to invert when n is sufficiently large. The problem of inverting modular exponentiation is the discrete logarithm problem, and the largest safe prime for which this problem has been broken is 768 bits long.
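
Here is a toy Python instance (p = 11 and g = 5 are illustrative choices of mine, hopelessly insecure at this size) showing the forward map and the brute-force inversion an attacker would face at real sizes:

p, q = 11, 23          # Sophie Germain pair: q = 2p + 1
g = 5                  # a generator of the multiplicative group mod 23
n = q - 1              # = 2p = 22

def phi(x):
    return pow(g, x, q) - 1          # easy: repeated squaring

def phi_inverse(y):                  # discrete logarithm by exhaustive search;
    return next(x for x in range(n) if phi(x) == y)    # infeasible at real sizes

assert sorted(phi(x) for x in range(n)) == list(range(n))   # phi is a permutation of [n]
print(phi(13), phi_inverse(phi(13)))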

To achieve a security level of 128 bits, NIST recommends using a 3072-bit value of n. The reason that this is so large is because the discrete logarithm problem is susceptible to index calculus attacks.

Attempting to use elliptic curves

Recall that an elliptic curve is a nonsingular cubic algebraic curve. By using an appropriate change of variables, it can be written in the Weierstrass normal form as the equation y² = x³ + ax + b, where the nonsingularity condition is equivalent to the constraint that the discriminant 4a³ + 27b² is nonzero.

Elliptic curves admit a group operation. Specifically, to add two points A and B, do the following:

  • draw a line through A and B, and let it intersect the curve again at C;
  • draw a line through C and the point at infinity, and let it intersect the curve again at another point (A+B).

This operation is manifestly commutative, has an identity (the point at infinity), and admits inverses. The operation is also associative; as mentioned before, this can be proved using Cayley-Bacharach.

In cryptography, it’s customary to work with elliptic curves over finite fields such as \mathbb{F}_p. A theorem of Hasse states that the number |E| of points on the elliptic curve E is approximately equal to p, differing by O(sqrt(p)). Particularly useful is when the group of points on the curve is cyclic (a sufficient condition is for |E| to be squarefree), as then we can find a generator point G and formulate an analogue of the discrete logarithm problem.

For a given group size, the elliptic curve discrete logarithm problem is more difficult than the ordinary discrete logarithm problem, since for general elliptic curves there are no known algorithms faster than the Pollard rho algorithm. Note that some curves are much less secure than others; Daniel J. Bernstein has a website recommending how to choose a good curve.

Instead of the group size of 2^3072 recommended by NIST to provide 128-bit security for the ordinary discrete logarithm problem, it suffices to have an elliptic curve with group size approximately 2^256. The security of the digital signatures in the Bitcoin network is based on the intractability of solving the discrete logarithm problem on a particular elliptic curve (called secp256k1) of approximately this size.

Unfortunately, there is no easy way to bijectively map the points of this elliptic curve to an interval of the correct size. It was possible with the previous group, Construction I, thanks to the fact that the multiplicative subgroup of \mathbb{F}_q is exactly the interval {1, 2, 3, …, q − 1}, and that can be mapped to the standard interval {0, 1, 2, …, q − 2} by subtracting 1. But the same thing cannot be done for elliptic curves because the usual way of representing points (x, y) as the pair of the x-coordinate and the sign of y is not bijective: roughly half of the numbers in {0, 1, …, p − 1, ∞} cannot occur as valid values of x because the corresponding values of x³ + ax + b are quadratic nonresidues (so cannot equal y²).

There is a class of elliptic curves for which we can efficiently biject the points with an interval, namely supersingular elliptic curves, but unfortunately this just ends up as a disguised version of Construction I: the elliptic curve discrete logarithm problem for these curves is reducible to discrete logarithm modulo p.

Construction II: twists to the rescue!

If d is a quadratic nonresidue, then the following pair of elliptic curves:

  • E: y² = x³ + ax + b;
  • E′: y² = d(x³ + ax + b);

are known as quadratic twists of each other. The values of x which lie on E′ precisely fill in the ‘holes’ formed by the values of x which don’t lie on E, because d(x³ + ax + b) is a quadratic residue precisely when x³ + ax + b is not! (Some extra care is needed to handle the points where y² is 0 or ∞.)

Specifically, we have:

|E| + |E′| = 2p + 2

and, importantly, we have an explicit bijection between [2p + 2] and the union of the two elliptic curves. As such, if E and E′ are both cyclic groups and have intractable discrete logarithm problems, then we can manufacture a one-way permutation on [2p + 2] as I describe in this StackExchange answer. This construction appears to have been discovered in 1991 by Burton S. Kaliski Jr.
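
As a sanity check of the counting identity, here is a brute-force Python computation over a toy prime; the curve parameters are arbitrary choices of mine, not a recommended curve:

def count_points(p, a, b, d=1):
    """Number of points on y^2 = d(x^3 + ax + b) over F_p, including infinity."""
    squares = {(y * y) % p for y in range(p)}
    total = 1                                   # the point at infinity
    for x in range(p):
        rhs = d * (x**3 + a*x + b) % p
        if rhs == 0:
            total += 1                          # a single point, with y = 0
        elif rhs in squares:
            total += 2                          # two square roots
    return total

p, a, b = 23, 1, 1
d = next(d for d in range(2, p) if pow(d, (p - 1) // 2, p) == p - 1)  # a nonresidue
assert count_points(p, a, b) + count_points(p, a, b, d) == 2 * p + 2
print(count_points(p, a, b), count_points(p, a, b, d))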

Avoiding small cycles

Let’s suppose we want to use this one-way bijection inside a pseudorandom number generator, similar to Blum-Blum-Shub but without the existence of a trapdoor (the factorisation of M = pq). It is entirely plausible that φ contains fixed points or short cycles, which would result in periodic output from the pseudorandom number generator.

In particular, if we apply Construction II to a pair of curves with p = 2^255 − 19, we obtain a one-way permutation on [2^256 − 36]. For convenience, we can pad this to a permutation on [2^256] by extending it with the identity function on the remaining 36 points. This permutation has at least 36 fixed points, and plausibly a few more than that.

We’ll extend this to a one-way permutation on [2^384] as follows:

Split the input into a 256-bit ‘upper part’ and a 128-bit ‘lower part’. Apply the one-way permutation to the upper part, and apply a period-2^128 permutation (such as an appropriate affine function modulo 2^128) to the lower part. Then XOR the lower part into the upper part, in-place, to ‘perturb’ it.

By looking at the lower part in isolation, we know that this permutation does not have any cycles of length shorter than 2^128. Moreover, the interaction from the lower part to the upper part is designed to ensure that every orbit will benefit from the hardness of the elliptic curve discrete logarithm problem; without it, the 36 trivial fixed points of the permutation on [2^256] would each give rise to a weak random number generator (a 128-bit LCG).
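
Here is a structural sketch of that 384-bit permutation in Python. The elliptic-curve permutation is replaced by a trivial placeholder (so this sketch is emphatically not one-way), and the LCG constants are arbitrary choices of mine satisfying the usual full-period conditions (multiplier ≡ 1 mod 4, odd increment):

MASK128 = (1 << 128) - 1

def one_way_perm_256(x):
    # placeholder for the twist-based one-way permutation on [2^256];
    # any bijection will do for illustrating the cycle structure
    return x ^ 0xDEADBEEF

def lcg_128(x):
    # full-period affine permutation of [2^128]
    return (x * 0xB5AD4ECEDA1CE2A9 + 1) & MASK128

def update(state):
    """One step of the 384-bit permutation: state is an integer in [2^384]."""
    upper, lower = state >> 128, state & MASK128
    upper = one_way_perm_256(upper)    # the hard-to-invert part
    lower = lcg_128(lower)             # guarantees every cycle has length >= 2^128
    upper ^= lower                     # perturb the upper part with the lower part
    return (upper << 128) | lower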

To use this as an actual cryptographic random number generator, you’d need to choose a suitable output function — for instance, taking a RIPEMD-160 hash* of the 384-bit internal state — and then be careful** to implement the whole thing using constant-time instructions to avoid timing attacks. (On most CPUs, for example, the multiplication instruction lacks this property.)

*practical cryptographic random number generators such as ChaCha20 use less computationally intensive output functions, but in this case we don’t care because it still pales in comparison to the amount of computation in the state update function.

**harder than it seems, because the compiler may optimise your constant-time code into faster variable-time code. Unless you absolutely need the property that its previous internal states cannot be deduced from its current internal state, I’d recommend using a well-tested implementation of ChaCha20 instead of this 384-bit PRNG.


Cyclotomic fields

The nth cyclotomic field is the field \mathbb{Q}[\zeta] generated by a primitive nth root of unity, ζ. It is an example of a number field, consisting of algebraic numbers, and its dimension is φ(n) when regarded as a vector space over \mathbb{Q}. If n is odd, then the nth cyclotomic field is identical to the 2nth cyclotomic field, so we’ll henceforth assume without loss of generality that n is even.

The intersection of \mathbb{Q}[\zeta] with the ring of algebraic integers is exactly the ring \mathbb{Z}[\zeta] generated by ζ. The first few examples of rings of this form are:

  • n = 2: \zeta = -1, so \mathbb{Z}[\zeta] is just the ring of ordinary integers \mathbb{Z}.
  • n = 4: \zeta = i, so \mathbb{Z}[\zeta] is the ring of Gaussian integers (red, below-left).
  • n = 6: \mathbb{Z}[\zeta] is the ring of Eisenstein integers (blue, below-right).

These three rings admit unique factorisation, meaning that every nonzero element can be uniquely expressed as a product of primes, up to reordering and unit migration. This turns out to be true for precisely the following thirty values of n:

2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
32, 34, 36, 38, 40, 42, 44, 48, 50, 54, 60, 66, 70, 84, 90

The corresponding values of φ(n) are:

1, 2, 2, 4, 4, 4, 6, 8, 6, 8, 10, 8, 12, 12, 8,
16, 16, 12, 18, 16, 12, 20, 16, 20, 18, 16, 20, 24, 24, 24

Note that, in particular, the maximum dimension is φ(n) = 24, attained only by the three largest values of n (70, 84, and 90). These rings, viewed as additive groups, are therefore free Abelian groups of rank 24.
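
These totients are easy to double-check (this snippet assumes sympy is available):

from sympy import totient

ns = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
      32, 34, 36, 38, 40, 42, 44, 48, 50, 54, 60, 66, 70, 84, 90]
print([totient(n) for n in ns])             # reproduces the list of dimensions above
print([n for n in ns if totient(n) == 24])  # [70, 84, 90]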

And this is where it gets exciting: David Madore noticed that the existence of order-70 and order-84 elements in the Conway group Co0 gives rise to highly rotationally symmetric projections of the Leech lattice into the complex plane. Here’s the order-84 projection applied to the first shell:

Projection of the first shell of the Leech lattice into the plane with order-84 rotational symmetry, by David Madore

And here’s the order-70 counterpart:

Projection of the first shell of the Leech lattice into the plane with order-70 rotational symmetry, by David Madore

By suitably scaling and rotating either of these projections, we can arrange for the image to contain the respective nth roots of unity. If we view the resulting projection as a map from \mathbb{Q}^{24} to \mathbb{Q}[\zeta], we can see that it is surjective (because its image contains the roots of unity, and these span the field) and therefore bijective (because the dimensions match).

In particular, that means that the projection remains injective when restricted to the Leech lattice, and its image must be a rank-24 Abelian subgroup of \mathbb{Q}[\zeta] which is closed under multiplication by roots of unity. Given that it has a finite integral basis, we can further scale/rotate the projection such that the images are all algebraic integers and therefore form an ideal in \mathbb{Z}[\zeta].

Thanks to these rings having class number 1, every ideal is principal, so this ideal must simply be a scaled/rotated copy of \mathbb{Z}[\zeta] itself! That is to say, when n is 70 or 84, we can identify the points in the Leech lattice with the algebraic integer elements of the cyclotomic field!

Working in the opposite direction, we can construct the Leech lattice starting from one of these cyclotomic fields by defining an inner product on its algebraic integer elements:

\langle u, v \rangle = tr(\alpha u v^{\star})

where α is a particular element of the ring (different choices yield different lattices), and tr is the trace operation (sum of all Galois conjugates).

There are similar constructions of the Leech lattice for some other values of n, where the class number is greater than 1 (and therefore unique factorisation does not apply), but the constructions could potentially require using a proper ideal rather than the whole ring of algebraic integer elements. For instance, Craig discovered a construction for n = 78, and Eva Bayer-Fluckiger gave a more thorough exploration.

The cases of n = 70 and n = 84 are arguably the most interesting, because they endow the Leech lattice with a compatible ring structure where unique factorisation holds: the nonzero lattice points form a multiplicative monoid isomorphic to the direct sum of a cyclic group of order n (the group of roots of unity) and the free commutative monoid on countably many generators.

Of these, the n = 84 ring is my favourite: it contains the Gaussian, Eisenstein, and Kleinian integers as subrings, and indeed is the smallest cyclotomic ring to do so.


Urbit for mathematicians

The three months since the last cp4space article have largely been spent learning about an interesting project called Urbit. My attention was drawn to its existence during a talk by Riva-Melissa Tez (formerly of Intel) on the importance of continued improvements to the computational efficiency of hardware.

A decent amount has been written on the Internet about Urbit, but it tends to either be non-technical, explaining the importance of the project, or technical but written in highly unconventional project-specific jargon. Instead, we’ll depart from either of these by examining the novel internals of Urbit that are most likely to be of interest to mathematicians and theoretical computer scientists.

Before we can do so, however, it is worth asking what Urbit is. This is often a source of confusion, because it is an ecosystem of multiple interacting components designed to mesh well together. If you were to metaphorically run 2-means on the set of Urbit components, they would cleanly separate into the operating system (Urbit OS) and identity system (Urbit ID).

The most salient components of Urbit OS are:

  • Arvo: a deterministic operating system kernel implemented in a pure functional programming language (Hoon).
  • Nock: a very low-level functional programming language. It’s roughly halfway between a minimalistic Lisp dialect and the SKI combinator calculus. In the Urbit ecosystem, Nock is the machine code for the virtual machine in which the Urbit OS runs.
  • Vere: a software implementation (written in C) of the Nock virtual machine, capable of running the VM inside a Unix environment.
  • Hoon: a higher-level functional programming language in which the Urbit OS is directly written. This ultimately compiles down to Nock.

The identity system, Urbit ID, is a hierarchical system of integer addresses (analogous to IPv4 addresses) called Azimuth points. Azimuth points are securely associated to public keys (either mutably or immutably), and communication between machines running Urbit is built upon this public key infrastructure.

Azimuth points and their hierarchy

As mentioned, Azimuth points are non-negative integers. They are divided into five levels:

  • Points in [0, 2^8 − 1] are called galaxies.
  • Points in [2^8, 2^16 − 1] are called stars.
  • Points in [2^16, 2^32 − 1] are called planets.
  • Points in [2^32, 2^64 − 1] are called moons.
  • Points in [2^64, 2^128 − 1] are called comets.

These can cumulatively be represented by the datatypes uint8, uint16, uint32, uint64, and uint128, respectively. These datatypes possess standard ‘restriction maps’ familiar from using a programming language such as C:

uint64 → uint32 → uint16 → uint8

where, for instance, a 32-bit integer is coerced into a 16-bit integer by taking the lower half (or, equivalently, reducing modulo 2^16). There are two elegant ways to impose a ring structure on each of these 2^n-bit datatypes:

  • Integers modulo 2^(2^n), which form a ring;
  • Nimbers less than 2^(2^n), which form a finite field.

The first choice is the familiar one (in the C programming language, the operators {+, *, -} coincide with modular arithmetic in Z/(2^2^n)Z). In the second choice, ring addition is bitwise XOR, and ring multiplication is slightly more complicated to define.

For the first choice, these restriction maps are ring homomorphisms. (For the nimber field structure, the homomorphisms go the other way: each smaller datatype is a subfield of the larger one, and a finite field has no proper quotient rings, so truncation cannot respect nimber multiplication.)
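
For concreteness, here is a short (unoptimised) Python implementation of nimber multiplication, using the classical recursion on Fermat 2-powers F = 2^(2^k), for which F ⊗ F = 3F/2 and F ⊗ x = Fx whenever x < F:

from functools import lru_cache

@lru_cache(maxsize=None)
def nim_mul(a, b):
    """Nimber (Conway) product of two nonnegative integers."""
    if a < 2 or b < 2:
        return a * b
    F = 2
    while F * F <= max(a, b):      # largest Fermat 2-power 2^(2^k) <= max(a, b)
        F *= F
    a1, a0 = divmod(a, F)          # split into high and low halves
    b1, b0 = divmod(b, F)
    high  = nim_mul(a1, b1)
    cross = nim_mul(a1, b0) ^ nim_mul(a0, b1) ^ high
    return cross * F ^ nim_mul(high, F // 2) ^ nim_mul(a0, b0)

# nimber addition is XOR; for example, the nimbers below 4 form the field GF(4):
print(nim_mul(2, 2), nim_mul(2, 3), nim_mul(3, 3))   # 3 1 2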

Now, how do these restriction maps relate to Urbit? They impose a partial order on the set of Azimuth points, where x ≤ y if x is the image of y under a restriction map. The Hasse diagram of this partial order is a forest of 256 trees, each of which has a galaxy at its root. This Hasse diagram determines, for each Azimuth point that’s not a galaxy or comet, a unique parent of that Azimuth point. Most planets, for instance, have stars as parents.

The parenthood relationship for comets is different: the parent of a comet is its image under the restriction map uint128 → uint16, which must be one of five stars willing to host comets. Comets are direct hashes of public keys (so, unlike planets, can’t be transferred between different owners).
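
In Python, the level boundaries and the sponsorship rules just described look like this (a sketch; the function names are mine):

def az_class(p):
    if p < 2**8:   return "galaxy"
    if p < 2**16:  return "star"
    if p < 2**32:  return "planet"
    if p < 2**64:  return "moon"
    if p < 2**128: return "comet"
    raise ValueError("not an Azimuth point")

def parent(p):
    """Parent under the restriction maps: stars -> galaxies, planets -> stars,
    moons -> planets, and comets straight down to stars."""
    cls = az_class(p)
    if cls == "galaxy": return None        # galaxies sit at the roots of the forest
    if cls == "star":   return p % 2**8
    if cls == "planet": return p % 2**16
    if cls == "moon":   return p % 2**32
    return p % 2**16                       # comet: restriction uint128 -> uint16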

Planets, stars, and galaxies are tradable as non-fungible tokens on the Ethereum blockchain. This gives a modifiable decentralised mapping from each Azimuth point (at most 32 bits, and easily memorable) to an Ethereum address (160 bits and cryptographically secure). As such, it’s possible to trustlessly verify the owner of an Azimuth point and therefore set up a secure communication channel with that Azimuth point; all communications between machines running Urbit OS are established in this manner.

Names of Azimuth points

The integer representing an Azimuth point is converted into a human-readable name (called an ‘@p’) through the following process:

  1. For planets and moons (not for stars and galaxies), a Feistel cipher is applied to the lowest 32 bits of the integer. This is a reversible permutation designed to obfuscate the relationship between the name of a planet and its parent star, reflecting the fact that planets are permitted to move freely between stars.
  2. The bytes of the resulting (enciphered) integer are each converted to a three-letter syllable, using a separate lookup table for even-indexed bytes (suffixes) and odd-indexed bytes (prefixes). Each syllable consists of a generalised vowel flanked by two consonants.
  3. The syllables are written in big-endian, with a hyphen between each 16-bit word and a tilde at the beginning.

An example of a 32-bit planet name is ~sorreg-namtyv, which belongs to the departed founder of Urbit.

A more comprehensive description of the process, including the two 256-element syllable lookup tables, is given in this StackOverflow answer.

In addition to the pronounceable ‘@p’ identifier, there is a visual encoding called a sigil. This stores each byte of the (enciphered!) 32-bit integer in a separate quadrant of a 2-by-2 arrangement, again using a pair of 256-element lookup tables. The sigil for the planet ~firmer-hatful is shown here:

sigil for ~firmer-hatful

Gavin Atkinson describes the design decisions involved in creating this system of sigils. Planets with circular sigils and/or meaningful @p names (e.g. ~timber-walled) tend to be in greater demand, and therefore more expensive, due to the combination of higher scarcity and aesthetic appeal.

The lookup tables mapping bytes to quadrant tiles were populated manually (by drawing 512 tile designs), so the Urbit sigils can be thought of as partially procedurally generated. [A fully procedurally-generated system may also have worked, such as the system in this Wolfram blog post from 2009, but this is unlikely to change now.] This is, nonetheless, highly consistent with the @p names; the syllable lookup tables were also manually populated.

Nock and its emulator

We proceed to turn our gaze from Urbit ID to Urbit OS.

Nock is a low-level pure functional programming language described here. Code and data are simply binary trees (called nouns) whose leaves are nonnegative integers. The Nock interpreter is a unary function * whose domain and codomain are both the space of nouns; its semantics are defined recursively in terms of itself and other unary functions.

My favourite of these functions is the tree-addressing operator, /, which is defined as follows:

/[1 a]              a
/[2 a b]            a
/[3 a b]            b
/[(a + a) b]        /[2 /[a b]]
/[(a + a + 1) b]    /[3 /[a b]]
/a                  /a

In particular, if we let L(x) and R(x) refer to the left and right subtrees of a tree x, then we have:

  • /[1 x] = x
  • /[2 x] = L(x)
  • /[3 x] = R(x)
  • /[4 x] = L(L(x))
  • /[5 x] = R(L(x))
  • /[6 x] = L(R(x))
  • /[7 x] = R(R(x))

and so forth. In general, /[n x] can be evaluated by:

  • expressing n in binary (for example, 1729 is “11011000001”);
  • removing the leading ‘1’ (in this example, giving “1011000001”);
  • replacing each ‘0’ with ‘L’ and ‘1’ with ‘R’ (in this example, “RLRRLLLLLR”);
  • taking x and sequentially applying each of the characters to it (in this example, giving R(L(L(L(L(L(R(R(L(R(x)))))))))), noting that the outermost function is the least significant bit).
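
Here is a tiny Python model of / (not Urbit’s implementation), representing nouns as nested 2-tuples with integer leaves:

def axis(n, noun):
    """Return the subtree of `noun` at Nock tree address n >= 1."""
    if n == 1:
        return noun
    rest, bit = divmod(n, 2)        # strip the least significant bit...
    subtree = axis(rest, noun)      # ...which is applied last (outermost)
    return subtree[bit]             # bit 0 -> left subtree, bit 1 -> right subtree

noun = ((1, 2), (3, (4, 5)))
print(axis(6, noun))                # L(R(noun)) = 3
print(axis(7, noun))                # R(R(noun)) = (4, 5)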

If we enumerate the nodes in an infinite binary tree according to the natural number required to address the subtree rooted at that node using the / operator, this agrees with a breadth-first traversal of the tree. This is the same enumeration (the salmon spiral below) that relates the Calkin-Wilf sequence to the Calkin-Wilf tree, shown in this illustration by David Eppstein:

The operator / is useful enough that Lisp dialects typically implement the special cases of this function for all 2 ≤ n ≤ 31, referring to them as {car, cdr, …, cddddr}.

Nock, if evaluated naively, would be incredibly inefficient: decrementing an integer n takes time proportional to n. The emulator therefore uses a clever idea called jets: runtime optimisations where the emulator recognises particular patterns (such as the standard decrement function applied to an integer) and calls a native C function to decrement the integer directly.

The Urbit project has succeeded in implementing an entire operating system kernel as a function in this language, and it is performant in practice owing to the use of jets. The Hoon compiler (which compiles programs written in the higher-level language Hoon into the low-level Nock language) generates code that will be successfully pattern-matched by these jets.

With jets, it’s possible to go even lower-level than Nock. Urbit doesn’t do this because there’s not really much point in doing so, except to give a formal definition of Nock in terms of an even simpler language:

Aside: implementing Nock in the SK combinator calculus

If we’re opting for absolute simplicity, why not go the whole way? It’s possible to implement Nock in the untyped lambda calculus or in the SK combinator calculus.

It would be tempting to try the following method of encoding nouns:

  • The nonnegative integers at the leaves of noun trees can be represented using Church numerals: n is the higher-order function which maps its argument f to its n-fold functional composition, f^n.
  • The ‘cons cells’ [a b] can be represented as Church pairs.

However, unfortunately there isn’t a way to query whether a noun is an integer or a pair. As such, we’ll need to use a slightly more complicated encoding of Nock nouns where we ‘tag’ each noun with its type:

  • encode_noun([a b]) := pair(0, pair(encode_noun(a), encode_noun(b)))
  • encode_noun(n) := pair(1, church_numeral(n))

The unary functions used to define the Nock interpreter can now be themselves implemented as combinators, up to and including the Nock interpreter * itself.

For example, the question-mark operator can be implemented by taking the first element of the Church pair (which is the Church numeral 0 or 1) and then ‘boxing’ it as a Nock noun instead of a ‘bare’ Church numeral:

  • ?(x) := pair(1, first(x))

where the ‘pair’ and ‘first’ functions are defined in the Wikipedia article.

The incrementation operator needs to unbox the integer, take its successor, and then rebox it, like so:

  • +(x) := pair(1, succ(second(x)))

Technically, we’ve cheated here, because according to the Nock definition, +(x) needs to also be defined on ordered pairs as well as integers. In particular, we need to use first(x) to extract a Church numeral specifying whether x is an ordered pair or an integer and branch on that: if the Church numeral is 1, then we can just return pair(1, succ(second(x))); if the Church numeral is 0, then we need to return +(x). The self-reference can be removed from this recursive definition by means of a fixed-point combinator.
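
To make the encoding concrete, here is a small Python sketch in which ordinary lambdas stand in for combinator terms (a genuine SK implementation would eliminate the named lambdas, and the increment below omits the crash-on-cells branch discussed above):

pair   = lambda a, b: lambda f: f(a)(b)
first  = lambda p: p(lambda a: lambda b: a)
second = lambda p: p(lambda a: lambda b: b)
zero   = lambda f: lambda x: x
succ   = lambda n: lambda f: lambda x: f(n(f)(x))

def church(n):                     # convenience: the Church numeral of a Python int
    c = zero
    for _ in range(n):
        c = succ(c)
    return c

def unchurch(c):                   # convenience: read a Church numeral back
    return c(lambda k: k + 1)(0)

def encode_atom(n):                # encode_noun(n) = pair(1, church_numeral(n))
    return pair(church(1), church(n))

def encode_cell(a, b):             # encode_noun([a b]) = pair(0, pair(a, b))
    return pair(church(0), pair(a, b))

nock_cell_test = lambda x: pair(church(1), first(x))         # the '?' operator
nock_increment = lambda x: pair(church(1), succ(second(x)))  # '+' on atoms only

print(unchurch(second(nock_increment(encode_atom(41)))))     # 42
print(unchurch(second(nock_cell_test(encode_cell(encode_atom(1), encode_atom(2))))))  # 0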

The remaining operators are all recursive, so they’ll similarly require the fixed-point combinator in their implementations.

It would be a fun exercise to attempt to write a jet-propelled emulator for the SK combinator calculus, including optimisations such as internally representing Church numerals as integers (with a library such as GMP to allow arbitrarily large integers) and using jets to call the GMP implementations of arithmetic operations when appropriate.

By using the ‘compress identical subtrees’ idea, familiar from Hashlife and libraries for manipulating binary decision diagrams, we avoid excessive encoding overhead and enable constant-time recognition of (standard implementations of!) functions such as the decrement operator.


A curious construction of the Mathieu group M11

Previously, we discussed which regular polytopes have vertex-sets that occur as proper subsets of the vertex-set of another regular polytope in the same dimension. In particular, when there is a Hadamard matrix of order 4k, then the (4k−1)-dimensional simplex can be inscribed in the (4k−1)-dimensional hypercube. Moreover, the converse also holds.

Upon noticing that there exists a unique order-12 Hadamard matrix up to isomorphism, Daniel Sebald asked what the symmetry group is of the 11-dimensional simplex inscribed in the 11-dimensional hypercube. It can be shown that this is exactly the smallest sporadic group, M11, with exactly 11 \times 10 \times 9 \times 8 = 7920 elements.

Another object with a related automorphism group is the perfect ternary Golay code. This is a linear subspace of \mathbb{F}_3^{11} consisting of 729 codewords with the property that any pair of distinct codewords differ in at least 5 coordinates. (Consequently, if you’re given a codeword with ‘errors’ in at most 2 coordinates, you can uniquely recover the original codeword.)

This turns out to be no coincidence! If we label the elements of the field of three elements as {−1, 0, 1}, then exactly 24 of the 729 codewords have no ‘0’-coordinates. These 24 codewords can be regarded as a subset of the vertices of the hypercube {−1, 1}^11, and geometrically they form a pair of mutually dual 11-dimensional simplices inscribed in the hypercube! Eleven dimensions are hard to visualise, so here is a three-dimensional analogue:

Cardboard stella octangula by Joseph Myers

The automorphism group of each individual simplex is the symmetric group S12, but the automorphism group that preserves the bounding hypercube is the Mathieu group M11. If you include both simplices, the groups are doubled (C2 × S12 and C2 × M11, respectively), where the C2 factor is the centre consisting of the identity matrix and its negation.


Assorted topics

This is a digest of things that have happened this month, but which are individually too small to each warrant a separate cp4space post. Firstly, there have been a couple of exciting results in the field of combinatorics:

  • The Erdős-Faber-Lovász conjecture is proved for sufficiently large hypergraphs by Dong Yeap Kang, Tom Kelly, Daniela Kühn, Abhishek Methuku, and Deryk Osthus.
  • A special case of another conjecture by Erdős has been established by Maria Chudnovsky, Alex Scott, Paul Seymour, and Sophie Spirkl.

Also, the functional analysis paper I coauthored with Tomasz Kania has been accepted by the journal Studia Mathematica. This is my first ever article in a peer-reviewed mathematics journal, finally giving me a finite Erdős number (specifically 4)! Many thanks go to the anonymous reviewer for his or her feedback.

The article on the Stockfish NNUE architecture reached the top position on Hacker News, resulting in a disproportionate amount of cp4space traffic (indeed, more views in a single hour than is typical in an entire week).

One quadrillion objects

The Catagolue census of objects arising from random 16-by-16 soups in Conway’s Game of Life has surpassed 10^15 objects. The total number of soups explored in this census is 46 trillion. One of these soups, discovered by Dylan Chen, takes a record-breaking 52513 generations to stabilise.

The parallel GPU-assisted soup search has examined nearly four times as many soups, specifically 164 trillion soups, but due to the search methodology not every object is censused. (Specifically, only soups that last sufficiently long or produce high-period oscillators or rare spaceships are rerun on the CPU and censused; ‘boring’ soups are discarded by the prefiltering stage on the GPU.)

Recent highlights of the GPU soup search include two variants of a period-7 oscillator, one by me and the other by Rob Liston. There was also a 50093-generation soup by Liston which held the longevity record for one week before being surpassed by Dylan Chen’s aforementioned 52513-generation soup.

Taken together, the CPU and GPU searches have simulated 210 trillion random 16-by-16 soups; you can view the collective results here.

BN curves where n has low Hamming weight

We previously discussed Barreto-Naehrig curves and the problem of trying to find curves where the (prime) number of points n on the elliptic curve has a low Hamming weight.

If x is the sum of two powers of 2, the Hamming weight of n is guaranteed to be at most 35, and heuristics based on the prime number theorem suggest that there should be infinitely many such values of x for which p and n are both prime. One such value is x = 2^{250} + 2^4.

The situation is different for Hamming weights strictly below 35. Instead of a two-parameter family such as x = 2^a + 2^b, there appear to only be a finite collection of one-parameter families, and the same heuristics suggest that there are only finitely many examples. In particular, the largest such x that I could find was x = 33 \times 2^{267}, for which the Hamming weight of n is 34.

A particularly nice choice of x (in that it gives a reasonable security level without being too large) is x = 47 \times 2^{56}. The resulting values of n and p are both 252-bit primes, and the Hamming weight of n is 35. Here are the values in hexadecimal:

x = 0x2f00000000000000
p = 0xa787d240000000039081c0000000000cf180000000000011a00000000000001
n = 0xa787d240000000039081c00000000009b520000000000011a00000000000001
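
As a sanity check, these values can be reproduced from the standard BN parametrisation p = 36x^4 + 36x^3 + 24x^2 + 6x + 1 and n = 36x^4 + 36x^3 + 18x^2 + 6x + 1 (this snippet assumes sympy is available for the primality tests):

from sympy import isprime

x = 47 << 56                                  # x = 47 * 2^56 = 0x2f00000000000000
p = 36*x**4 + 36*x**3 + 24*x**2 + 6*x + 1
n = 36*x**4 + 36*x**3 + 18*x**2 + 6*x + 1     # = p + 1 - t, where t = 6x^2 + 1

print(hex(p))                                 # matches the value of p quoted above
print(hex(n))                                 # matches the value of n quoted above
print(p.bit_length(), n.bit_length(), bin(n).count("1"))   # 252 252 35
print(isprime(p), isprime(n))                 # both prime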

If you’re willing to have a smaller bit-length, then x = 17 \times 2^{43} provides a 194-bit prime where the Hamming weight of n is merely 29. Also, because x is congruent to 1 (mod 3), it follows that p is congruent to 4 (mod 9) and cube-roots can be computed efficiently in \mathbb{F}_p as described in Appendix B of the paper introducing BN curves:

x = 0x880000000000
p = 0x2de124000000565c800000006c60000000003300000000001
n = 0x2de124000000565c800000005148000000003300000000001

The security level is quite mediocre, though: it only offers 96-bit security against Pollard’s rho algorithm for discrete log.


Meagre sets and null sets

There are two competing notions for describing a subset of the real numbers as being ‘small’:

  • a null set is a subset of the reals with Lebesgue measure zero;
  • a meagre set is a countable union of nowhere-dense sets.

Both of these properties are downward-closed: an arbitrary subset of a null set is itself a null set, and an arbitrary subset of a meagre set is again a meagre set. Moreover, countable unions of meagre sets are meagre, and countable unions of null sets are null.

These two notions of a set being ‘small’ are also wholly incompatible.

In particular, there exist fat Cantor sets, nowhere-dense closed subsets of the unit interval which have positive Lebesgue measure arbitrarily close to 1. If you take a countable union of these sets (say, with Lebesgue measures of 1/2, 3/4, 7/8, 15/16, and so forth), the result is a meagre subset of [0, 1] with Lebesgue measure 1. If you take a countable union of translates of this set, each one occupying [n, n+1] for each integer n, the result is a meagre subset of the reals whose complement is a null set.
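
Here is a small sketch of one standard way to build a fat Cantor set (the Smith-Volterra-Cantor construction, which has measure 1/2; shrinking the removed intervals further gives measures arbitrarily close to 1):

from fractions import Fraction

def fat_cantor_measure(stages):
    """Measure remaining after `stages` steps: at stage n, remove an open middle
    interval of length 4^-n from each of the 2^(n-1) surviving intervals."""
    remaining = Fraction(1)
    for n in range(1, stages + 1):
        remaining -= Fraction(2**(n - 1), 4**n)
    return remaining

print([float(fat_cantor_measure(k)) for k in range(1, 6)])
# 0.75, 0.625, 0.5625, 0.53125, 0.515625 -> tends to 1/2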

Stated more succinctly, there is a meagre set A (the one we’ve just constructed) and a null set B (its complement) such that A and B are disjoint and their union is the whole real line.

Moreover, A is an F_{\sigma} set (countable union of closed sets) and B is a G_{\delta} set (countable intersection of open sets). It’s possible to prove that every meagre set is a subset of a meagre F_{\sigma} set, and likewise every null set is a subset of a null G_{\delta} set. This turns out to be an ingredient of the proof of…

The Erdős-Sierpiński duality theorem

From what has been mentioned so far, there seems to be some abstract ‘duality’ between null and meagre sets. Erdős and Sierpiński proved, conditional on the continuum hypothesis, a beautiful result that makes this precise:

There exists an involution (self-inverse bijection) f on the set of reals such that {f(x) : x in C} is null if and only if C is meagre, and {f(x) : x in D} is meagre if and only if D is null.

This involution f is highly discontinuous, being constructed using transfinite induction up to 2^{\aleph_0} (which, by assuming the continuum hypothesis, is also equal to the first uncountable cardinal \aleph_1). Shingo Saito describes the construction in detail.


Keep your public keys private

Yes, the title sounds very counterintuitive. After all, don’t digital signature schemes require the general public to know your public key so that they can verify your signatures?

That is correct, but importantly they don’t need to know your public key until the very moment that you actually want to sign something. Instead, you can publish a hash of your public key long beforehand, and only publish your public key at the moment you digitally sign the message. The verifier then checks that the digital signature is valid using that public key, and then also checks that the public key is consistent with the hash you published.

The pseudonymous inventor of Bitcoin, Satoshi Nakamoto, must have realised this when he or she wrote the whitepaper. A Bitcoin address (to which you send coins) is a RIPEMD-160 hash of a SHA-256 hash of the elliptic curve public key. Importantly, this hash is not the same thing as the public key itself.

Why does this matter? It turns out that the hash offers a much stronger level of security than the elliptic curve:

In particular, if an attacker knows only your Bitcoin address, then there’s no easier way for an attacker to steal your coins than by brute-force: generate a random private key, derive the public key, hash it to obtain an address, and see if it matches. It would take an expected 2^160 iterations of this approach to steal your coins.

Note that the attacker will almost certainly end up with a different private key from the one you created, but it will ‘coincidentally’ be able to unlock the coins in the same address. This is because the 160-bit space of addresses is 2^96 times smaller than the 256-bit space of elliptic curve keys.

What if an attacker knows your elliptic curve public key? Then, using the Pollard rho algorithm, it ‘only’ takes on the order of 2^128 iterations to determine your private key. That’s still humongous; even with all of the computing power currently available on the planet, it would take millions of years to reverse-engineer your private key.

So why should you be concerned? There are two reasons:

  • There’s a polynomial-time quantum algorithm for breaking elliptic curve discrete logarithm. It uses the same ‘period-finding’ subroutine as Shor’s factorisation algorithm. It’s still far beyond current quantum computing technology, with the best published upper bound requiring 128 billion Toffoli gates to break a 256-bit elliptic curve discrete logarithm, but quantum computing progress has been accelerating in recent years.
  • More of a concern is that Pollard’s rho algorithm might not be the best classical algorithm for solving the elliptic curve discrete logarithm problem. Prime factorisation, for example, became much easier in the 1990s when the General Number Field Sieve was developed (from an idea of Pollard). It’s plausible that a secret governmental organisation has a faster method of cracking elliptic curve discrete logarithm.

If you’re unconvinced that this second reason is even remotely plausible, and strongly believe that Pollard rho is obviously the best possible algorithm for solving elliptic curve discrete logarithm, then you should ask yourself why serious cryptographers such as Joe Silverman are bothering to develop alternative methods.

(Before you dismiss this as an argumentum ad verecundiam, note that this is not purporting to be an argument that elliptic curve discrete logarithm is definitely insecure. Rather, it’s an argument that there’s no known proof that it is secure, because if such a proof did exist, then there would have been no point in Silverman proposing an approach that is guaranteed to fail.)

So, 2^128 iterations should be regarded as ‘an upper bound, which is tight assuming that there will be no academic, industrial, or governmental progress on finding faster algorithms’. And if a large-scale fault-tolerant quantum computer is developed, be very afraid…

On the other hand, cryptographic hash functions such as SHA-256 and RIPEMD-160 (both of which are composed to derive the Bitcoin address from the public key) are designed to thoroughly mush the input in as chaotic a manner as possible*. They have no nice mathematical structure by design, so it’s very unlikely that there’s a better approach than the 2^160-iteration brute-force algorithm described above.

*that’s not to say that hash functions are just kitchen sinks of arbitrary operations. They’re still very carefully designed to have good dispersion properties, be resistant to a bunch of different cryptographic attacks, produce statistically random output, and achieve these goals efficiently (in terms of speed in software/hardware implementations).

The take-home message

To reiterate:

  • if you share your public key, then your security level is “hopefully 128 bits, but maybe some organisation has found a faster method and is keeping it quiet”;
  • if you don’t share your public key until you absolutely need to (when signing a transaction), then your security level is “almost definitely 160 bits”.

You should feel much more confident if your Bitcoin is in an address where the world doesn’t know your public key. This means that whenever you spend any Bitcoin from an address (inevitably revealing the public key in the process), you should empty the entire address and send the remaining balance to a fresh unused address.

You’re still revealing your public key, but only very briefly: as soon as the block containing that transaction is confirmed, it becomes incredibly computationally difficult to rewrite the history. In essence, it only gives an evil adversary a maximum of a few minutes to try to break the discrete logarithm problem.

Many wallet implementations create fresh addresses for every transaction, so my recommendation is to use one of those (e.g. the Trezor hardware wallet).

Since ‘not reusing addresses’ is already known to be best practice, you might be tempted to ask:

Is this advice really necessary?

Apparently so.

At the time of writing, someone has 94258 Bitcoin* in a regular (pay-to-public-key-hash) address which has revealed its public key. So, if you are reading this and are the owner of 1P5ZEDWTKTFGxQjZphgWPQUpe554WKDfHQ, then I’d recommend moving the balance to a fresh address imminently.

*that’s about 3 billion dollars.

For example, if you look at one of the recent outgoing transactions, the ‘sigscript’ is the following:

473044022100ff4fa243701eea0bdba8615a8bc4f01df387
aae215a90fe382eaf5405ee0ad73021f6f814e02b88e1fbb
1b0d5099970da6cf426ef58027c4b99b1afb633bb5b62801
21024b83426cf9bff257261d87f2f2858b51b2eea756c0123c7e05bc0a007425c9f2

The last line here (68 hexadecimal characters, i.e. 34 bytes) contains 0x21 (meaning ’33’) followed by the 33-byte public key (a point on the elliptic curve). That is to say, there’s currently 3 billion dollars resting in an oft-reused Bitcoin address with an exposed public key, protected only by the difficulty of the elliptic curve discrete logarithm problem.
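
For completeness, here is how the address hash is recomputed from that public key in Python; note that ripemd160 support in hashlib depends on the local OpenSSL build, and base58check-encoding the 20-byte result with version byte 0x00 should reproduce the address quoted above:

import hashlib

def hash160(pubkey_bytes):
    """RIPEMD-160 of SHA-256 of the public key (the hash behind a P2PKH address)."""
    sha = hashlib.sha256(pubkey_bytes).digest()
    return hashlib.new("ripemd160", sha).digest()   # requires OpenSSL ripemd160 support

# the 33-byte compressed public key from the sigscript above
pub = bytes.fromhex(
    "024b83426cf9bff257261d87f2f2858b51b2eea756c0123c7e05bc0a007425c9f2")
print(hash160(pub).hex())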

That’s a very large and unnecessary bet to be making against the collective innovative capability of the world’s algebraic geometers and cryptographers…


The neural network of the Stockfish chess engine

Last time, we briefly mentioned the high-level differences between Stockfish and Leela Chess.

To recap, Stockfish evaluates about 100 million positions per second using rudimentary heuristics, whereas Leela Chess evaluates 40 000 positions per second using a deep neural network trained from millions of games of self-play. They also use different tree search approaches: Stockfish uses a variant of alpha-beta pruning, whereas Leela Chess uses Monte Carlo tree search.

An important recent change to Stockfish was to introduce a neural network to evaluate the positions in the search tree, instead of just relying on hardcoded heuristics. It’s still much simpler than Leela Chess’s neural network, and only slows down Stockfish to exploring 50 million positions per second.

The real cleverness of Stockfish’s neural network is that it’s an efficiently-updatable neural network (NNUE). Specifically, it’s a simple feedforward network with:

  • a large (10.5M parameters!) input layer, illustrated below, that can utilise two different levels of sparsity for computational efficiency;
  • three much smaller layers (with 17.5k parameters in total) which are evaluated densely using vector instructions;
  • a single scalar output to give a numerical score for the position, indicating how favourable it is for the player about to move.

Everything is done using integer arithmetic, with 16-bit weights in the first layer and 8-bit weights in the remaining layers.

The input layer

Let’s begin by studying the first — and most interesting — layer. Here’s an illustration I made using Wolfram Mathematica:

The inputs to the layer are two sparse binary arrays, each consisting of 41024 elements. It may seem highly redundant to encode a chess position using 82048 binary features, but this is similar to an approach (called ‘feature crosses’) used in recommender systems.

What are the two sparse binary arrays, and why do they have 41024 features? My preferred way of interpreting these two arrays are as the ‘worldviews’ of the white king and the black king. In particular, the differences are:

  • Coordinate systems: because black and white pawns move in opposite directions, the two players need to ‘see the board from opposite angles’. This already happens in regular (human) chess because the two players are seated opposite each other at the chessboard, so one player’s ‘top’ is the other player’s ‘bottom’. If you imagine each player numbering the squares from top to bottom, left to right, then any physical square would be called n by one player and 63 − n by the other player.
  • Piece types: instead of viewing the pieces as black or white, a player sees it as either ‘mine’ or ‘theirs’. The 10 non-king piece types are thus {my pawn, their pawn, my knight, their knight, my bishop, their bishop, my rook, their rook, my queen, their queen}.

The reason for these ‘player-relative coordinate systems’ is that it means that Stockfish can use the same neural network irrespective of whether it’s playing as white or black. The network uses both your king’s worldview and your enemy king’s worldview for evaluating a position, because they’re both highly relevant (you want to protect your own king and capture your enemy’s king).

So, why does each worldview have 41024 features? It can be seen as an outer product (or tensor product) of:

  • a 64-element feature encoding the position of the king whose worldview this is, in their own coordinate system. This is ‘one-hot encoding’, where exactly one of the 64 entries is ‘1’ and the other 63 entries are ‘0’.
  • a 641-element feature encoding, for each of the 64 × 10 ordered pairs (square, piece-type), whether or not that square is occupied by that piece. The 641st element is unused, and is (according to the Chess Programming Wiki) apparently a result of the network being ported from Shogi to chess.

Each of the two 64-by-641 outer product matrices* is then flattened (the matrix is ‘reshaped’ into a vector with the same entries) to yield the corresponding 41024-element ‘sparse worldview’. In the input layer, each of the two 41024-element sparse worldviews are then affinely transformed to form a 256-element ‘dense worldview’.

*Important note: the 41024-element sparse binary arrays are never explicitly materialised, either as a 64-by-641 matrix or as a 41024-element vector. The Stockfish NNUE effectively ‘fuses’ the construction of these sparse vectors with the subsequent affine transformation (described below), updating the 256-element dense worldviews directly when the configuration of pieces on the chessboard is modified.

There are two levels of sparsity which are utilised when computing this affine transformation from \mathbb{R}^{41024} to \mathbb{R}^{256}, allowing the network to be efficiently evaluated many times in a tree search:

  • the 41024-element implicit vectors are themselves sparse: the number of nonzero elements is equal to the number of non-king pieces on the board.
  • moving a piece typically changes very few of the entries of the vector: if it’s a regular non-king move, only 2 entries change; if it’s a non-king move with capture, then 3 entries change.

It’s this second aspect which warrants the name ‘efficiently updatable’: when a move is made (or unmade, since we’re doing a tree search), we only need to add/subtract a few 256-element matrix columns from the resulting ‘dense worldview’ to update it.

Unless a king is moved, this (2 or 3 vector additions/subtractions) beats summing all of the matrix columns corresponding to nonzero entries (up to 30 vector additions), which in turn unconditionally beats doing a regular dense matrix-vector multiplication (41024 vector additions). That is to say, the second-level sparsity is about 10 times more efficient than the first-level sparsity, which is in turn about 1000 times more efficient than naively doing a dense matrix-vector multiplication.
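
Here is a schematic NumPy sketch of the accumulator logic (toy random weights and hypothetical feature indices of mine; the real engine fuses this with its board representation):

import numpy as np

N_FEATURES, N_OUT = 41024, 256
rng = np.random.default_rng(0)
W = rng.integers(-64, 64, size=(N_FEATURES, N_OUT), dtype=np.int16)   # 16-bit weights
b = rng.integers(-64, 64, size=N_OUT, dtype=np.int16)                 # biases

def refresh(active):
    """Full recomputation: add the weight rows of every active (nonzero) feature."""
    return b + W[sorted(active)].sum(axis=0, dtype=np.int16)

def update(acc, removed, added):
    """Efficient update after a move: only the few changed rows are touched."""
    return acc - W[sorted(removed)].sum(axis=0, dtype=np.int16) \
               + W[sorted(added)].sum(axis=0, dtype=np.int16)

# a quiet (non-capture, non-king) move changes exactly two (square, piece) features
acc = refresh({11, 7321, 20000, 35007})
assert np.array_equal(update(acc, removed={7321}, added={7350}),
                      refresh({11, 7350, 20000, 35007}))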

The two dense worldviews are concatenated according to which player is about to move, producing a 512-element vector, which is elementwise clamped to [0, 127]. This elementwise clamping is the nonlinear activation function of the input layer, and (as we’ll describe) the hidden layers use a similar activation function. We can think of this as a ‘clipped ReLU’, which is exactly what the Stockfish source code calls it.

The remaining layers

The two hidden layers each use 8-bit weights and 32-bit biases. The activation function first divides the resulting 32-bit integer by 64 before again clamping to [0, 127], ready to be fed into the next layer. The output layer also uses 8-bit weights and 32-bit biases, but with no nonlinear activation function.

The first hidden layer takes 512 inputs (the clamped concatenated worldviews) and produces 32 outputs. The second hidden layer takes those 32 values as inputs, and again produces 32 outputs. The output layer takes those 32 values as inputs, and produces a single scalar output.
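
And here is a schematic (not bit-exact) sketch of the dense tail, using the integer arithmetic described above; the weights below are random stand-ins rather than trained parameters:

import numpy as np

def crelu(v):                            # clipped ReLU: clamp to [0, 127]
    return np.clip(v, 0, 127)

def hidden(v, W, b):                     # 8-bit weights, 32-bit biases; divide by 64
    return crelu((v.astype(np.int32) @ W.astype(np.int32) + b) // 64)

def nnue_tail(own_view, other_view, layers):
    """layers = ((W1, b1), (W2, b2), (Wo, bo)); W1 is 512x32, W2 is 32x32, Wo has length 32."""
    (W1, b1), (W2, b2), (Wo, bo) = layers
    v = crelu(np.concatenate([own_view, other_view]))            # 512-element input
    v = hidden(v, W1, b1)                                        # 512 -> 32
    v = hidden(v, W2, b2)                                        # 32 -> 32
    return int(v.astype(np.int32) @ Wo.astype(np.int32) + bo)    # 32 -> 1, no activation

rng = np.random.default_rng(1)
layers = ((rng.integers(-16, 16, (512, 32), dtype=np.int8), rng.integers(-500, 500, 32)),
          (rng.integers(-16, 16, (32, 32), dtype=np.int8),  rng.integers(-500, 500, 32)),
          (rng.integers(-16, 16, 32, dtype=np.int8),        7))
print(nnue_tail(rng.integers(0, 200, 256), rng.integers(0, 200, 256), layers))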

Since these subsequent layers are applied to dense vectors, they can’t use the same ‘efficiently updatable’ approach as the input layer; that’s why they’re necessarily substantially smaller. They can, however, use hardware vectorisation instructions (SSE/AVX) to apply the linear transformation and activation function.

This scalar output is then further postprocessed using other Stockfish heuristics, including taking into account the 50-move rule which isn’t otherwise incorporated into the evaluation function.

Observe that the neural network doesn’t actually have complete information, such as whether a pawn has just moved two squares (relevant to en passant), whether a king is able to castle, and various other information pertaining to the rather complicated rules for determining a draw. This is okay: the network is only being used as a cheap approximate evaluation for a position; when deciding what move to make, Stockfish performs a very deep tree search and only uses this approximate evaluation in the leaves.

Equivalently, you can think of this as being a massive refinement of the approach of ‘look a few moves ahead and see whether either player gains a material or positional advantage’, using a neural network as a much more sophisticated position-aware alternative of crudely ‘counting material’.

This is a neural network, so the weight matrices and bias vectors of the layers were learned by training the network on millions of positions and using backpropagation and a gradient-based optimiser. Of course, for supervised learning, you need a ‘ground truth’ for the network to attempt to learn, which seems somewhat circular: how do you even gather training data? The answer is that you use the classical version of Stockfish to evaluate positions using the deep tree search, and use that as your training data.

In theory, you could then train another copy of the NNUE using the NNUE-enhanced Stockfish as the evaluation function, and then iterate this process. Leela Chess does the same thing: its current network is trained on positions evaluated by using deep lookahead with the previous network as the leaf evaluation function. Note that the construction of the training data is orders of magnitude more expensive than training the network with this data, as you’re doing thousands of evaluations of the previous network (owing to the deep tree search) to construct each item of training data for the new network.

Further reading

The network is described and illustrated on the Chess Programming Wiki, which also has tonnes of references to forum discussions and other references. The first description of an NNUE was a Japanese paper by Yu Nasu (who suggested it for the board game Shogi instead of chess); the paper has since been translated into English and German. There’s also the Stockfish source code, which is very well organised (there’s a directory for the NNUE) and clearly written.
