A combinatorial proof of Houston’s identity

Robin Houston recently discovered a rather interesting formula for the determinant of an n-by-n matrix. In particular, the formula improves upon the best known upper bound for the tensor rank of the determinant (viewed as a multilinear map which takes n vectors of dimension n — the rows of the matrix — and outputs the determinant).

Houston also provided a Mathematica notebook which allows you to verify it for a given finite n (fast for n ≤ 6; very slow for larger n). The linked example shows n = 5, for which there are 52 nonzero terms as opposed to the 120 in the standard Laplace expansion of the determinant.

The sum ranges over all partial equivalence relations (PERs) on the set of n elements. The sign of a partial equivalence relation, sgn(~), is the product, over all equivalence classes S, of (−1)^(|S|+1). The size of a partial equivalence relation, |[∼]|, is simply the number of equivalence classes.

The partial equivalence relations containing singleton classes each contribute 0 to the sum, so these terms can be neglected; the remaining terms are in bijective correspondence with the set of all equivalence relations on the n elements, which are counted by the nth Bell number. As each term is a product, over all rows, of linear functions of each row, this establishes that the tensor rank of the determinant is upper-bounded by the nth Bell number.

Houston discovered this identity by thinking geometrically in 3 dimensions and then generalising the result; we shall present a combinatorial proof of the same identity. In particular, we show that the expanded formula simplifies to the usual signed sum over all permutations, and do so without involving any geometrical or linear-algebraic properties of the determinant.
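Before the proof, the identity can be sanity-checked numerically for small n. The sketch below is not Houston's notebook. The explicit row factors are an assumption, reconstructed from the constraints on monomials described in Part I: a row i outside the domain of ~ contributes a_{i,i} plus the entries a_{i,j} with j in the domain, and a row i inside the domain contributes the entries a_{i,j} with j ~ i and j ≠ i. The names set_partitions, houston_det and leibniz_det are hypothetical.

```python
from itertools import permutations
from math import factorial, prod

def set_partitions(items):
    """Yield every partition of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partition

def houston_det(a):
    """Sum over PERs without singleton classes (assumed form of the identity)."""
    n = len(a)
    total = 0
    for partition in set_partitions(list(range(n))):
        # Singleton blocks stand for elements outside the domain of the PER.
        classes = [block for block in partition if len(block) >= 2]
        domain = {i for block in classes for i in block}
        class_of = {i: block for block in classes for i in block}
        sign = prod((-1) ** (len(block) + 1) for block in classes)
        term = sign * factorial(len(classes))
        for i in range(n):
            if i in domain:
                term *= sum(a[i][j] for j in class_of[i] if j != i)
            else:
                term *= a[i][i] + sum(a[i][j] for j in domain)
        total += term
    return total

def leibniz_det(a):
    """Reference determinant: signed sum over all permutations."""
    n = len(a)
    total = 0
    for p in permutations(range(n)):
        inversions = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        total += (-1) ** inversions * prod(a[i][p[i]] for i in range(n))
    return total
```

Counting the partitions produced for n = 5 gives the 52 terms mentioned above, since each set partition corresponds to exactly one PER without singleton classes.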

Part I: multiplicity of non-permutations

Observe that if you fully expand Houston’s identity, the resulting monomials are of the form a_{1, f(1)} a_{2, f(2)} \cdots a_{n, f(n)} where f is some function from {1, 2, …, n} to itself.

Also, from the definition, these functions have the property that if x is a fixed point of f, then there are no other elements y such that f(y) = x. This is a strictly weaker condition than injectivity, so some of these monomials that appear in the expansion do not belong in that of the determinant; we shall therefore need to show that these monomials appear with a coefficient of 0.

Given such a function f, which terms of Houston’s identity contain the corresponding monomial when expanded out? Let’s visualise f as a directed graph on the vertex-set {1, 2, …, n} where each vertex has exactly one outgoing edge and introduce the following terminology:

  • Fixed point: a vertex x with f(x) = x (i.e. a self-loop);
  • Leaf: a vertex x such that there are no incoming edges, i.e. no y such that f(y) = x;
  • Nonleaf: any vertex that is neither a leaf nor a fixed point.

Given such an f, we can characterise precisely the partial equivalence relations ~ which, when viewed as terms in Houston’s identity, give rise to the monomial corresponding to f. In particular, they are those partial equivalence relations ~ which satisfy the following properties:

  • If x is a fixed point, then it does not appear in any equivalence class.
  • If x and y are nonleaves belonging to the same connected component of the graph, then x ~ y.
  • If x is a leaf, then either x ~ f(x) or x does not appear in any equivalence class.

Note that any connected component (other than a self-loop) must contain at least two nonleaves: if x is any vertex in the component, then it follows from the constraints on f that f(x) and f(f(x)) are two distinct nonleaves. Consequently, we can describe a compatible PER with an ordered pair of:

  • An equivalence relation R on the set of nontrivial connected components of the graph corresponding to f;
  • A (boolean-valued) indicator function ι on the set of leaves of f which specifies which leaves belong to equivalence classes in the PER.

The first component here determines the size of ~ (it’s the same as the size of R). Fixing such an equivalence relation R, the sign of ~ depends on the parity of the number of leaves l such that ι(l) = 1. In other words, if there are any leaves at all, then toggling ι on one chosen leaf pairs up the compatible PERs into pairs of equal size and opposite sign, so we have an equal number of positive and negative terms of each size, and they cancel out perfectly.

Part II: multiplicity of permutations

As such, we’ve established that the only monomials that appear with nonzero coefficients are indeed the ones corresponding to permutations! It remains to show that the coefficients are correct, but it means that the analysis is much simpler because we can henceforth assume that f is a permutation. There are no leaves at all, and the nontrivial connected components are cycles.

Letting C be the set of nontrivial cycles, recall that we have a PER ~ corresponding to each equivalence relation R on C. The corresponding term in Houston’s identity has a coefficient of:

|[\sim]|! \textrm{ sgn}(\sim) = |[R]|! \textrm{ sgn}(R) \textrm{ sgn}(f)

where sgn(f) is the sign of the permutation f. Summing over all such R, we get that the overall coefficient of the monomial corresponding to f is:

\textrm{ sgn}(f) \sum_R |[R]|! \textrm{ sgn}(R)

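We want to show that this simplifies to sgn(f). Writing n for the number of cycles in C (rather than the size of the matrix) and grouping the equivalence relations R by their number k of classes, so that sgn(R) = (−1)^(n−k), we can rewrite it using Stirling numbers of the second kind: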

\textrm{ sgn}(f) \sum_{k=1}^n k! (-1)^{n-k} S(n, k)

This sum is the alternating sum of the number of faces of each dimension in a solid permutohedron, so it is equal to its Euler characteristic, which is 1 by contractibility. (There’s probably a more direct way to show this using inclusion-exclusion.) As such, it does indeed simplify to sgn(f), thereby establishing the validity of Houston’s identity.
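The Stirling-number sum can also be checked directly for small n. This is only a numerical sanity check under the standard recurrence for S(n, k), not a replacement for the argument above; the function names are hypothetical.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, via S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def alternating_sum(n):
    """Sum over k = 1..n of k! * (-1)^(n-k) * S(n, k)."""
    return sum(factorial(k) * (-1) ** (n - k) * stirling2(n, k)
               for k in range(1, n + 1))

for n in range(1, 9):
    print(n, alternating_sum(n))
```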

Asymptotics and further discussion

As discussed, Houston’s identity establishes an upper bound of B_n (the nth Bell number) for the tensor rank of the determinant. This asymptotically saves a multiplicative factor of:

\sqrt{2 \pi n} (\log n)^n

over the Laplace expansion, which is a substantial (superexponential!) improvement. The previous state-of-the-art appears to be:


which is merely an exponential improvement over the Laplace expansion.

For practical calculation of large determinants over a field, it is far more efficient (polynomial time instead of superexponential time) to perform Gaussian elimination to reduce the matrix into an upper-triangular form and then simply take the product along the diagonal, so these asymptotics are less interesting in practice.

However, there still may be practical value in using this algorithm for small determinants, especially in applications where a branch-free algorithm is desired and multiplications are expensive. For example, it gives an 8-multiplication formula for a 3-by-3 determinant, instead of the 9 from the Laplace expansion, although the same author later discovered a simpler 8-multiplication formula which uses fewer additions/subtractions.

Can a formula be found using fewer than 8 multiplications? Even though the tensor rank for the 3-by-3 determinant is known to be 5, which implies that any ‘multilinear’ formula costs at least 8 multiplications, there may be a nonlinear formula which accomplishes the task in fewer multiplications.

This may seem paradoxical, but there’s an analogous situation in fast matrix multiplication: the best known upper bound for the tensor rank of 4-by-4 matrix multiplication is 49, by applying two levels of Strassen’s algorithm, but there is a little-known method by Winograd for multiplying two 4-by-4 matrices over a commutative ring using only 48 multiplications.


Tetrational machines

A pair of people called Pavel have independently developed remarkable automata that last record-breakingly long before halting. In both cases, the number of timesteps that it takes for each automaton to halt is so large that it cannot be written down except using iterated exponentiation. Iterated exponentiation is known as ‘tetration’, so we shall describe these automata as ‘tetrational’.

Busy Beaver record

The first one of these is a 6-state 2-symbol Turing machine, discovered by Pavel Kropitz last month, which takes approximately:


timesteps to halt (where exponentiation is right-associative, so the rightmost exponentiation is applied first). This is a huge improvement over anything known before May 2022; prior to that, the record was small enough that you could explicitly write down the digits of the number of timesteps (a 36535-digit number).

The lifespan of the new record-holder cannot be stored in floating-point format, but can easily be stored and manipulated in a format called level-index notation. Robert Munafo’s HyperCalc calculator uses this format internally, which is how I was able to get the above approximation from the exact formula:


for the number of ‘1’s left on the tape when the machine halts. This is a lower bound on the number of timesteps that the Turing machine performs, and is tight in the sense that it doesn’t affect the leading digits in the ‘4.023873729…’ at the top of the power-tower.
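To illustrate why a power-tower representation is convenient for such numbers, here is a simplified sketch that rewrites a number as a tower of 10s topped by a real in [1, 10). It is not HyperCalc's internal format and not level-index notation proper (which is usually defined with base e); the function name tower_form is hypothetical.

```python
import math

def tower_form(n):
    """Return (height, top) such that n = 10^10^...^top with `height` tens
    and 1 <= top < 10. Assumes n >= 1."""
    height, top = 0, n
    while top >= 10:
        top = math.log10(top)  # math.log10 accepts arbitrarily large Python ints
        height += 1
    return height, top

# A number of roughly the size of the pre-2022 record (about 36535 digits)
# needs only two levels:
print(tower_form(10 ** 36534))
```

Numbers like the new record cannot be fed into this function directly, since they cannot even be stored as integers; a calculator for them has to operate on the (height, top) pair itself.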

Shawn Ligocki (who has also held the 6-state 2-symbol Busy Beaver record on a number of occasions, including for two brief 3-day stretches last month!) has written an excellent explanation of how the lifespan of Pavel’s Turing machine was computed.

Compact diehard

In the previous month, Pavel Grankovskiy built a configuration in Conway’s Game of Life which initially fits inside a 104-by-96 bounding box and takes approximately:


timesteps to cleanly and completely self-destruct. Here is the configuration, colour-coded by the author to indicate the various parts of the machinery:

The symmetrical object in the southwest corner is a spaceship which travels to the southwest at a speed of c/5 (slightly slower than that of the glider, which travels at c/4). Each side of the spaceship catalyses a ‘sawtooth’ mechanism, where a glider reaching the spaceship is converted into a block which is pulled by subsequent glider pairs.

The lower sawtooth is driven by a continuous gun, causing the blocks to materialise at exponentially sparse intervals (each cycle of materialising a block and pulling it to the origin takes 121 times longer than the previous cycle). The upper sawtooth is driven by the exponentially sparse output of the lower sawtooth, causing the blocks to materialise at tetrationally sparse intervals (each cycle being roughly 121 to the power of the previous cycle). Each time this happens, one of the neon-green objects at the far left of the pattern is deleted, until eventually activating a self-destruction sequence.

The clean self-destruction is accomplished by the salmon and orange objects: the salmon objects were manually placed by the author and ensure that the spaceship is cleanly destroyed from behind; the orange cells were placed by an automated search program which uses beam search to find seeds of destruction for a given input pattern.

Subsequently, various authors have managed to optimise this pattern, adding extra layers to the power tower whilst ensuring that the initial configuration fits within a bounding box with area strictly less than 10000 cells. It seems that the current record-holder has a tower of fourteen 10s followed by a real in the interval [1, 10), so it’s one level below the other Pavel’s Turing machine (not that it’s at all reasonable to compare the lifespan of a 6-state 2-symbol Busy Beaver with that of a sub-10000-cell diehard).

Brett Berger has written an article which discusses the design and history of Pavel’s diehard (there was an earlier design with a single sawtooth mechanism which lasted just over 10^870 timesteps, later optimised by Dean Hickerson up to 10^1595 timesteps).

Brett’s article also discusses a tetrational automaton of his own in a puzzle game called Opus Magnum. That has a lifespan of approximately 10↑↑41 timesteps in Knuth’s up-arrow notation: that is to say, a power-tower of 41 copies of the number 10. This contraption contains two organs called recursive filters, which perform the same duty as the sawtooth mechanisms in Pavel’s diehard.

Beyond tetration

Stacking more recursive filters corresponds to adding more up-arrows. Whilst the recently discovered 6-state 2-symbol Turing machine doesn’t seem to be obviously generalisable in this way, there have been other Turing machines which have taken advantage of this idea for more than a half-century: in 1964, Milton Green described a family of Turing machines for the busy beaver problem, where the kth machine has 2k + 4 states and takes more than 3 \uparrow^k 3 steps to halt (with k up-arrows). In particular, the 10-state machine (k=3) takes more than 3↑↑7625597484987 steps to halt, vastly longer than the other automata that we’ve discussed so far.
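The figure 7625597484987 is just 3↑↑3, the height of the tower in 3↑↑↑3 = 3↑↑(3↑↑3). A minimal sketch of Knuth's up-arrow recursion, usable only for tiny arguments (the function name up_arrow is hypothetical):

```python
def up_arrow(a, k, b):
    """Knuth's a ↑^k b for non-negative integers; feasible only for tiny inputs."""
    if k == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, k - 1, up_arrow(a, k, b - 1))

print(up_arrow(3, 2, 3))  # 3↑↑3 = 3^(3^3)
```

Evaluating up_arrow(3, 3, 3) itself is hopeless: it would be a power-tower of 7625597484987 threes.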

The next barrier to break is to progress from these primitive-recursive functions to the Ackermann function (level ω in the fast-growing hierarchy). For the contraptions in Conway’s Game of Life and Opus Magnum, this would involve building an organ which can repeatedly build and stack additional recursive filters.

The researchers with usernames Deedlit and Wythagoras have built Turing machines which achieve levels ω (‘Ackermannian growth’) and ω+1 (‘expandal growth’), including an explicit 15-state machine which consumes the output of the 5-state Busy Beaver record-holder (which leaves 2046 copies of a particular motif on its tape) and uses the other 10 states to boost this to more than 2 \uparrow^{2046} 4 timesteps.

Further beyond this is level ε0 in the fast-growing hierarchy, a function that grows so quickly that it cannot be proved total in Peano arithmetic. Wythagoras found an 85-state machine based on the Kirby-Paris hydra that takes more than f_{\epsilon_0}(1907) timesteps to terminate.


Infinitely many rational dodecahedra

Thomas Blok and David Madore have recently made significant progress on the problem of finding rational dodecahedra inscribed in the unit sphere, culminating in an infinite parametric family of solutions.

In particular, Thomas began with the constrained version of the problem that I used to find the first solution: namely where the dodecahedron has an order-8 symmetry group generated by reflections in three orthogonal planes. After stereographic projection, we are left with the following picture:

These solutions are parametrised by six positive rational variables, {a, b, c, d, x, y}, as shown in the image above. Thomas made the observation that if we fix {c, d, x, y} and draw the three green circles, then if they intersect at a common point (a, b), that common point must necessarily be rational.

Consequently, the problem becomes much simpler:

Find positive rationals (x, y, c, d) with y < x < sqrt(c² + d²) such that the circumcircles of the three triangles:

  • {(0, −y), (0, y), (x, 0)};
  • {(0, y), (−c, d), (c, d)};
  • {(x, 0), (c, d), ((c² + d²)/x, 0)};

all mutually intersect at a single point.

David Madore responded to Blok’s MathOverflow question and found a degree-18 polynomial equation in 18 variables expressing the cointersection of three circles defined by 9 points. Madore’s polynomial has 27 873 486 terms, so it is rather unwieldy to manipulate.

With the additional constraint that the 9 points form a 3-by-3 symmetric matrix (thus there are only 12 distinct variables), as is the case for the dodecahedron problem, Madore discovered that the degree-18 polynomial has only 1 980 078 terms and factors as a product of three ‘boring’ degree-4 factors (corresponding to degenerate cases where some points coincide) and an ‘interesting’ degree-6 factor with 720 terms.

Blok substituted the coordinates from the dodecahedron problem into the degree-6 polynomial, obtaining a degree-8 polynomial equation in the variables {x, y, c, d}. Blok observed that it is again reducible, with ‘boring’ factors cy(c^2 + d^2 - x^2) and a degree-4 ‘interesting’ factor that can be written as:

xy(cy + dx) + (c^2 + d^2)(cx + dy - x^2 - y^2) = 0

This is homogeneous, because global scale factors do not affect whether a solution is valid or not. As such, fixing one variable (such as setting y = 1) merely fixes the global scale factor. On the other hand, fixing a second variable (such as setting x = 4) does remove a degree of freedom from the solution space. When fixing x = 4 and y = 1, as suggested, we obtain the following equation in the remaining variables:

c^2 d+4 c^3-17 c^2+4 c d^2+4 c+d^3-17 d^2+16 d = 0

Observe that this is a cubic equation rather than a quartic equation. (The same is true if we simultaneously fix c and d instead of x and y.) As such, it has an elliptic curve group law, and we can therefore construct new solutions from existing solutions:

The uppermost point here is the rational solution that was found by computer search. The remaining points in the diagram are generated from that point together with the point at infinity (which, unlike in Weierstrass curves, appears not to be a neutral element). Proceeding in this manner, we can produce infinitely many solutions with fixed x and y provided that we have at least one solution that is not a torsion point on the elliptic curve.

The point at infinity

We observed that the point at infinity does not appear to be a neutral element, so perhaps we can use that to generate solutions without needing an initial solution. In particular, this should work with any choices of x and y.

Note that the equation has two terms:

xy(cy + dx) + (c^2 + d^2)(cx + dy - x^2 - y^2) = 0

We can divide both terms by c^2 + d^2 to obtain an equivalent equation:

\frac{xy(cy + dx)}{(c^2 + d^2)} + (cx + dy - x^2 - y^2) = 0

The first term tends to zero as we approach infinity; the remaining terms are linear in c and d. Consequently, the equation after discarding the first term…

cx + dy - x^2 - y^2 = 0

…is precisely the equation of the asymptote λ of the elliptic curve! Intersecting this with the line at infinity gives the point at infinity ∞ on the elliptic curve, at which the line λ is tangent. However, λ must also intersect the curve at another point, P, satisfying P + ∞ + ∞ = 0 (in the elliptic curve group law).

The point P is precisely the point where both terms in the elliptic curve equation vanish:
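Explicitly, both terms vanish when cy + dx = 0 and cx + dy = x^2 + y^2. Solving this pair of linear equations for c and d (a short derivation from the equation above, using the standing assumption y < x) gives:

P = (c, d) = \left( \dfrac{x \left(x^2+y^2\right)}{x^2-y^2}, \; -\dfrac{y \left(x^2+y^2\right)}{x^2-y^2} \right)

so the first coordinate is positive and the second is negative.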

This is exciting, because it gives us a rational point on the correct component (the unbounded one) of the elliptic curve. Unfortunately, one of the two coordinates is negative, so this is not a valid solution to the dodecahedron problem (if you apply stereographic projection then some of the dodecahedron’s pentagonal faces intersect each other).

However, we can repeat the strategy of drawing a tangent line at this point and letting it reintersect our elliptic curve. This gives the following solution:

c = \dfrac{x \left(x^2+y^2\right) \left(2 x^2 y^2+x^4+5 y^4\right) \left(-3 x^6 y^2-2 x^4 y^4-3 x^2 y^6+x^8-y^8\right)}{(x-y) (x+y) \left(11 x^{10} y^2+11 x^8 y^4+18 x^6 y^6+11 x^4 y^8+11 x^2 y^{10}+x^{12}+y^{12}\right)}

d = \dfrac{y \left(x^2+y^2\right) \left(2 x^2 y^2+5 x^4+y^4\right) \left(3 x^6 y^2+2 x^4 y^4+3 x^2 y^6+x^8-y^8\right)}{(x-y) (x+y) \left(11 x^{10} y^2+11 x^8 y^4+18 x^6 y^6+11 x^4 y^8+11 x^2 y^{10}+x^{12}+y^{12}\right)}

which is positive when x ≥ 1.9397y and has a simple expression for the norm:

c^2 + d^2 = \dfrac{(x^2 + y^2)^3}{(x^2 - y^2)^2}

which, in particular, implies that the points (x, 0) and (0, y) are definitely inside the circle of inversion passing through (c, d). Consequently, not only is this an algebraic solution to the problem, but also it yields a valid geometric solution. The intersection points are given by more complicated rational functions of x and y:

a = \frac{x (x-y) (x+y) \left(2 x^2 y^2+x^4+5 y^4\right) \left(x^6 y^2+4 x^4 y^4+x^2 y^6+x^8+y^8\right) \left(11 x^{10} y^2+11 x^8 y^4+18 x^6 y^6+11 x^4 y^8+11 x^2 y^{10}+x^{12}+y^{12}\right)}{38 x^{24} y^2+111 x^{22} y^4+310 x^{20} y^6+459 x^{18} y^8+792 x^{16} y^{10}+682 x^{14} y^{12}+804 x^{12} y^{14}+379 x^{10} y^{16}+414 x^8 y^{18}+47 x^6 y^{20}+70 x^4 y^{22}-15 x^2 y^{24}+x^{26}+4 y^{26}}

b = \frac{y \left(2 x^2 y^2+5 x^4+y^4\right) \left(4 x^4 y^2+x^2 y^4+x^6+2 y^6\right) \left(3 x^6 y^2+2 x^4 y^4+3 x^2 y^6+x^8-y^8\right) \left(x^6 y^2+2 x^4 y^4+x^2 y^6+x^8+3 y^8\right)}{38 x^{24} y^2+111 x^{22} y^4+310 x^{20} y^6+459 x^{18} y^8+792 x^{16} y^{10}+682 x^{14} y^{12}+804 x^{12} y^{14}+379 x^{10} y^{16}+414 x^8 y^{18}+47 x^6 y^{20}+70 x^4 y^{22}-15 x^2 y^{24}+x^{26}+4 y^{26}}
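The formulas for c and d above can be checked with exact rational arithmetic. The sketch below substitutes them back into the degree-4 ‘interesting’ factor and into the norm identity, and also confirms the cubic obtained earlier by fixing x = 4 and y = 1. The function names are hypothetical, and the check is only performed at sample points, so it is evidence rather than a proof.

```python
from fractions import Fraction

def quartic(x, y, c, d):
    """Left-hand side of xy(cy + dx) + (c^2 + d^2)(cx + dy - x^2 - y^2) = 0."""
    return x * y * (c * y + d * x) + (c * c + d * d) * (c * x + d * y - x * x - y * y)

def cubic_x4_y1(c, d):
    """The cubic in c and d obtained by fixing x = 4, y = 1."""
    return (c**2 * d + 4 * c**3 - 17 * c**2 + 4 * c * d**2 + 4 * c
            + d**3 - 17 * d**2 + 16 * d)

def tangent_solution(x, y):
    """The parametric (c, d) from the tangent-line construction, as exact fractions."""
    x, y = Fraction(x), Fraction(y)
    s = x**2 + y**2
    denom = (x - y) * (x + y) * (
        11 * x**10 * y**2 + 11 * x**8 * y**4 + 18 * x**6 * y**6
        + 11 * x**4 * y**8 + 11 * x**2 * y**10 + x**12 + y**12)
    c = (x * s * (2 * x**2 * y**2 + x**4 + 5 * y**4)
         * (-3 * x**6 * y**2 - 2 * x**4 * y**4 - 3 * x**2 * y**6 + x**8 - y**8)) / denom
    d = (y * s * (2 * x**2 * y**2 + 5 * x**4 + y**4)
         * (3 * x**6 * y**2 + 2 * x**4 * y**4 + 3 * x**2 * y**6 + x**8 - y**8)) / denom
    return c, d

c, d = tangent_solution(2, 1)
print(c, d, quartic(2, 1, c, d), c * c + d * d)
```

For x = 2, y = 1 this gives c = 5510/58647 and d = 218495/58647, both positive, with c^2 + d^2 = 125/9 in agreement with the norm formula.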

Excitingly, when x and y are the two legs of a Pythagorean triple, the radius of the circle of inversion is itself rational, so we can scale down so that the circle of inversion is the unit circle. That means that, after stereographic projection, the dodecahedron is symmetric under reflection in the three coordinate planes.

Taking the (5, 12, 13) Pythagorean triple, the resulting dodecahedron has the following 20 vertices:

  • 4 vertices of the form (0, ±1307215, ±2236392)/2590417;
  • 4 vertices of the form (±6274632, 0, ±2787625)/6865993;
  • 4 vertices of the form (±64472300514372, ±382579158329275, 0)/387973568586253;
  • 8 vertices of the form (±386997531010983823086099699996473688, ±553410551662400041405804573591217100, ±568341850739218062368814809404334875) /882634005124184502065650762315319437;

all of which are rational points on the unit sphere.

Since we can generate Pythagorean triples parametrically, this gives us an infinite parametric family of rational dodecahedra in the unit sphere which are symmetric under reflections in the three coordinate planes.

Is it possible to find rational dodecahedra arbitrarily close to a regular dodecahedron? It seems that this should be the case, because (torsion points notwithstanding) each solution should generate a dense subset of the points on the correct component of the elliptic curve.



A field F consists of two compatible Abelian groups — an additive group on F and a multiplicative group on F \ {0} — such that multiplication distributes over addition.

In certain cases, though, this multiplicative group can be the additive group of a field structure on F \ {0}, giving a third Abelian group structure on F \ {0,1}. Let us call such an algebraic structure a threeld, because it is a generalisation of a field with three compatible Abelian group operations.

We’ll refer to the field structure on F as the outer field, and the field structure on F \ {0} as the inner field; the multiplicative group of the outer field coincides with the additive group of the inner field by definition.

If the outer field does not have characteristic 2, then it contains distinct elements {−1, +1}; these form an order-2 subgroup of the multiplicative group of the outer field, or equivalently an order-2 subgroup of the additive group of the inner field. It follows, therefore, that the inner field must have characteristic 2. However, there can only be at most two elements (±1) in the outer field which square to 1, so the inner field contains exactly two elements: the outer field is isomorphic to \mathbb{F}_3 and the inner field is isomorphic to \mathbb{F}_2.

In all other threelds, the outer field has characteristic 2, so we shall henceforth concentrate on this case.

We can fully characterise the remaining finite threelds. Because the multiplicative group of a finite field is cyclic, it must have prime order to support an inner field, so the finite threelds have outer field \mathbb{F}_{2^p} and inner field \mathbb{F}_{2^p-1} where 2^p - 1 is a Mersenne prime.
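As a small illustration, the finite threelds with outer field of characteristic 2 can be enumerated by searching for Mersenne primes. This sketch uses naive trial division, which is adequate only for small exponents; the function names are hypothetical.

```python
def is_prime(m):
    """Naive trial-division primality test."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def finite_threelds(max_exponent):
    """Pairs (outer field order, inner field order) = (2^p, 2^p - 1)
    for which 2^p - 1 is prime, with 2 <= p <= max_exponent."""
    return [(2**p, 2**p - 1)
            for p in range(2, max_exponent + 1)
            if is_prime(2**p - 1)]

print(finite_threelds(7))
```

The exceptional threeld with outer field \mathbb{F}_3 and inner field \mathbb{F}_2 is not of this form, so it does not appear in the list.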

Infinite threelds

What about infinite threelds? If there are infinitely many Mersenne primes, then there are arbitrarily large finite threelds and the upward Löwenheim-Skolem theorem implies the existence of an infinite threeld (and indeed threelds of arbitrarily large infinite cardinalities).

Note that in an infinite threeld, the inner field must have characteristic 0; if it were of characteristic p, then every nonzero element of the outer field would be a pth root of unity, but there can only be at most p such roots.

Can we prove the existence of an infinite threeld without assuming the existence of infinitely many Mersenne primes?

Consider the first-order theory of fields of characteristic 2, augmented with the following additional infinite schema of axioms:

  • every element has a unique square root;
  • every element has a unique cube root;
  • every element has a unique 5th root;
  • every element has a unique 7th root;
  • every element has a unique 11th root;
  • […]
  • every element has a unique pth root;
  • […]

Every finite initial segment of these axioms has a model — firstly define:

N = (2 - 1)(3 - 1)(5 - 1)(7 - 1)(11 - 1) \cdots (p - 1) + 1

and then note that 2^N − 1 ≡ 1 (mod p) for each of these primes by Fermat’s little theorem. It follows that the (cyclic!) multiplicative group of the finite field on 2^N elements has order not divisible by p, so we can find a unique pth root of any element, and therefore this finite field is a model of that initial segment of axioms.
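The congruence can be checked concretely for the initial segment of primes up to 11, where N = 1·2·4·6·10 + 1 = 481. The sketch below uses modular exponentiation and a gcd check; the function name exponent_for is hypothetical.

```python
from math import gcd

def exponent_for(primes):
    """N = product of (p - 1) over the given primes, plus 1."""
    n = 1
    for p in primes:
        n *= p - 1
    return n + 1

primes = [2, 3, 5, 7, 11]
N = exponent_for(primes)
for p in primes:
    residue = (pow(2, N, p) - 1) % p       # (2^N - 1) mod p
    coprime = gcd(2**N - 1, p) == 1        # order of the multiplicative group is coprime to p
    print(p, residue, coprime)
```

Because the group order 2^N − 1 is coprime to p, the map sending each element to its pth power is a bijection on the cyclic multiplicative group, which is exactly the unique-root property.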

So, by compactness of first-order predicate logic, there exists a model satisfying all of these axioms. It must be a field of characteristic 2, and these axioms ensure that its multiplicative group contains pth roots of all elements. As an abelian group, it is torsion-free and divisible, and is therefore the additive group of a vector space over \mathbb{Q}.

Now, because every vector space over \mathbb{Q} can be made into the additive group of a field (a number field if the dimension is finite, or a transcendental field otherwise), the infinite field that we obtained by the compactness theorem can indeed be upgraded into a threeld. The Löwenheim-Skolem theorem then gives threelds of all infinite cardinalities.


29-year-old Conway conjecture settled

Ilkka Törmä and Ville Salo, a pair of researchers at the University of Turku in Finland, have found a finite configuration in Conway’s Game of Life such that, if it occurs within a universe at time T, it must have existed in that same position at time T−1 (and therefore, by induction, at time 0). Here is the configuration of live and dead cells, surrounded by an infinite background of grey “don’t care” cells:

The configuration was discovered by experimenting with finite patches of repeating ‘agar’ and using a SAT solver to check whether any of them possess this property. Similarly, one can use a SAT solver to verify that Törmä and Salo’s result is correct.

Since this configuration can be stabilised (by the addition of further live cells, shown in yellow) into a finite still-life, this demonstrates that not every still-life can be constructed by colliding gliders.

The first finite stabilisation was 374 cells, but this was promptly reduced to 334 cells by Danielle Conway and then to the 306-cell configuration above by Oscar Cunningham. Oscar moreover proved, again using SAT solvers, that this is the minimum-population stabilisation of the Törmä-Salo configuration.

Consequently, we have the following pair of bounds:

  • Every strict still-life with ≤ 20 cells can be synthesised by gliders.
  • There exists a strict still-life with 306 cells that cannot be synthesised.

More importantly, the Törmä-Salo result positively answers a question first posed by John Conway himself on 24th August 1992:

The things buildable by gliders (an idea I think first popularized
by Buckingham) are a nice class, mainly because they are provably of
infinite “age” (at least if you define them correctly). I’m sure we
should NOT believe, however, that everything of infinite age is so
buildable (even if most of us do). I expect that there is a still life
of such delicacy that in some essential sense it is its only ancestor –
though obviously that sense must allow for fading configurations outside
it, and probably allow for more.

This brings me to an interesting point – the false lessons experience
might teach us. Experience is a bad guide to large configurations – it
teaches us perhaps that there is no orphan, that almost all configurations
die down pretty soon – whereas almost all configurations ARE orphans, of
course, and PROBABLY almost all configurations grow infinitely, as you
asserted in your note, but I’m sure not meaning that it was provably true.

A non-constructible
Sorry – A non-(glider-)constructible configuration might be something
that’s almost an orphan, in that it can only arise from a similar
configuration at the previous time, which itself can only arise from … .

Indeed, is there a Godlike still-life, one that can only have existed
for all time (apart from things that don’t interfere with it)? I like
this one! I imagine it might be findable too, by a version of the searches
that found the old orphans (gardens-of-eden), but restricted to still-lifes.

Well, I’m going out to get a hot dog now, so will stop this. It was
originally intended to be only a very much shorter thank-you note, and so
was addressed only to you – please circulate it if you like. JHC

The construction also implies a solution to the generalised grandfather problem: a pattern which has an N-tick predecessor but not an (N+1)-tick predecessor. The diameter of such a pattern grows like Θ(sqrt(log(N))).

Previous results were known for small values of N (N=0 by Roger Banks, and N=1,2,3 by mtve). Recently Törmä and Salo settled the problem for all positive integers, but the diameter of the pattern implied by their proof grows like Θ(N). A few days later they discovered the pattern in this post, which implies the stronger (and indeed optimal up to a constant factor) result above.

In other GoL-related news:

  • David Raucci discovered the first oscillator of period 38. The remaining unsolved periods are 19, 34, and 41.
  • Darren Li has connected Charity Engine to Catagolue, providing approximately 2000 CPU cores of continuous effort and searching slightly more than 10^12 random initial configurations per day.
  • Nathaniel Johnston and Dave Greene have published a book on Conway’s Game of Life, featuring both the theoretical aspects and engineering that’s been accomplished in the half-century since its conception. Unfortunately it was released slightly too early to include the Törmä-Salo result or Raucci’s period-38 oscillator.

Training a random Gaussian generator

I’ve spent the last couple of months tackling the problem of designing an algorithm to rapidly generate high-quality normally-distributed pseudorandom numbers on a GPU. Whilst this may seem quite pedestrian, it turned out to be much more interesting than I’d anticipated: the journey involved Hermite polynomials, generating functions, badly approximable numbers, quasi-Newton methods, and nearest-neighbour spatial queries.

EDIT 2023-01-02: the code and lookup tables are now open-source (MIT-licenced) and included as part of Hatsya’s cpads library.

Traditional methods

There are a few well known methods for producing normally-distributed random numbers from uniformly-distributed inputs.

The simplest approach is to just generate a uniformly-distributed random variable and then apply the inverse CDF of the Gaussian distribution. Unfortunately, the inverse CDF of the Gaussian distribution is not easy to compute; it’s a transcendental function with no closed form, so evaluating it on a computer is difficult: you’d need to use a Taylor series or a Padé approximant with enough terms that the output distribution is sufficiently close (in terms of KL-divergence) to a true Gaussian distribution such that they cannot be easily distinguished.

The other problem with this approach is that the output is a deterministic function of a single uniform input. If the input is a random 32-bit integer, then this generator would only be capable of generating 2^32 different output values.

Even if the input is a random 64-bit integer, it would still have a major problem: any value that it can produce must occur with probability at least 1/2^64. If we generated 2^80 random outputs, then there would be far too many (~ 2^16) outside the interval [−9.5, 9.5] or far too few (exactly zero). On the other hand, a true Gaussian generator should produce an expected 2537 outputs with magnitude greater than 9.5.
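These figures can be reproduced with the complementary error function, using the identity P(|Z| > t) = erfc(t/√2) for a standard Gaussian Z. A minimal sketch (the function name two_sided_tail is hypothetical):

```python
import math

def two_sided_tail(t):
    """P(|Z| > t) for a standard Gaussian Z."""
    return math.erfc(t / math.sqrt(2))

samples = 2 ** 80
expected_true_gaussian = two_sided_tail(9.5) * samples
# Any value a 64-bit-input generator can emit has probability at least 2^-64,
# so a reachable value beyond 9.5 would be hit about this many times:
minimum_if_reachable = samples // 2 ** 64

print(expected_true_gaussian, minimum_if_reachable)
```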

A commonly used alternative is the Box-Muller method, which takes a pair of uniform random variables and emits a pair of independent standard Gaussians. It uses the fact that if (x, y) is a bivariate standard Gaussian, then the polar coordinates (r, θ) are independent and have simpler distributions: r² follows an exponential distribution and θ is uniformly distributed in [0, 2π].

Whilst it doesn’t involve anything as exotic as the inverse CDF of a Gaussian distribution, it still requires the computation of a logarithm and a square-root (to determine r) and trigonometric functions (to transform from polar coordinates to Cartesian coordinates), so it is also rather computationally expensive.
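For reference, the transform just described fits in a few lines. The following is a plain-Python sketch (not GPU code, and not part of the generator developed in this post) that makes the logarithm, square root and trigonometric calls explicit.

```python
import math
import random

def box_muller(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two independent standard Gaussians."""
    r = math.sqrt(-2.0 * math.log(u1))   # r^2 is exponentially distributed with mean 2
    theta = 2.0 * math.pi * u2           # the angle is uniform on [0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(12345)
samples = []
for _ in range(50000):
    # 1 - random() lies in (0, 1], which keeps the logarithm finite
    x, y = box_muller(1.0 - rng.random(), rng.random())
    samples.extend((x, y))

mean = sum(samples) / len(samples)
variance = sum(s * s for s in samples) / len(samples) - mean * mean
print(mean, variance)
```

The sample mean and variance should land close to 0 and 1 respectively.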

There are also algorithms based on rejection sampling, such as Marsaglia’s ziggurat algorithm, which are very performant on scalar architectures (such as CPUs) but less so on vector architectures (such as GPUs) because of the presence of conditional branching.

Algorithms based on summing independent random variables

If you were to neglect the cost of generating the uniform random numbers themselves, then an especially cheap (but low-quality) approximation to a standard Gaussian is obtained by taking 12 random U[0,1] numbers and computing their alternating sum:

A − B + C − D + E − F + G − H + I − J + K − L

This works relatively well, but not fantastically (you can visually distinguish between this probability density function (Irwin-Hall) and the counterpart for a true standard Gaussian):

Why is it such a good approximation? Simply put, each uniform random variable has a variance of 1/12, so summing them produces a unit-variance output, and the use of an alternating sum ensures that the output mean is 0. The central limit theorem tells us that summing lots of (finite-variance!) i.i.d. random variables together and rescaling will approximate a Gaussian, and 12 is reasonably close to ‘lots’.

We could improve the quality of the generator by using more uniform inputs, but in reality they’re not free: they have to be produced by a pseudorandom number generator such as PCG or xorwow. Another problem is that it’s far too thin-tailed: the kurtosis of the output is detectably less than 3 (platykurtic), and the number of outputs you need to sample to be sure of that only grows like the square of the number of independent uniforms that you’re summing: so you’d need to sum on the order of 2^20 independent uniforms per output if you want the generator to be able to produce 2^40 outputs before failing the test. Naturally, generating a million uniform random numbers to produce a single output is extremely costly, so the Irwin-Hall method is insufficient for our purposes.
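The thin tails can be quantified exactly, because variances and fourth cumulants both add under independent sums. The sketch below (with illustrative function names) samples the alternating sum and computes the exact kurtosis of a signed sum of n uniforms.

```python
import random
from fractions import Fraction

def irwin_hall_gaussian(rng):
    """Alternating sum of 12 uniforms on [0, 1): mean 0, variance 1."""
    return sum((-1) ** i * rng.random() for i in range(12))

def alternating_sum_kurtosis(n):
    """Exact kurtosis of a signed sum of n independent U[0,1] variables."""
    variance = n * Fraction(1, 12)            # each uniform has variance 1/12
    fourth_cumulant = n * Fraction(-1, 120)   # each uniform has 4th cumulant -1/120
    # sign flips do not change even cumulants, so the alternating signs are harmless
    return 3 + fourth_cumulant / variance ** 2

print(alternating_sum_kurtosis(12))   # 29/10
```

In general the kurtosis is 3 − 1.2/n, so the deficit shrinks only linearly in the number of uniforms being summed.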

Wallace’s method

Christopher Wallace proposed a stateful method of generating Gaussians, which admits efficient vectorised implementations such as the one described in this paper by Brent.

The basic idea is that if you have a set of n i.i.d. Gaussians, then applying an orthogonal matrix in SO(n) produces a different set of n i.i.d. Gaussians (which, of course, are not independent from the first set). Wallace’s generator maintains an internal ‘pool’ of n i.i.d. Gaussians and has the following state-update and output functions:

  • State update: apply a sparse random orthogonal matrix to the pool. This is often composed of a permutation and a block-diagonal matrix of Hadamard matrices. Either the permutation or the block-diagonal matrix of Hadamard matrices should be randomised (i.e. depend on bits emitted by a pseudorandom number generator) to ensure that the orthogonal matrix is unpredictable.
  • Output function: project down by taking a proper subset of the n Gaussians in the pool. Scale all of these by (the same) random χ²-distributed correction term to correct for the fact that the orthogonal matrix preserves the sum of squared Gaussians in the pool.

To overcome the sparsity of the state-update function and obtain superior mixing properties, one could repeat the state-update function several times between each consecutive output of the generator.
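To make the state update concrete, here is a toy sketch of one Wallace-style step: a random permutation followed by a block-diagonal matrix of 4×4 Hadamard blocks. It is an illustration under simplifying assumptions (pool size 16, block size 4, no χ² correction), not Wallace's or Brent's actual implementation.

```python
import random

def hadamard4(block):
    """Multiply a length-4 block by the orthogonal matrix H4 / 2."""
    a, b, c, d = block
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]

def wallace_update(pool, rng):
    """One state update: random permutation, then block-diagonal Hadamard."""
    shuffled = list(pool)
    rng.shuffle(shuffled)
    updated = []
    for i in range(0, len(shuffled), 4):
        updated.extend(hadamard4(shuffled[i:i + 4]))
    return updated

rng = random.Random(1)
pool = [rng.gauss(0.0, 1.0) for _ in range(16)]
new_pool = wallace_update(pool, rng)

# The transformation is orthogonal, so the sum of squares is preserved;
# this is exactly what the correction term in the output function compensates for.
print(sum(x * x for x in pool), sum(x * x for x in new_pool))
```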

A new algorithm

The key ingredient of Wallace’s generator is the random orthogonal linear transformation that’s applied across the warp to update the state. I then had an idea: what if we use a Wallace-like orthogonal linear transformation not as a state-update function, but as an output function?

As with approaches such as the ziggurat method and Box-Muller, we presuppose that we have a good pseudorandom generator of uniform 32-bit integers, such as Melissa O’Neill’s PCG, and concentrate on the problem of converting this uniformly-distributed input into a normally-distributed output.

In particular, we propose the following architecture:

  • each of the 32 threads in the warp generates a uniform pseudorandom 32-bit integer;
  • each thread uses some of these input bits to sample two independent weak approximations to a Gaussian, for a total of 64 random variables;
  • a random orthogonal linear transformation is applied across the warp to this set of 64 random variables (viewed as a vector in R^64);
  • we project down to R^32 by linearly combining the two values in each thread to produce one double-precision floating-point value per thread;
  • a small uniform random perturbation is added to the output.

The weak approximations are loaded from a lookup table (using 8 bits of input randomness to load the absolute value from a 256-element lookup table, and a 9th bit to determine the sign). It is necessary to use this lookup table for the weak approximations rather than, say, a uniform distribution, because otherwise the output distribution would again be too platykurtic. An alternative could be to use Box-Muller to generate the input distributions, which would be high quality but expensive to compute.

The purpose of the orthogonal linear transformation is twofold: firstly, by the central limit theorem, the marginal distribution of the output is much closer to a Gaussian than the ‘weak approximations’ that we’ve linearly combined; secondly, it aggregates randomness from all 32 of the threads in the warp, so that each output is dependent on 1024 random input bits instead of just 32.

Each of the outputs can be considered to be a weighted sum of independent random variables, similar to the Irwin-Hall approach where 12 uniform distributions are summed. The orthogonal linear transformation extracts multiple such weighted sums in parallel from the same set of input random variables; the results are non-independent by construction, but the orthogonality of the matrix means that they’re still (linearly) uncorrelated.

Because our approach is stateless (we sample fresh random variables and then apply an orthogonal transformation), we don’t require the annoying χ²-distributed correction term present in Wallace’s generator.

We make a few decisions for efficiency:

  • Everything before the final linear projection is done in 32-bit integer arithmetic, with the final linear combination happening in full double-precision floating-point arithmetic.
  • The lookup table resides in CUDA shared memory.
  • For perfect bank conflict avoidance, thread n only accesses elements of the shared memory array congruent to n (mod 16).
  • As such, the shared memory array requires 4096 entries (16 KB), consisting of 16 interleaved lookup tables of 256 entries each.
  • There is no reason for these 16 lookup tables to have identical contents, meaning that different residue classes (mod 16) of threads can load different distributions. It transpires that this flexibility does indeed help.
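The indexing rule in the list above can be checked directly. This sketch mirrors the index expression used in the CUDA excerpt further down; table_index is a name chosen here for illustration.

```python
def table_index(entropy, thread_idx, shift=0):
    """Index into the 4096-entry shared-memory array for one lookup."""
    lane = thread_idx & 15                       # residue of the thread (mod 16)
    # 4080 = 0xFF0 selects 8 bits of entropy, already multiplied by 16,
    # so the low 4 bits of the index are free to hold the residue
    return ((entropy >> shift) & 4080) | lane

for thread_idx in range(32):
    for entropy in (0x00000000, 0xDEADBEEF, 0xFFFFFFFF):
        index = table_index(entropy, thread_idx)
        assert 0 <= index < 4096
        assert index % 16 == thread_idx % 16
```

Each thread therefore only ever reads from the interleaved table belonging to its own residue class.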

Our random orthogonal linear transformation is a product of sparse matrices, so as to limit the amount of communication between threads. Rather than describe it verbally, it is easiest to draw a dataflow schematic of what happens in each thread. Each of the yellow diamonds is a ‘conditional negate’ operation, which either performs the identity operation or negates its input with probability ½.

This can be viewed as consisting of 5 copies of the following layer, stacked vertically:

(Technically, if you stacked two of these layers together verbatim, then the ‘a’ register would have two successive yellow diamonds between the two layers. However, a pair of successive probability-½ negations is functionally equivalent to a single probability-½ negation, so we can ‘fuse’ the two diamonds into one.)

What does this layer do? It uses 3 random bits to uniformly randomly select one of the eight 2×2 Hadamard matrices enumerated below and multiplies it by the input vector (a, b):

These are, ignoring the scale factor of sqrt(2), precisely the orthogonal transformations where each coordinate of the output depends equally on each coordinate of the input; that is to say, they are ‘maximally mixing’. Omitting the scale factor of sqrt(2) means that this is implementable just using addition/subtraction of integers, and is exact (no rounding errors).
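The eight matrices are easy to enumerate by brute force: they are the 2×2 matrices with entries ±1 whose rows are orthogonal. A short sketch:

```python
from itertools import product

def hadamard_2x2_matrices():
    """All 2x2 matrices with entries +1/-1 whose two rows are orthogonal."""
    matrices = []
    for p, q, r, s in product((1, -1), repeat=4):
        if p * r + q * s == 0:
            matrices.append(((p, q), (r, s)))
    return matrices

for matrix in hadamard_2x2_matrices():
    print(matrix)
```

Of the 16 possible sign patterns, exactly 8 survive the orthogonality condition, which is why 3 random bits suffice to select one uniformly.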

The whole warp performs an FFT-style butterfly network of independent random 2×2 Hadamard matrices; by the end of that process, the ‘a’ register of a thread contains an equal mix of the 32 input variables from that thread’s half-warp; the ‘b’ register contains an equal mix of the 32 input variables from the other half-warp. The butterfly network is implemented using the GPU instruction SHFL.BFLY, which is generated by the compiler from the __shfl_xor_sync intrinsic.
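The data flow of the butterfly can be simulated on the CPU by treating the warp as two lists of 32 integers. The sketch below models each layer as 's = a + b; a -= b; b = s from lane XOR stride', with strides 1, 2, 4, 8, 16; the optional negations before each layer are a simplification of the schematic (the negation after the final layer is omitted), and the function names are illustrative.

```python
import random

def butterfly(a, b, rng=None):
    """Simulate five mixing layers across a 32-lane warp; returns new (a, b) lists."""
    for layer in range(5):
        stride = 1 << layer
        if rng is not None:
            # conditional negation: flip each register with probability 1/2
            a = [x if rng.getrandbits(1) else -x for x in a]
            b = [x if rng.getrandbits(1) else -x for x in b]
        s = [x + y for x, y in zip(a, b)]               # s = a + b
        a = [x - y for x, y in zip(a, b)]               # a -= b
        b = [s[lane ^ stride] for lane in range(32)]    # b = shuffle_xor(s, stride)
    return a, b

def impulse_response(lane, register):
    """Outputs when a single input (one register of one lane) is 1 and the rest are 0."""
    a, b = [0] * 32, [0] * 32
    (a if register == "a" else b)[lane] = 1
    return butterfly(a, b)

out_a, out_b = impulse_response(5, "b")
print(out_a)
print(out_b)
```

An input from lane 5 shows up with weight ±1 in the a register of lanes 0 to 15 and in the b register of lanes 16 to 31, and nowhere else. Each layer doubles the sum of squares, so the whole network is orthogonal up to a factor of sqrt(32).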

Linear combination

Finally, the values in the two registers are linearly combined (in 64-bit floating point, where one register contributes 5/9 of the variance and the other contributes 4/9 of the variance) and an additional low-weight uniform random variable c is added for additional smoothing.

Why are the variance proportions 4/9 and 5/9? They were chosen because the coefficients in the linear combination, 2/3 and sqrt(5)/3, are in the ratio 1 : (φ − ½), which is difficult to approximate with rational numbers whilst still keeping the contributions roughly equal in magnitude.

If the ratio were rational, say p : q for small integers p and q, then the output would be (ap + bq)/f for some constant f (where a and b are the integer values in the two registers), so the output could only take on integer multiples of 1/f. As such, the output distribution could be distinguished from a Gaussian by looking at the Fourier transform of the distribution: it would have a massive peak at the frequency f and multiples thereof.

The final uniform random variable c is added with a low weight at the very end to further smooth out any high-frequency effects. This is especially important when the output is close to 0, because the absolute precision of an IEEE floating-point number is higher in the neighbourhood of 0; this final perturbation helps to randomise the low-order bits of the mantissa of the output.

How not to populate the lookup tables

My initial idea for populating the 4096-element lookup table was to take the values of the inverse CDF of the Gaussian distribution sampled at 4096 equally spaced points in the interval [½, 1]. This, after all, seemed like a natural way to produce a ‘discrete approximation’ to the Gaussian distribution. This produced the following initial attempt (each row is one of the 16 interleaved distributions).
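A sketch of this initial table, using the inverse CDF from Python's statistics module. Sampling at the midpoints of 4096 equal sub-intervals is an assumption made here to avoid evaluating the inverse CDF at exactly 1 (where it is infinite); how the values were assigned to the 16 interleaved tables is not reproduced.

```python
from statistics import NormalDist

def initial_table(size=4096):
    """Inverse-CDF values at equally spaced probabilities in (1/2, 1)."""
    standard = NormalDist()
    # midpoint of the i-th of `size` equal sub-intervals of [1/2, 1]
    return [standard.inv_cdf(0.5 + (i + 0.5) / (2 * size)) for i in range(size)]

table = initial_table()
# The sign is chosen separately, so the second moment of the signed
# distribution is just the mean of the squared table entries.
second_moment = sum(v * v for v in table) / len(table)
print(table[0], table[-1], second_moment)
```

The second moment falls slightly short of 1 because the table cannot represent the far tail.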

This produced underwhelming results: after generating 2^38 Gaussians using this method and computing the empirical nth moment (mean of X^n) for the first 8 positive integers n, the higher even moments already fail a z-test (we know what the true moments are for a Gaussian, as well as their standard errors).

[----------] 1 test from Gaussian
[ RUN ] Gaussian.Moments
274877906944 trials completed.
moment 1: discrepancy = -2.79812e-06; value = -2.79812e-06
moment 2: discrepancy = -2.41684e-06; value = 0.999998
moment 3: discrepancy = -1.46413e-05; value = -1.46413e-05
moment 4: discrepancy = -0.000106436; value = 2.99989
moment 5: discrepancy = -5.0672e-05; value = -5.0672e-05
moment 6: discrepancy = -0.00155006; value = 14.9984
moment 7: discrepancy = 0.00010665; value = 0.00010665
moment 8: discrepancy = -0.0224637; value = 104.978
[ OK ] Gaussian.Moments (322261 ms)
[----------] 1 test from Gaussian (322261 ms total)

Clearly, we need better ideas.
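The z-scores behind that verdict can be recomputed from the discrepancies in the log, using the fact that the empirical nth moment of N true Gaussians has variance (E[X^2n] − E[X^n]²)/N. A sketch, with illustrative function names:

```python
import math

def gaussian_moment(n):
    """E[X^n] for a standard Gaussian X: zero for odd n, (n - 1)!! for even n."""
    if n % 2:
        return 0
    result = 1
    for j in range(1, n, 2):
        result *= j
    return result

def moment_z_score(n, discrepancy, trials):
    """z-score of an observed discrepancy in the empirical n-th moment."""
    variance = gaussian_moment(2 * n) - gaussian_moment(n) ** 2
    return discrepancy / math.sqrt(variance / trials)

trials = 2 ** 38
discrepancies = {1: -2.79812e-06, 2: -2.41684e-06, 3: -1.46413e-05,
                 4: -0.000106436, 5: -5.0672e-05, 6: -0.00155006,
                 7: 0.00010665, 8: -0.0224637}
for n, d in discrepancies.items():
    print(n, round(moment_z_score(n, d, trials), 2))
```

The 4th, 6th and 8th moments come out at roughly −5.7, −8.1 and −8.3 standard errors, whilst the remaining moments stay within about 2.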

Optimising the lookup table

The key idea is to note that sampling those 2^38 random Gaussians was actually unnecessary, because we can analytically compute the moments of the distribution produced by our generator. For this initial attempt, we can see that the moment test which fails first is the 8th moment, where the expected time to reach a 4-sigma statistic is 9.1 × 10^10.

However, this realisation that we can analytically compute the expected time to fail the moment tests is our salvation: given a loss function that’s a differentiable function of the 4096 numbers in the lookup table, we can use gradient descent to tweak those numbers to optimise the loss function! We’ll initialise the optimisation with our poor initial attempt and let gradient descent do the rest.

What loss function do we choose? Intuitively, we want the reciprocal of the minimum expected time to fail any of the moment tests. It is, however, important to note that the moments are just expectations E[X^n] of monomials for different powers n, and it may well be the case that there’s a polynomial p such that E[p(X)] provides a more effective test — that is to say, a linear combination of moments is more effective than any individual moment.

Up until now, we’ve been looking at moments, which are the expectations of the elements of the monomial basis {1, X, X^2, X^3, …}. It turns out to be much more elegant to use a different basis, namely the Hermite polynomials:

  • He_0(X) = 1
  • He_1(X) = X
  • He_2(X) = X^2 − 1
  • He_3(X) = X^3 − 3X
  • He_4(X) = X^4 − 6X^2 + 3

In particular, they are orthogonal polynomials with respect to a standard Gaussian: the expected value E[He_m(X) He_n(X)] is zero when m and n are different, and is n! otherwise. Other than the 0th Hermite polynomial, these all have zero mean, and the covariance matrix Σ is the diagonal matrix whose entries are factorials.

The word ‘moments’ exists to describe the expected values of powers of a random variable; for brevity, we’ll analogously use ‘hermites’ (yes, as a plural common noun!) to refer to the expected values of Hermite polynomials of a random variable.
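Both the polynomials and the hermites of a discrete distribution are easy to evaluate with the three-term recurrence He_{n+1}(x) = x He_n(x) − n He_{n−1}(x). A minimal sketch (plain Python rather than the PyTorch used for the actual optimisation; the function names are illustrative):

```python
def hermite_values(x, max_order):
    """[He_0(x), ..., He_max_order(x)] via the three-term recurrence."""
    values = [1.0, x]
    for n in range(1, max_order):
        values.append(x * values[n] - n * values[n - 1])
    return values[:max_order + 1]

def hermites(points, weights, max_order):
    """Expected values E[He_n(X)] for a discrete distribution."""
    totals = [0.0] * (max_order + 1)
    for x, w in zip(points, weights):
        for n, h in enumerate(hermite_values(x, max_order)):
            totals[n] += w * h
    return totals

print(hermite_values(2.0, 4))                  # [1.0, 2.0, 3.0, 2.0, -5.0]
print(hermites([-1.0, 1.0], [0.5, 0.5], 4))    # fair coin on {-1, +1}
```

For a true standard Gaussian every hermite beyond the 0th would be zero; the fair coin on {−1, +1} matches up to order 3 but has a 4th hermite of −2.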

That provides a very convincing choice of loss function: we take log(x^T P x), where P = inv(Σ) is the (conveniently diagonal!) precision matrix of the vector of the first n non-constant even Hermite polynomials under a standard Gaussian assumption, and x is the vector of the first n non-constant even hermites of our approximation of a Gaussian distribution. The quadratic form x^T P x is proportional to the reciprocal of the shortest time to failure of any polynomial test of degree ≤ 2n.

It turned out to be more numerically viable to optimise the loss function for the simple convolution of the 16 base distributions, even though the actual output of our generator involves more convolutions (the output depends on 64 random variables, 4 from each of the 16 base distributions). These extra convolutions only make the generator better, not worse, so you can view the loss function we’re using as a conservative bound on the generator’s true quality.

We consider a slightly generalised loss function, log(x^T D x) where D is an arbitrary positive-definite diagonal matrix which specifies the relative importance of the different Hermite polynomials.
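Because the matrices involved are diagonal, the loss reduces to a weighted sum of squares. A sketch in plain Python, where the default weights are the reciprocals of the factorials (the variance of He_m under a standard Gaussian is m!) and the function name is illustrative:

```python
import math

def hermite_loss(even_hermites, weights=None):
    """log(x^T D x), where even_hermites[k - 1] = E[He_{2k}(X)] for k = 1..n."""
    n = len(even_hermites)
    if weights is None:
        # default D = P: the diagonal precision matrix with entries 1 / (2k)!
        weights = [1.0 / math.factorial(2 * k) for k in range(1, n + 1)]
    return math.log(sum(d * h * h for d, h in zip(weights, even_hermites)))

# Fair coin on {-1, +1}: E[He_2] = 0 and E[He_4] = -2, so x^T P x = 4/24.
print(hermite_loss([0.0, -2.0]))
```

Note that the loss tends to minus infinity as the hermites approach zero, so an exact match on all n even hermites would make the logarithm undefined.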

I used PyTorch as the automatic differentiation framework, so that I only need to implement the forward function and not care about differentiating it myself. PyTorch is chiefly used for training neural networks, where the loss function is too complicated to be symbolically differentiated; instead, it uses ‘reverse-mode’ automatic differentiation (backpropagation).

In order to implement this loss function in PyTorch, it’s necessary to be able to do two things:

  • Compute the hermites of a discrete distribution;
  • Express the hermites of a linear combination of independent random variables in terms of the hermites of those random variables.

The first of these is easy with the recurrence relation for Hermite polynomials. The second is more involved, and ultimately required discovering what appears to be a new result:

Proposition: Let a, b, x, y \in \mathbb{R} with a^2 + b^2 = 1. Then we have:

He_n(ax + by) = \sum\limits_{k=0}^n \binom{n}{k} a^k b^{n-k} He_k(x) He_{n-k}(y)

(Observe that when a = b = \frac{1}{2} \sqrt{2}, this proposition is equivalent to G. Colomer’s first identity mentioned on the MathWorld page. [EDIT: Thanks to Blinkerspawn for pointing out the typo here; I’d originally written sqrt(2) instead of sqrt(2)/2])

Proof: By expanding the binomial coefficient in terms of factorials and pulling out the factor of n!, this is seen to be equivalent to saying that the sequence {He_n(ax + by) / n!} is the convolution of the sequences {a^n He_n(x) / n!} and {b^n He_n(y) / n!}. This is equivalent to saying that the exponential generating function of {He_n(ax + by)} is the product of the exponential generating functions of {a^n He_n(x)} and {b^n He_n(y)}. This generating function is already known, so we can equivalently express this as:

e^{(ax + by)t - \frac{1}{2} t^2} = e^{xat - \frac{1}{2}(at)^2} e^{ybt - \frac{1}{2}(bt)^2}

which is indeed true when a^2 + b^2 = 1. QED.
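As a sanity check, the proposition can also be verified numerically. The sketch below evaluates both sides with the recurrence for the Hermite polynomials; the helper names are illustrative.

```python
import math

def hermite_values(x, max_order):
    """[He_0(x), ..., He_max_order(x)] via He_{n+1}(x) = x He_n(x) - n He_{n-1}(x)."""
    values = [1.0, x]
    for n in range(1, max_order):
        values.append(x * values[n] - n * values[n - 1])
    return values[:max_order + 1]

def proposition_sides(n, a, b, x, y):
    """Left-hand and right-hand sides of the proposition for given n, a, b, x, y."""
    hx = hermite_values(x, n)
    hy = hermite_values(y, n)
    lhs = hermite_values(a * x + b * y, n)[n]
    rhs = sum(math.comb(n, k) * a ** k * b ** (n - k) * hx[k] * hy[n - k]
              for k in range(n + 1))
    return lhs, rhs

for n in range(9):
    lhs, rhs = proposition_sides(n, 0.6, 0.8, 1.3, -0.7)   # 0.6^2 + 0.8^2 = 1
    print(n, lhs, rhs)
```

The two columns agree to within rounding error. Dropping the constraint breaks the identity: with a = b = 1 and n = 2 the two sides differ by exactly 1.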

Using this proposition, we can express the hermites of the output distribution (and therefore our loss function) in terms of the hermites of the base distributions, which are in turn obtained from the raw values in the lookup table by means of the recurrence relation.
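Taking expectations of both sides of the proposition (with X and Y independent) gives a convolution rule for hermites. A sketch of that rule in plain Python, with an illustrative function name:

```python
import math

def combine_hermites(hx, hy, a, b):
    """Hermites of aX + bY from those of independent X and Y, assuming a^2 + b^2 = 1."""
    order = min(len(hx), len(hy)) - 1
    return [sum(math.comb(n, k) * a ** k * b ** (n - k) * hx[k] * hy[n - k]
                for k in range(n + 1))
            for n in range(order + 1)]

coin = [1.0, 0.0, 0.0, 0.0, -2.0]       # hermites of a fair coin on {-1, +1}
root_half = math.sqrt(0.5)
print(combine_hermites(coin, coin, root_half, root_half))
```

The 4th hermite of the normalised sum of two coins is −1, half that of a single coin, which agrees with a direct calculation on the three-point distribution {−√2, 0, √2}. A Gaussian (hermites 1, 0, 0, …) is a fixed point of the rule, as it should be.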

Then there’s the choice of optimisation algorithm to use. Neural networks often use either stochastic gradient descent or (more recently) Adam, which are great algorithms when we only have a noisy estimate of the loss function (based on a batch of training data). In our situation, though, we can compute the loss function exactly without any noise (floating-point issues notwithstanding).

As such, we instead use the BFGS algorithm as implemented in Reuben Feinman’s torchmin package. BFGS is a quasi-Newton method, maintaining an estimate of the inverse of the Hessian matrix at the current point (based on the history of gradients observed at previously visited points) and using that to determine the next step to take. It mimics true second-order optimisation algorithms such as Newton’s method, but avoids computing the true inverse Hessian at each step (which is very expensive).

Often people use a limited memory variant called L-BFGS, which is helpful if you don’t have enough memory to store a full Hessian matrix. In our case, there are only 4096 parameters, and a 4096×4096 matrix only occupies a very manageable 128 megabytes, so we opted for the original BFGS algorithm instead.

Local search

After this optimisation had converged (requiring occasional small random perturbations along the way to escape saddle points), there was a major problem: the entries in the lookup table are required to be integers, but the optimisation was performed over the reals!

To convert to integers, we multiply by a scale factor of 2^24 and then round to the nearest integer lattice point. This rounding can be quite damaging to the final loss function, so we then perform a local search on the vertices of the 4096-dimensional unit hypercube centred on the unrounded scaled vector, moving to the best vertex within a particular radius of the current vertex until we reach a local optimum.

Our local search algorithm is as follows:

  1. begin with a starting point (the nearest integer lattice point to the BFGS solution);
  2. set r = 1;
  3. for each of the 16 base distributions, try changing r coordinates. If the best improvement (across all 16 × (256 choose r) possibilities) reduces log(x^T D x) by at least 0.001, then move to that new point and jump to step 2;
  4. if no sufficient improvement was found, increase r (up to a maximum of 5) and jump to step 3;
  5. if no improvement was found at a local search of depth 5, terminate.
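The control flow of these steps can be sketched with a brute-force search on a toy problem. This is a simplified illustration: it treats the table as a single vector rather than 16 separate distributions, takes an arbitrary loss callable, and uses none of the nearest-neighbour acceleration; the names local_search, max_r and min_gain are hypothetical.

```python
import math
from itertools import combinations

def local_search(target, loss, max_r=2, min_gain=1e-3):
    """Greedy search over floor/ceil roundings of target (brute force, toy sizes only)."""
    floors = [math.floor(t) for t in target]
    point = [math.floor(t + 0.5) for t in target]     # step 1: nearest lattice point
    best = loss(point)
    r = 1                                             # step 2
    while r <= min(max_r, len(point)):
        best_trial, best_value = None, best
        for indices in combinations(range(len(point)), r):    # step 3
            trial = list(point)
            for i in indices:
                # move coordinate i to the other end of its unit interval
                trial[i] = floors[i] + 1 if trial[i] == floors[i] else floors[i]
            value = loss(trial)
            if value < best_value:
                best_trial, best_value = trial, value
        if best_trial is not None and best - best_value >= min_gain:
            point, best, r = best_trial, best_value, 1    # accept, back to step 2
        else:
            r += 1                                        # step 4; leaving the loop is step 5
    return point, best

point, value = local_search([0.4, 0.4, 0.4, 0.4],
                            lambda p: (sum(p) - 1.6) ** 2)
print(point, value)
```

In this toy example naive rounding gives the all-zero vector, whereas the search ends on a vertex with two coordinates rounded up, which is the best achievable.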

Since x^T D x is a positive-definite quadratic function of x, which is a linear function of each base distribution (when the other 15 distributions are held fixed), we can avoid evaluating (256 choose r) possibilities by instead performing a ball-tree-assisted nearest-neighbour Euclidean search with (256 choose floor(r/2)) query points and (256 choose ceil(r/2)) index points. We use the off-the-shelf implementation in scikit-learn. This gives an approximately quadratic speedup over brute-force, and makes the local search run relatively quickly (for each iteration, the search over all 16 base distributions takes a total of approximately 2 seconds for r = 3, 15 seconds for r = 4, and 90 seconds for r = 5).

Applying BFGS followed by local search to our initial inverse-CDF-based attempt, we arrive at an unusual set of distributions which yield a considerably better approximation to a Gaussian when convolved:

Choosing the coefficients of a, b, and c

Suppose that we have already populated the lookup tables as detailed in the previous section. Then we can determine the three coefficients by imposing the following constraints:

  • the ratio of the coefficients of a and b is sqrt(5) : 2;
  • the variance of the output distribution is 1;
  • the kurtosis of the output distribution is 3;

where each constraint is understood to mean ‘as close as possible to within floating-point error’.

Since the uniform perturbation is platykurtic (kurtosis less than 3), the random variables a and b need to be slightly leptokurtic (kurtosis greater than 3) in order for this system of constraints to admit a solution. If we can arrange for that to happen, then the output distribution is symmetric (so all odd-order moments are zero) and the 2nd and 4th moments are also correct.

Consequently, when performing the local search mentioned in the previous section, we added a barrier penalty to the loss function to force the convolution of the 16 distributions to be leptokurtic.

The parameters obtained in this manner result in a generator much better than our original attempt: it takes 1.6 × 10^30 outputs in order to fail a moment test at the 4-sigma level, up from 9.1 × 10^10 (the analogous number for the unoptimised distribution).

CUDA implementation

Here is an excerpt of the source code for the generator, minus the contents of the lookup table and the coefficients of a, b, and c in the linear combination. This matches the schematic detailed above, but is more complete as it shows how the random lookups, conditional negation, and final uniform random perturbation are obtained from the 32 bits of input randomness. (The full source code is in the cpads repository.)

#include "stdint.h"
#define _DI_ __attribute__((always_inline)) __device__ inline
#define shuffle_xor(x, y) __shfl_xor_sync(0xffffffffu, (x), (y))
#define COND_NEGATE(i, x) if (entropy & (1u << (i))) { x = -x; }
#define HMIX(i) { int32_t s = a + b; a -= b; b = shuffle_xor(s, i); }

// load the parameters for the random Gaussian generator
#include "parameters.h"

/**
 * Creates a double-precision Gaussian random variable with
 * specified mean and standard deviation. The 'entropy' argument
 * should be a uniform random uint32.
 */
_DI_ double warpGaussian(uint32_t entropy, const int32_t* smem_icdf,
    double mu=0.0, double sigma=1.0) {

    // bank conflict avoidance strategy:
    int laneId = threadIdx.x & 15;

    // use 16 bits of entropy to retrieve two random variables
    // (so the entire warp has 64 such random variables).
    // ....bbbbbbbb........aaaaaaaa....
    int32_t a = smem_icdf[(entropy & 4080) | laneId];
    int32_t b = smem_icdf[((entropy >> 16) & 4080) | laneId];

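    // perform the first three layers:
    // (the HMIX calls do not appear in this excerpt as extracted; they are
    // reinstated here as an assumption based on the schematic, so consult
    // the cpads repository for the authoritative ordering)
    COND_NEGATE(19, a) COND_NEGATE(18, b) HMIX(1)
    COND_NEGATE(17, a) COND_NEGATE(16, b) HMIX(2)
    COND_NEGATE(15, a) COND_NEGATE(14, b) HMIX(4)
    COND_NEGATE(13, a) COND_NEGATE(12, b)

    // create a uniform random odd int32 (centred on zero).
    int32_t c = (entropy ^ b) | 1;

    // use the lowest 4 bits of entropy (those that contributed the
    // least to the uniform variable c) for the last two layers:
    // (assumed reconstruction, as above; the bit assignments are a guess,
    // except that the least significant bit negates a)
    HMIX(8)
    COND_NEGATE(3, a) COND_NEGATE(2, b) HMIX(16)
    COND_NEGATE(1, b) COND_NEGATE(0, a)

    // at this point, a is a {+1,-1}-weighted sum of the 32 values
    // from this half-warp; b is a {+1,-1}-weighted sum of the 32
    // values from the other half-warp.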

    double result = mu;

    // our output is the combination of two high-variance Gaussian
    // components and one low-variance uniform component:
    double a_scale = sigma * ordinary_params[0];
    double b_scale = sigma * ordinary_params[1];
    double c_scale = sigma * ordinary_params[2];
    result += a * a_scale;
    result += b * b_scale;
    result += c * c_scale;

    return result;
}

The only ‘surprise’ here should be the way that the uniform random variable c is obtained: it’s the result of XORing the 32 bits of input randomness with the register b and setting the last bit to 1:

int32_t c = (entropy ^ b) | 1;

The value b has just been freshly ‘shuffled in’ from other threads in the warp, so is independent, and because its distribution is symmetric it remains independent even when we conditionally negate it.

Both of these Boolean operations will be fused together in practice, as the CUDA compiler combines them into a ‘LOP3’ instruction — which applies an arbitrary 3-input 1-output bitwise Boolean function specified by a truth table in the form of an 8-bit immediate operand.

Why set the least significant bit to 1? Otherwise, it would be negatively biased (with a mean of −0.5) because the representable 32-bit signed integers are:

{−2^31, −(2^31 − 1), …, −2, −1, 0, 1, 2, … 2^31 − 1}

and the first of these values (bit pattern 0x80000000) ruins the symmetry of the set of representable integers. By setting the least significant bit to 1, we instead have a uniform distribution over the following set of odd integers:

{−(2^31 − 1), …, −5, −3, −1, 1, 3, 5, … 2^31 − 1}

which is symmetrical about 0. We can alternatively view this random variable c as the convolution of 31 Bernoulli random variables with supports:

{−1, 1}, {−2, 2}, {−4, 4}, {−8, 8}, …, {−2^30, 2^30}

which is helpful for analytically computing its moments. (The moments of a small finite distribution are easy to compute, and we can compute the moments of a convolution of two distributions knowing the moments of the original distributions. Because we have to use this idea anyway for the random variables a and b, we may as well use it again for c instead of having to analytically sum a polynomial-valued series.)
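The claims about c are easy to check on a scaled-down version. The sketch below enumerates 8-bit signed integers instead of 32-bit ones for the brute-force comparison, and computes moments of the sum of two-point variables by repeatedly applying the binomial rule for independent sums; all names are illustrative.

```python
import math

def odd_uniform_values(bits):
    """All values of (x | 1) as x ranges over the signed integers of the given width."""
    half = 1 << (bits - 1)
    return [x | 1 for x in range(-half, half)]

def convolve_moments(ms, mt):
    """Raw moments of S + T for independent S and T: E[(S+T)^m] = sum C(m,j) E[S^(m-j)] E[T^j]."""
    return [sum(math.comb(m, j) * ms[m - j] * mt[j] for j in range(m + 1))
            for m in range(len(ms))]

def c_moments(bits, order):
    """Raw moments of the sum of uniform {-2^k, +2^k} variables for k = 0 .. bits - 2."""
    moments = [1] + [0] * order                  # start from a point mass at zero
    for k in range(bits - 1):
        step = 1 << k
        two_point = [step ** j if j % 2 == 0 else 0 for j in range(order + 1)]
        moments = convolve_moments(moments, two_point)
    return moments

values = odd_uniform_values(8)
print(sum(range(-128, 128)) / 256)     # mean without the OR: -0.5
print(sum(values) / 256)               # mean with the OR: 0.0
print(c_moments(8, 4), c_moments(32, 2))
```

For the 8-bit analogue the moments agree exactly with brute-force enumeration, and for the full 32-bit case the variance comes out as (4^31 − 1)/3.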

We have ignored the subtlety that a and c are not necessarily independent. (b is independent from each of a and c because it originates from the other half-warp.) However, a is conditionally negated according to the least significant bit of entropy, which does not feature in the computation of c (because it was masked out when we bitwise-OR’d the variable with 1), so it follows that a and c have zero linear correlation and also that the output distribution is still symmetrically distributed. Any nonlinear dependence between a and c can only therefore manifest in the 4th moment (and beyond) of the output distribution, where its effect will be negligible because the scale factor applied to c is so small.

Issues with small floats

Let’s assume for the moment that we’re generating standard Gaussians, for which μ = 0 and σ = 1. A slight problem with the generator is that the scale factors of a, b, and c are all multiples of 2^−95, which means that the output is necessarily a multiple of 2^−95.

This only poses an issue for very small outputs: for example, if an output is in the interval [−2^−43, 2^−43], then the lowest bit of the mantissa is forced to be zero. Once we’ve collected (say) 20 outputs in this range, which takes an expected 2.2 × 10^14 samples, then we can be pretty certain that the generator is flawed.

This can be remedied by expressing the scale factor for c as the sum of two double-precision floats instead of as one double-precision float. Then, the last part of our implementation will resemble this:

double result = mu;

double a_scale = sigma * ordinary_params[0];
double b_scale = sigma * ordinary_params[1];
double c_scale_hi = sigma * ordinary_params[2];
double c_scale_lo = sigma * ordinary_params[3];
result += a * a_scale;
result += b * b_scale;
result += c * c_scale_hi;
result += c * c_scale_lo;

return result;

In doing so, the minimum ‘quantization’ is reduced from 2^−95 to 2^−150, which means that it will take 7.9 × 10^30 outputs to detect the flaw, comparable in magnitude to the moment tests.
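Both sample counts follow from the same short calculation: find the range of outputs in which the lowest mantissa bit is forced to zero, then compute how long it takes to collect 20 outputs there. A sketch, with an illustrative function name:

```python
import math

def samples_to_detect(quantum_exponent, needed=20):
    """Expected samples before `needed` outputs land where the lowest mantissa bit is stuck."""
    # A double in [2^(e-1), 2^e) has a unit in the last place of 2^(e-53), so an
    # output that is a multiple of 2^q has a zero lowest bit whenever |output| < 2^(q+52).
    bound = 2.0 ** (quantum_exponent + 52)
    probability = math.erf(bound / math.sqrt(2.0))    # P(|X| < bound), standard Gaussian
    return needed / probability

print(samples_to_detect(-95))     # single scale factor for c
print(samples_to_detect(-150))    # scale factor split into two doubles
```

These evaluate to approximately 2.2 × 10^14 and 7.9 × 10^30 respectively.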

It should be stressed that the additional fused multiply-add will slightly slow down the generator, and that this fix is unlikely to matter for most practical applications of Gaussian random numbers (such as Monte Carlo simulations), where floating-point rounding errors in downstream calculations are likely to have a greater effect than the reduced relative precision in the neighbourhood of zero.

Testing with PractRand

Although we have analytically determined that the generator won’t fail any of the moment tests within 10^30 outputs, suggesting that the tail behaviour is correct, there could be other unforeseen problems in the generator. For example, had the coefficients of a, b, and c in the final linear combination been small integer multiples of some common divisor, there would be high-frequency issues with the generated distribution even though the tail behaviour could still be correct.

Consequently, I decided to test this using the PractRand tests following the instructions detailed here. In addition to the suggested patch by Daniel Lemire, it was necessary to change the return type in the declaration of show_checkpoint from ‘double’ to ‘void’, as the function doesn’t actually return anything and was segfaulting when I tried to run it.

PractRand tests that the input is uniformly distributed, whereas the output of the Gaussian generator is supposed to be normally distributed. To test the correctness of the output distribution, it is therefore necessary to apply the CDF of a standard Gaussian distribution to transform it into a supposedly-uniform random variable:

double gaussian = output_host_gauss[i];
double uniform = std::erf(gaussian * 0.7071067811865476);
res[i] = (uint32_t) ((1.0 + uniform) * 2147483648.0);

The error function, erf, produces a uniform random variable in the open interval (−1, 1) when given an input whose probability density function is proportional to exp(−x²). Since the probability density function of a standard Gaussian is proportional to exp(−½x²), it is necessary to divide the standard Gaussian by sqrt(2) before feeding it into the error function. Originally I forgot this factor of sqrt(2), and the PractRand tests reassuringly failed (by a huge margin, with p-values less than 10^−1000) at the first checkpoint.

Comparison of the error function, the affine-transformed error function, and the CDF of a standard Gaussian

Then, to convert this into a uniformly random uint32, we apply an affine transformation that maps (−1, 1) to (0, 2^32) and then apply the floor function to the result (implicitly, by casting to an unsigned 32-bit integer).
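The same transformation in Python, for experimenting away from the GPU. The clamp at the top of the range is an addition of this sketch: it guards against erf rounding to exactly 1.0 for very large inputs, which would otherwise produce 2^32.

```python
import math

def gaussian_to_uint32(x):
    """Map a standard Gaussian sample to a (supposedly) uniform 32-bit unsigned integer."""
    uniform = math.erf(x * 0.7071067811865476)     # divide by sqrt(2), then erf: in (-1, 1)
    # affine map from (-1, 1) to (0, 2^32), then floor via int()
    return min(int((1.0 + uniform) * 2147483648.0), 0xFFFFFFFF)

print(gaussian_to_uint32(0.0))                     # 2147483648, the midpoint
print(gaussian_to_uint32(-1.0), gaussian_to_uint32(1.0))
```

The mapping is monotone, so the ordering of the Gaussian samples is preserved in the integers handed to PractRand.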

After running PractRand for approximately four days, the following output was obtained:

$ test/gpu/gpu_gaussian | PractRand/RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0xc7ea1bec
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 256 megabytes (2^28 bytes), time= 2.7 seconds
no anomalies in 124 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 512 megabytes (2^29 bytes), time= 5.6 seconds
no anomalies in 132 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 1 gigabyte (2^30 bytes), time= 11.5 seconds
no anomalies in 141 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 2 gigabytes (2^31 bytes), time= 22.7 seconds
Test Name Raw Processed Evaluation
[Low1/32]FPF-14+6/16:(4,14-2) R= +6.3 p = 2.7e-5 unusual 
...and 147 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 4 gigabytes (2^32 bytes), time= 45.0 seconds
no anomalies in 156 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 8 gigabytes (2^33 bytes), time= 89.9 seconds
Test Name Raw Processed Evaluation
Gap-16:B R= +4.0 p = 2.5e-3 unusual 
...and 164 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 16 gigabytes (2^34 bytes), time= 178 seconds
no anomalies in 172 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 32 gigabytes (2^35 bytes), time= 353 seconds
no anomalies in 180 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 64 gigabytes (2^36 bytes), time= 707 seconds
no anomalies in 189 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 128 gigabytes (2^37 bytes), time= 1406 seconds
Test Name Raw Processed Evaluation
[Low1/32]DC6-9x1Bytes-1 R= -4.2 p =1-6.2e-3 unusual 
...and 195 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 256 gigabytes (2^38 bytes), time= 2785 seconds
no anomalies in 204 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 512 gigabytes (2^39 bytes), time= 5625 seconds
no anomalies in 213 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 1 terabyte (2^40 bytes), time= 11165 seconds
no anomalies in 220 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 2 terabytes (2^41 bytes), time= 22407 seconds
no anomalies in 228 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 4 terabytes (2^42 bytes), time= 45386 seconds
no anomalies in 237 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 8 terabytes (2^43 bytes), time= 90285 seconds
no anomalies in 244 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 16 terabytes (2^44 bytes), time= 213575 seconds
no anomalies in 252 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 32 terabytes (2^45 bytes), time= 395812 seconds
no anomalies in 260 test result(s)

There are no test failures here, nor anything suspicious. Occasionally certain tests are encountered with p-values around 0.0001 or 0.9999 and marked ‘unusual’, but they are not too surprising since thousands of tests are performed in total: these ‘unusual’ results are merely green jelly beans.

Tail tests and measuring the speed of the generator

The GPU can generate random Gaussians much more quickly than they can be copied over the PCI-express bus and analysed by PractRand on the CPU. Consequently, we consider a modified tail test, where we copy only the Gaussians whose absolute value exceeds 4. We expect the convolutions to smooth out the behaviour of the distribution in the bulk, so the tails are worth testing more thoroughly.

Only one out of every 15787 Gaussians is expected to fall in these tails, so we can better match the relative throughput of the Gaussian generation on the GPU and PractRand analysis on the CPU.

We account for this when mapping to a uniform distribution:

std::vector<uint32_t> retrieve() {

    std::vector<uint32_t> res(output_host_count[0]);

    for (size_t i = 0; i < res.size(); i++) {
        double gaussian = output_host_gauss[i];
        double uniform = std::erf(gaussian * 0.7071067811865476);

        // only the tails survive, so shift them to be adjacent to zero
        if (uniform > 0) {
            uniform -= 0.9999366575163338;
        } else {
            uniform += 0.9999366575163338;
        }

        // map to interval (0, 2)
        uniform = uniform * 15787.192767323968 + 1.0;

        res[i] = (uint32_t) (uniform * 2147483648.0);
    }

    return res;
}
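The two constants in this function are determined by the 4-sigma threshold: the first is the mass of a standard Gaussian within 4 standard deviations of the mean, and the second is the reciprocal of the remaining two-sided tail mass. The following sketch recomputes them; the function names central_mass and tail_rarity are illustrative rather than taken from the generator's source.

```cpp
#include <cassert>
#include <cmath>

// Mass of a standard Gaussian within `threshold` standard deviations of
// the mean, namely erf(threshold / sqrt(2)).
inline double central_mass(double threshold) {
    assert(threshold > 0.0);
    return std::erf(threshold * 0.7071067811865476);
}

// Reciprocal of the two-sided tail mass: on average, one Gaussian in this
// many has absolute value exceeding the threshold.
inline double tail_rarity(double threshold) {
    return 1.0 / (1.0 - central_mass(threshold));
}
```

With a threshold of 4, central_mass gives approximately 0.99993666 and tail_rarity gives approximately 15787.19, in agreement with the constants above. The subtraction 1 − erf loses some relative precision, so the last few digits may differ slightly from the hard-coded values.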

Running this on a Volta V100 produces the following output:

$ test/gpu/gpu_gaussian_tail | PractRand/RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x6f2e57d4
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 128 megabytes (2^27 bytes), time= 4.2 seconds
no anomalies in 117 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 256 megabytes (2^28 bytes), time= 9.1 seconds
no anomalies in 124 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 512 megabytes (2^29 bytes), time= 17.8 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-1,T) R= -6.5 p =1-2.9e-3 unusual 
...and 131 test result(s) without anomalies

rng=RNG_stdin32, seed=0x6f2e57d4
length= 1 gigabyte (2^30 bytes), time= 35.7 seconds
no anomalies in 141 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 2 gigabytes (2^31 bytes), time= 70.9 seconds
no anomalies in 148 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 4 gigabytes (2^32 bytes), time= 140 seconds
no anomalies in 156 test result(s)

In particular, it takes 140 seconds to produce 4 gigabytes of output, or equivalently 2^30 outputs (they’re converted to uint32s before piping into PractRand). However, these 2^30 outputs were subsampled from approximately 15787 times as many Gaussians (because we did the 4-sigma tail threshold), so the GPU is actually generating 121 billion Gaussians per second.

121 billion double-precision floats together occupy 968 GB, so it’s producing Gaussian random numbers faster than the Volta can even load/store to global memory (900 GB/s). In other words, it’s faster to run this (in whatever Monte Carlo simulation kernel is consuming the random Gaussians) than it is to load precomputed random Gaussians from the GPU’s global memory.

The GPU is running at full utilisation and PractRand is only showing a long-term average of 37% CPU usage, so (thanks to the tail thresholding) we’re no longer limited by the speed of PractRand. Let’s check up on the PractRand results:

rng=RNG_stdin32, seed=0x6f2e57d4
length= 8 gigabytes (2^33 bytes), time= 281 seconds
no anomalies in 165 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 16 gigabytes (2^34 bytes), time= 562 seconds
no anomalies in 172 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 32 gigabytes (2^35 bytes), time= 1116 seconds
no anomalies in 180 test result(s)

So far this has consumed about 20 minutes on an AWS p3.2xlarge instance, so roughly $1 of compute to generate a total of 136 trillion Gaussians (7 femtodollars per Gaussian!) and test the approximately 8 billion of those that land in the 4-sigma tails.

Consequently, to run PractRand to its default limit of 2^45 bytes on this tail test would take a fortnight and cost roughly $1000, so I’m loath to run it all the way to completion. Let’s instead stop at two terabytes, which takes nearly 20 hours ($60 on AWS):

rng=RNG_stdin32, seed=0x6f2e57d4
length= 64 gigabytes (2^36 bytes), time= 2238 seconds
no anomalies in 189 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 128 gigabytes (2^37 bytes), time= 4468 seconds
no anomalies in 196 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 256 gigabytes (2^38 bytes), time= 8899 seconds
no anomalies in 204 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 512 gigabytes (2^39 bytes), time= 17848 seconds
no anomalies in 213 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 1 terabyte (2^40 bytes), time= 35668 seconds
Test Name Raw Processed Evaluation
[Low1/32]Gap-16:A R= -3.9 p =1-2.5e-3 unusual 
...and 219 test result(s) without anomalies

rng=RNG_stdin32, seed=0x6f2e57d4
length= 2 terabytes (2^41 bytes), time= 71144 seconds
no anomalies in 228 test result(s)

Conclusions and acknowledgements

Between these empirical tests and theoretical analysis, it appears that this is a suitable method of generating non-cryptographic pseudorandom normally distributed random variables on a GPU, particularly suited for computationally intensive scientific and industrial applications such as Monte Carlo simulations.

It should not be used for cryptographic applications: there has been no attempted cryptanalysis of the output function, and based on the simplicity of the construction it is likely to be vulnerable to cryptanalytic attacks. Without the linear projection at the end, it would be especially trivial to invert; given the final a and b registers of all threads in a warp, it’s possible to determine (with about 2^32 effort) the random Hadamard transformation, the values loaded from the lookup table, and therefore the lower 28 bits of the input entropy.

If you need cryptographic random Gaussians (for RLWE, for instance) then you should use the methods described in Dwarakanath and Galbraith’s paper instead.

Finally, many thanks go to Rajan Troll for relevant discussions over the last two months that immensely helped with developing this random Gaussian generator.


Involutions on a finite set

Suppose that R is the (assumed to be finite) set of solutions to a certain problem, and you’re interested in determining the parity of |R|. The following proof strategy works surprisingly often, namely in at least two different scenarios, and leads to very elegant proofs:

  • Embed R in a larger set S equipped with an involution f : S → S, such that R is precisely the set of fixed points of f;
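  • Define a second involution g : S → S, such that the set U of fixed points of g is easily seen to have the correct parity (usually by having 0 or 1 elements).

Since an involution pairs up the elements that it does not fix, the number of fixed points of each involution has the same parity as |S|, so we have |R| ≡ |S| ≡ |U| (mod 2).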

We’ll discuss two examples of this proof technique.

Example I: Zagier’s one-sentence proof

To prove Fermat’s theorem that every prime of the form 4k + 1 is expressible as the sum of two squares, Zagier famously provided the following ‘one-sentence proof’ discussed here:

Here, g is the complicated piecewise involution defined above, and f is the involution which interchanges the second and third coordinates. The set R is the set of ways to write p = x^2 + 4y^2, and the set U is the singleton set containing the fixed point of the complicated involution g.

Observe that this proof is constructive: you can use these involutions f and g to construct a solution to the original problem! Specifically, if you start at the (unique) element of U and alternately apply f and g, the other end of the path is an element of R:

Christian Elsholtz similarly noticed this construction here albeit with the Greek letters α and β to refer to the two involutions.
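The alternating path is short enough to trace in code. The sketch below assumes the standard statement of Zagier's involution on the set S of triples (x, y, z) of positive integers with x² + 4yz = p, and starts from the fixed point (1, 1, k) of g, where p = 4k + 1. The names Triple, involution_f, involution_g and two_squares are illustrative, and the caller is assumed to supply a genuine prime: primality is not checked, and for a composite input the loop need not terminate.

```cpp
#include <cassert>
#include <cstdint>

struct Triple { int64_t x, y, z; };

// f interchanges the second and third coordinates
inline Triple involution_f(Triple t) { return {t.x, t.z, t.y}; }

// g is Zagier's piecewise involution; the boundary cases x = y - z and
// x = 2y cannot occur when p is prime
inline Triple involution_g(Triple t) {
    if (t.x < t.y - t.z) { return {t.x + 2 * t.z, t.z, t.y - t.x - t.z}; }
    if (t.x < 2 * t.y)   { return {2 * t.y - t.x, t.y, t.x - t.y + t.z}; }
    return {t.x - 2 * t.y, t.x - t.y + t.z, t.y};
}

// Walks from the fixed point of g to a fixed point of f (where y == z),
// so that p = x^2 + (2y)^2 for the returned triple.
inline Triple two_squares(int64_t p) {
    assert(p > 1 && p % 4 == 1);
    Triple t = {1, 1, (p - 1) / 4};
    while (t.y != t.z) {
        t = involution_g(involution_f(t));
        assert(t.x * t.x + 4 * t.y * t.z == p);  // we never leave S
    }
    return t;
}
```

For p = 17 the path is (1, 1, 4) → (1, 4, 1) → (3, 1, 2) → (3, 2, 1) → (1, 2, 2), giving 17 = 1² + 4². The path can be long, since its length is only bounded by the size of S.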

At first, I thought that this was a trick rather than a technique. In the words of Donald Knuth:

A trick is a clever idea that can be used once, while a technique is a mature trick that can be used at least twice.

That was until recently when I saw the following.

Example II: Hamiltonian cycles on cubic graphs

The problem is to show that, given a 3-regular graph G with a distinguished edge x, there exist an even number of Hamiltonian cycles passing through x. Using this proof approach, we let:

  • S be the set of proper 3-edge-colourings (with colours red, green, and blue) of the graph G such that x is assigned the colour blue;
  • g be the involution that swaps red and green;
  • f be the involution that swaps blue and green on every blue-green component other than the one containing the edge x.

The fixed points of f are those where the green and blue edges form a single (Hamiltonian) cycle. There are no fixed points of g, so U is the empty set. It follows, therefore, that |S| and |R| are also even.

Again, this proof is constructive: by alternately applying f and g, this gives you an explicit self-inverse way to transform any Hamiltonian cycle through x into a different Hamiltonian cycle through x:

This is a natural bijection, in that it doesn’t involve making any arbitrary choices such as labelling the vertices of G. Even better, it’s acceptable for constructivists since we can provide an explicit upper bound on the length of the path of alternating applications of f and g: it’s no longer than the size of S, which is in turn upper-bounded by 3^(3n/2); there are 3n/2 different edges, each of which is either red, green, or blue.

Anyway, does this technique have a name? If not, I’d propose ‘Zagiery’ after the discoverer of the first proof. And are there any other applications besides the two that we’ve looked at?


Hamming backups: a 2-of-3 variant of SeedXOR

EDIT (2021-10-14): I’ve written a reference implementation of the Hamming backup idea introduced in this article.

SeedXOR is an approach for splitting a Bitcoin wallet seed, adhering to the BIP39 standard, into N ‘parts’ (each the same size as the original) which are each a valid BIP39 seed. It is used by the COLDCARD wallet and preferred over more complicated Shamir secret sharing schemes on the basis that:

  • the SeedXOR calculations are simple to do, such that they can be performed with pencil and paper without the aid of a computer;
  • each of the N parts of a SeedXOR backup is itself a valid BIP39 seed, so they can contain ‘decoy wallets’ and thereby steganographically hide the fact that they’re even part of a SeedXOR backup at all!

On the other hand, Shamir secret sharing enables ‘M-of-N’ backups, where only M parts are necessary to reconstruct the original seed (instead of all N parts).

Here we present a ‘2-of-3’ variant of SeedXOR which retains many of the desirable properties of the original SeedXOR:

  • the seed X is used to construct three ‘parts’ (each the same size as the original seed, as in SeedXOR), called A, B, and C;
  • as in SeedXOR, each of these three parts is indistinguishable from a standard BIP39 seed (giving rise to plausible deniability);
  • this is backwards-compatible with SeedXOR, in that if you have all three parts, then you can reconstruct X by simply taking A ⊕ B ⊕ C (and recomputing the checksum);

However, unlike a regular SeedXOR backup, this modified variant allows for full recovery of the original seed even when one of the three parts is missing:

  • given any two of the three parts, the seed X can be recovered.

This new variant is called a ‘Hamming backup’ because the construction is based on an extended Hamming code. Generating a Hamming backup and recovering a seed from a Hamming backup can be done using only the XOR operation, as with SeedXOR. It’s slightly more complicated than SeedXOR in its construction (which is inevitable, because this implements a 2-of-3 scheme instead of an N-of-N scheme), but still possible to do with pencil and paper using only the XOR operation.

In particular, we firstly regard the 256-bit seed X as the union of two 128-bit values, X1 and X2, and similarly regard the parts A, B, and C as the unions A1+A2, B1+B2, C1+C2. We’ll explain later how to split a BIP39 seed in this manner.

The linear (bitwise XOR) relations that we impose between the different 128-bit values are summarised in the following diagram. (The analogous diagram for ordinary SeedXOR would be a single circle containing N+1 values, namely the seed X together with the N parts A, B, …)

This diagram has three circles, each of which has 4 values inside (in the case of the top-right circle, these are A1, A2, X1, C2) and 4 values outside (in this case, B1, B2, C1, X2). If four values are either all inside the same circle or all outside the same circle, then we impose the relation that they all must bitwise-XOR to zero. For example:

  • A1 ⊕ A2 ⊕ X1 ⊕ C2 = 0
  • B1 ⊕ B2 ⊕ C1 ⊕ X2 = 0

and analogously for the other two circles:

  • B1 ⊕ B2 ⊕ X1 ⊕ A2 = 0
  • C1 ⊕ C2 ⊕ A1 ⊕ X2 = 0
  • C1 ⊕ C2 ⊕ X1 ⊕ B2 = 0
  • A1 ⊕ A2 ⊕ B1 ⊕ X2 = 0

Equivalently, for each of these quartets, any one of the four values is given by the bitwise XOR of the other three.

Further linear relations can be derived from combining existing relations; for example, X1 = A1 ⊕ B1 ⊕ C1 and X2 = A2 ⊕ B2 ⊕ C2. Indeed, there is a group-like closure property: if you take any 3 of the 8 values in the diagram and bitwise-XOR them together, the result of the calculation is one of the 8 original values.

We’ll now discuss how these relations enable the seed X to be recovered from any two of the three parts {A,B,C} of a Hamming backup. Then we’ll discuss how to actually produce a Hamming backup {A,B,C} from an existing seed X together with a source of secure randomness (specifically, by generating A randomly and then deterministically deriving B and C from A and X).

Recovering a seed from a Hamming backup

The relations have the property that, given parts A and B, we can recover the seed X as follows:

  • X1 = A2 ⊕ B1 ⊕ B2 (they’re all inside the left circle)
  • X2 = A1 ⊕ A2 ⊕ B1 (they’re all outside the lower-right circle)

This can be done in three applications of XOR, rather than four, by precomputing the common subexpression A2 ⊕ B1. As such, this is only 50% slower than recovering a seed from a 2-of-2 SeedXOR backup, rather than twice as slow.

Moreover, the whole diagram has order-3 cyclic symmetry, so you can do the same thing with parts B and C, or with parts C and A. That is to say, you can recover the original seed with any two of the three parts.

Important note: when using two parts to generate the seed X, it is important that the two parts are used in the correct order (swapping the roles of A and B, for instance, would result in us generating the remaining part C instead of the desired secret X). If we make this mistake (and recover C instead of X), then X can still be derived by straightforwardly XORing together A, B, and C.

Constructing a Hamming backup

How does one construct a Hamming backup of an existing seed X? The first part, A, should be securely uniformly randomly generated. This will ensure that no information about X is leaked to an attacker who possesses only one of the parts {A, B, C}. (Like SeedXOR, one-time pads, and Shamir secret sharing, this is an information-theoretically secure construction, rather than a construction that depends upon the computational intractability of some problem.)

The remaining parts of the backup can be derived by moving anticlockwise around the diagram, obtaining each value by XORing the previous two (in cyclic order around the diagram) together with one of the values from the seed X. Specifically, we compute:

  • B1 = A1 ⊕ A2 ⊕ X2;
  • B2 = A2 ⊕ B1 ⊕ X1;
  • C1 = B1 ⊕ B2 ⊕ X2;
  • C2 = B2 ⊕ C1 ⊕ X1;
  • A1 = C1 ⊕ C2 ⊕ X2;
  • A2 = C2 ⊕ A1 ⊕ X1;

The last two lines here are redundant — they compute the values A1 and A2 with which we already started — but are useful as a validity check in case you made errors whilst computing B1, B2, C1, and C2. An independent check is to load {A,B,C} into a COLDCARD hardware wallet, pretending that it’s a regular 3-of-3 SeedXOR. If the Hamming backup has been created correctly, this should load the original seed X = A ⊕ B ⊕ C.
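For a machine implementation, both the construction and the recovery reduce to a handful of 128-bit XORs. The following is a minimal sketch and not the reference implementation: the types Half, Part and Backup and the function names are illustrative, each 128-bit value is stored as two 64-bit words, and the part A is taken as a parameter on the assumption that the caller has drawn it from a secure random source.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Half = std::array<uint64_t, 2>;  // one 128-bit value, e.g. A1
struct Part { Half h1, h2; };          // e.g. A = (A1, A2) or X = (X1, X2)
struct Backup { Part a, b, c; };

inline Half xor128(const Half& u, const Half& v) {
    return {u[0] ^ v[0], u[1] ^ v[1]};
}

// Derives B and C from the seed X and a uniformly random part A.
inline Backup make_backup(const Part& x, const Part& a) {
    Part b, c;
    b.h1 = xor128(xor128(a.h1, a.h2), x.h2);  // B1 = A1 ^ A2 ^ X2
    b.h2 = xor128(xor128(a.h2, b.h1), x.h1);  // B2 = A2 ^ B1 ^ X1
    c.h1 = xor128(xor128(b.h1, b.h2), x.h2);  // C1 = B1 ^ B2 ^ X2
    c.h2 = xor128(xor128(b.h2, c.h1), x.h1);  // C2 = B2 ^ C1 ^ X1
    // the two redundant relations, used as a validity check
    assert(xor128(xor128(c.h1, c.h2), x.h2) == a.h1);
    assert(xor128(xor128(c.h2, a.h1), x.h1) == a.h2);
    return {a, b, c};
}

// Recovers X from two parts given in cyclic order: (A, B), (B, C) or (C, A).
inline Part recover(const Part& first, const Part& second) {
    Half common = xor128(first.h2, second.h1);  // e.g. A2 ^ B1
    Part x;
    x.h1 = xor128(common, second.h2);           // X1 = A2 ^ B1 ^ B2
    x.h2 = xor128(common, first.h1);            // X2 = A1 ^ A2 ^ B1
    return x;
}
```

Calling recover with the two parts in the wrong order returns the third part instead of the seed, as described in the note above. The split of a BIP39 seed into its two halves and the recomputation of the checksum are outside the scope of this sketch.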

BIP39-friendly way to split X into X1 and X2

If we only cared about the machine implementation of Hamming backups, then it would be straightforward to split the 256-bit value X (i.e. the entropy part of the seed, with no checksum) into an initial 128-bit segment and a final 128-bit segment.

However, part of the reason for favouring SeedXOR and Hamming backups over more complicated Shamir schemes is the ability to perform the calculations relatively quickly using pencil and paper without the aid of a computer.

The BIP39 standard appends an 8-bit checksum to the 256-bit seed (to create a 264-bit value) and then splits it into 24 words of 11 bits:

  • word 1: [11 seed bits]
  • word 2: [11 seed bits]
  • word 3: [11 seed bits]
  • […]
  • word 23: [11 seed bits]
  • word 24: [3 seed bits + 8 checksum bits]

These are each represented either as English words (from a 2048-element wordlist) or as 3-digit hexadecimal numbers where the first digit encodes 3 bits and the other two digits each encode 4 bits, making up the total of 11.

The hexadecimal representation is used when computing SeedXOR manually; a 16×16 lookup table is provided for applying the bitwise XOR operation to hexadecimal digits.

So that Hamming backups can be conveniently computed using the same mechanisms used by ordinary SeedXOR, we propose splitting the 24-word seed X into the 128-bit components X1, X2 as follows:

  • X1 contains words 1–11 followed by the first two hex digits of word 12;
  • X2 contains words 13–23 followed by the first hex digit of word 24 and the third hex digit of word 12.

On paper, we would write the hexadecimal digits of X1 directly above those of X2, like so:

Observe that only the first hex digit of word 24 is included (since the other two are checksum digits). Also, the first two digits of word 12 contribute to X1, whereas the third digit contributes to X2.

Crucially, the hex digits in the seed have been arranged into the two rows in such a way that each 4-bit digit in the X1 row appears above a 4-bit digit in the X2 row, and each 3-bit digit in the X1 row appears above a 3-bit digit in the X2 row. This means that performing bitwise XOR operations between different rows doesn’t ever yield an invalid value (such as causing the first hexadecimal digit in a word to be 9, which is outside the valid range 0..7).
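The arrangement into two rows can be expressed as a short routine. This is a sketch under the assumption that the 24 words are supplied as 11-bit integers (with the checksum still present in word 24); the names Words, Row, Rows, hex_digit and split_seed are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Words = std::array<uint16_t, 24>;  // 11-bit word indices; w[0] is word 1
using Row   = std::array<uint8_t, 35>;   // hex digits, as written on paper
struct Rows { Row x1, x2; };

// Digit d of a word: d = 0 is the leading 3-bit digit, d = 1 and d = 2
// are the two 4-bit digits.
inline uint8_t hex_digit(uint16_t word, int d) {
    assert(word < 2048 && d >= 0 && d < 3);
    return (word >> (4 * (2 - d))) & 0xF;
}

inline Rows split_seed(const Words& w) {
    Rows r;
    for (int i = 0; i < 11; i++) {
        for (int d = 0; d < 3; d++) {
            r.x1[3 * i + d] = hex_digit(w[i], d);       // words 1-11
            r.x2[3 * i + d] = hex_digit(w[12 + i], d);  // words 13-23
        }
    }
    r.x1[33] = hex_digit(w[11], 0);  // word 12, first digit (3 bits)
    r.x1[34] = hex_digit(w[11], 1);  // word 12, second digit (4 bits)
    r.x2[33] = hex_digit(w[23], 0);  // word 24, first digit (3 seed bits)
    r.x2[34] = hex_digit(w[11], 2);  // word 12, third digit (4 bits)
    return r;
}
```

Every position whose index is a multiple of 3 holds a 3-bit digit in both rows, so XORing rows digit by digit never produces a leading digit outside the range 0..7.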

Relationship with Shamir secret sharing

Hamming backups are isomorphic to a special case of Shamir secret sharing, where the degree of the polynomial is 1 (so it’s a 2-of-N scheme) and the construction operates over the field GF(2^2) of four elements. The three parts {A,B,C} encode the value of the polynomial evaluated at the three nonzero elements of the field, {1,ω,ω+1}, and the seed X encodes the value of the polynomial evaluated at 0.

Since we’re only dealing with a degree-1 polynomial, this construction is linear (which is why we were able to implement it exclusively using XOR). Shamir schemes based on higher-degree polynomials are not amenable to this technique, so do not give rise to arbitrary M-of-N analogues of Hamming backups.

The fact that Hamming backups only offer a 2-of-3 scheme is a limitation compared with the arbitrary M-of-N schemes possible with general Shamir secret sharing. It may be possible to implement M-of-N schemes using more sophisticated binary error-correcting codes, but doing so entails further sacrificing the simplicity of the original SeedXOR scheme.


An efficient prime for number-theoretic transforms

My new favourite prime is 18446744069414584321.

It is given by p = \Phi_6(x) = x^2 - x + 1, where x = 2^{32}. This means that, in the finite field \mathbb{F}_p, 2^32 functions as a primitive 6th root of unity, and therefore 2 is a primitive 192nd root of unity. It turns out that this field possesses multiple properties which make it well-suited to performing number-theoretic transforms (the finite field analogue of the Fast Fourier Transform).


Firstly, note that arithmetic is especially convenient in this field. An element of the field can be represented as an unsigned 64-bit integer, since p is slightly less than 2^64. We can also efficiently reduce modulo p without involving multiplication or division. In particular, if we have a non-negative integer n less than 2^159, then we can write it in the form:

n = A 2^{96} + B 2^{64} + C

where A is a 63-bit integer, B is a 32-bit integer, and C is a 64-bit integer. Since 2^96 is congruent to −1 modulo p, we can then rewrite this as:

B 2^{64} + (C - A)

If A happened to be larger than C, and therefore the result of the subtraction underflowed, then we can correct for this by adding p to the result. Now we have a 96-bit integer, and wish to reduce it further to a 64-bit integer less than p. To do this, we note that 2^64 is congruent to 2^32 − 1, so we can multiply B by 2^32 − 1 using a binary shift and a subtraction, and then add it to the result. We might encounter an overflow, but we can correct for that by subtracting p.

In C/C++, the algorithm is as follows:
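A minimal sketch consistent with the description above is given here. The names P, reduce159 and mulmod, and the convention of passing the input as three limbs (C, B, A), are assumptions of this sketch; unsigned __int128 is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1

// Reduces n = high * 2^96 + middle * 2^64 + low modulo P, where
// high < 2^63 and middle < 2^32. The result lies in [0, P).
inline uint64_t reduce159(uint64_t low, uint64_t middle, uint64_t high) {
    assert(middle < (1ULL << 32) && high < (1ULL << 63));

    // 2^96 is congruent to -1, so subtract the high limb (may wrap)
    uint64_t low2 = low - high;
    if (high > low) { low2 += P; }  // correct for underflow

    // 2^64 is congruent to 2^32 - 1, so form middle * (2^32 - 1)
    uint64_t product = (middle << 32) - middle;

    uint64_t result = low2 + product;  // may wrap
    if ((result < product) || (result >= P)) { result -= P; }

    return result;
}

// Modular multiplication: a 128-bit product followed by the reduction.
inline uint64_t mulmod(uint64_t a, uint64_t b) {
    unsigned __int128 n = (unsigned __int128) a * b;
    return reduce159((uint64_t) n,
                     (uint64_t) (n >> 64) & 0xFFFFFFFFULL,
                     (uint64_t) (n >> 96));
}
```

The arithmetic on uint64_t wraps modulo 2^64, which is what makes the two corrections work: adding or subtracting P after a wrap yields the correctly reduced value.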

This involves no multiplication or division instructions. Indeed, we can take a look at the compiled assembly code by using the Godbolt Compiler Explorer:

Observe that, using LLVM as the compiler, the resulting code is branchless (despite the word ‘if’ appearing twice in the source code) and all of the instructions are particularly cheap.

Now, why would we end up with a 159-bit integer in the first place and want to reduce it? There are two occasions where this subroutine is useful:

  • To perform a multiplication modulo p, we use machine instructions to multiply the two 64-bit operands to give a 128-bit result.
  • Left-shifting a 64-bit integer by up to 95 bits gives a 159-bit result.

The latter allows us to cheaply multiply an element of our field by any power of 2. In particular, if we want to multiply by 2^(96a + b), then do the following:

  • if a is odd, then subtract the input from p (in-place), relying on the fact that 2^96 is congruent to −1;
  • shift-left by b bits (to give a 159-bit result) and reduce using the subroutine.

That is to say, multiplying by any 192nd root of unity is ‘fast’, bypassing the need to perform a general multiplication. (This is particularly useful for GPUs, where there isn’t a 64-bit multiplier in hardware, so multiplication expands to many invocations of the hardware multiplier.)
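The two steps above can be sketched as follows. The reduction subroutine is repeated so that the block stands alone; the names P, reduce159 and mul_pow2 are illustrative, and unsigned __int128 (a GCC/Clang extension) is used only to hold the shifted value before it is split into limbs.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1

// Reduces high * 2^96 + middle * 2^64 + low modulo P.
inline uint64_t reduce159(uint64_t low, uint64_t middle, uint64_t high) {
    assert(middle < (1ULL << 32) && high < (1ULL << 63));
    uint64_t low2 = low - high;
    if (high > low) { low2 += P; }
    uint64_t product = (middle << 32) - middle;
    uint64_t result = low2 + product;
    if ((result < product) || (result >= P)) { result -= P; }
    return result;
}

// Computes x * 2^k modulo P without a general multiplication.
inline uint64_t mul_pow2(uint64_t x, unsigned k) {
    assert(x < P);
    k %= 192;                             // 2 has multiplicative order 192
    unsigned a = k / 96, b = k % 96;
    if (a == 1 && x != 0) { x = P - x; }  // 2^96 is congruent to -1

    if (b < 64) {
        // shifted value has at most 127 bits
        unsigned __int128 n = (unsigned __int128) x << b;
        return reduce159((uint64_t) n,
                         (uint64_t) (n >> 64) & 0xFFFFFFFFULL,
                         (uint64_t) (n >> 96));
    }
    // for b >= 64 the low limb is zero; shift the rest by b - 64
    unsigned __int128 upper = (unsigned __int128) x << (b - 64);
    return reduce159(0,
                     (uint64_t) upper & 0xFFFFFFFFULL,
                     (uint64_t) (upper >> 32));
}
```

As a check, multiplying 1 by 2^191 gives the inverse of 2, namely (p + 1)/2.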

Other roots of unity

As well as possessing these 192nd roots of unity, the field also contains roots of unity of any order dividing p − 1. This factorises as follows:

p - 1 = 2^{32} \times 3 \times 5 \times 17 \times 257 \times 65537

where those odd prime factors are the five known Fermat primes. We can perform a Fourier transform of any length that divides p − 1, since all of the requisite roots of unity exist, but this will be especially efficient if we only involve the smaller prime factors.

Most of these other roots of unity are ‘slow’, meaning that there is no obvious way to multiply by them without performing a general multiplication. There is, however, a nice observation by Schönhage: \sqrt{2} = \zeta + \zeta^7, where ζ is a primitive 8th root of unity, so we can efficiently multiply by the square-root of 2 (which is, of course, a primitive 384th root of unity).
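Schönhage's observation is easy to check numerically in this field, taking ζ = 2^24 (a primitive 8th root of unity, since 2 is a primitive 192nd root) so that ζ^7 = 2^168. The sketch below uses a plain 128-bit remainder for clarity rather than the fast reduction; the helper names are illustrative and unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1

inline uint64_t mulmod(uint64_t a, uint64_t b) {
    return (uint64_t) (((unsigned __int128) a * b) % P);
}

// 2^k modulo P, by repeated doubling
inline uint64_t pow2mod(unsigned k) {
    uint64_t r = 1;
    for (unsigned i = 0; i < k; i++) { r = mulmod(r, 2); }
    return r;
}

// A square root of 2 in the field, as zeta + zeta^7 with zeta = 2^24
inline uint64_t sqrt2() {
    uint64_t zeta = pow2mod(24);
    uint64_t zeta7 = pow2mod(168);
    assert(mulmod(zeta, zeta7) == 1);  // zeta^8 = 2^192 = 1
    return (uint64_t) (((unsigned __int128) zeta + zeta7) % P);
}
```

Since 2^168 is congruent to −2^72, multiplying by this square root of 2 amounts to two shifts and a subtraction.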

A length-N FFT involves the Nth roots of unity, so we would want to use FFTs of length dividing 192 (or 384 using the Schönhage trick) whenever possible. This suggests the following algorithm for convolving two sequences:

  • choose an FFT length N that divides 3 \times 2^{32} and is large enough to accommodate the result of the convolution;
  • write N as a product of (at most 5) numbers dividing 384. For example, in the largest case we could write N = 128 × 128 × 128 × 96 × 64;
  • use this decomposition as the basis of a mixed-radix Cooley-Tukey FFT. That is to say, we apply:
    • N/128 FFTs of length 128;
    • N/128 FFTs of length 128;
    • N/128 FFTs of length 128;
    • N/96 FFTs of length 96;
    • N/64 FFTs of length 64.

Since there are at most 5 rounds of FFTs, we only need to apply arbitrary twiddle-factors (which can be precomputed and stored in a table) 4 times, i.e. between successive rounds of FFTs. Compare this to an FFT over the complex numbers, where there are only 4 ‘fast’ roots of unity (±1 and ±i) and therefore irrational twiddle factors must be applied much more often.

Fürer’s algorithm

Martin Fürer’s fast integer multiplication algorithm uses this idea recursively, expressing the integer multiplication problem as a convolution over a suitably chosen ring with plenty of fast roots of unity. In particular, he writes:

We will achieve the desired speed-up by working over a ring with many “easy” powers of ω. Hence, the new faster integer multiplication algorithm is based on two key ideas:

  • An FFT version is used with the property that most occurring roots of unity are of low order.

  • The computation is done over a ring where multiplications with many low order roots of unity are very simple and can be implemented as a kind of cyclic shifts. At the same time, this ring also contains high order roots of unity.

The field \mathbb{F}_p for p = 2^{64} - 2^{32} + 1 is a fixed-size ring with these desirable properties. It supports multiplication of integers up to 3 \times 2^{35} bits (12 GB) by writing them in ‘balanced base 2^16’ (where each ‘digit’ can be between −32768 and 32767). Each of the two integers will occupy at most 3 \times 2^{31} digits, so the result can be computed by a convolution of length 3 \times 2^{32}. The maximum absolute value of a coefficient that could occur as a result of such a convolution is 3 \times 2^{61} < \frac{1}{2} p, so the coefficient can be recovered exactly if the convolution is performed over \mathbb{F}_p by reducing each coefficient into the range [−p/2, +p/2].
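The two conversions at either end of such a convolution are simple. The sketch below shows one way to produce balanced base-2^16 digits from ordinary ones, and to lift a residue back to a signed coefficient; the names balance_digit and centred_lift are illustrative, and the carry-propagation convention is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1

// The largest coefficient magnitude, 3 * 2^61, is below p/2
static_assert((3ULL << 61) < P / 2, "coefficients must be recoverable");

// Converts an ordinary base-2^16 digit (plus an incoming carry of 0 or 1)
// into a balanced digit in [-32768, 32767], updating the carry.
inline int16_t balance_digit(uint32_t digit, uint32_t& carry) {
    assert(digit < 65536 && carry <= 1);
    uint32_t v = digit + carry;       // between 0 and 65536
    carry = (v >= 32768) ? 1 : 0;     // borrow 2^16 from the next digit
    return (int16_t) ((int32_t) v - (int32_t) (carry << 16));
}

// Lifts a residue in [0, P) to its representative in [-P/2, +P/2].
inline int64_t centred_lift(uint64_t r) {
    assert(r < P);
    if (r > P / 2) { return -(int64_t) (P - r); }
    return (int64_t) r;
}
```

After the inverse transform, applying centred_lift to each residue recovers the exact integer coefficient, and a final carry-propagation pass turns the coefficients back into digits.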

Of course, Fürer’s paper was interested in the asymptotics of integer multiplication, so it needs to work for arbitrarily large integers (not just integers of size at most 12 GB). The remaining idea was therefore to apply this idea recursively: to split a length-N integer into chunks of size O(log N), and use Fürer’s algorithm itself to handle those chunks (by splitting into chunks of size O(log log N), and so forth, until reaching a ‘base case’ where the integers are sufficiently small that FFT-based methods are unhelpful).

The number of layers of recursion is therefore the number of times you need to iteratively take the logarithm of N before log log log … log log N < K, where K is the base-case cutoff point. This is called the iterated logarithm, and denoted \log^{\star} N.

Each layer of recursion in Fürer’s algorithm costs a constant factor F more than the previous layer, so the overall complexity is O(N \log N F^{\log^{\star} N}). It was the asymptotically fastest multiplication algorithm between its invention in 2007 (taking the crown from Schönhage-Strassen) and 2019 (when an O(N log N) algorithm was discovered).

Practical integer multiplication

In practice, large integer multiplication tends to use either Schönhage-Strassen (in the case of the GNU Multiprecision Library), or floating-point FFTs, or a bunch of number-theoretic transforms over different primes with a final application of the Chinese Remainder Theorem (in the case of Alexander Yee’s y-cruncher). The primes used by these number-theoretic transforms don’t support any ‘fast’ roots of unity, though; all of the twiddle factors are applied using multiplications.

This suggests that a number-theoretic transform over the field of 18446744069414584321 elements may indeed be competitive for the sizes of integers it supports (e.g. up to 12 GB). For larger integers, we can use the Schönhage-Strassen algorithm with this prime field as a base case (one layer of Schönhage-Strassen being more than enough to support multiplication of integers that can be stored in practice).


Hamming cube of primes

Given two nonnegative integers, m and n, we say that they are Hamming-adjacent if and only if their binary expansions differ in exactly one digit. For example, the numbers 42 and 58 are Hamming-adjacent because their binary expansions 101010 and 111010 differ in a single position.

If m and n are Hamming-adjacent, then their absolute difference |n − m| is necessarily a power of 2. The converse is not true, though; 24 and 32 have a difference of 8, but their binary expansions 011000 and 100000 differ in three positions.

We can form a countably infinite graph G on the set of all nonnegative integers by connecting two vertices if and only if they’re Hamming-adjacent. G is a bipartite graph: two integers can only be Hamming-adjacent if one is odious and the other is evil.
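Both notions are one-liners on machine integers. In the following sketch (with illustrative function names), two integers are Hamming-adjacent precisely when their XOR is a power of 2, and an integer is odious when it has an odd number of ‘1’ bits.

```cpp
#include <cassert>
#include <cstdint>

// True when m and n differ in exactly one binary digit.
inline bool hamming_adjacent(uint64_t m, uint64_t n) {
    uint64_t d = m ^ n;
    return d != 0 && (d & (d - 1)) == 0;  // d is a power of 2
}

// True when n has an odd number of set bits (odious); false when evil.
inline bool is_odious(uint64_t n) {
    bool odd = false;
    while (n) { odd = !odd; n &= n - 1; }  // clear the lowest set bit
    return odd;
}

// The examples from the text
inline void check_examples() {
    assert(hamming_adjacent(42, 58));
    assert(!hamming_adjacent(24, 32));
    assert(is_odious(42) != is_odious(58));
}
```

Flipping a single bit changes the number of set bits by exactly one, which is why adjacent vertices always lie on opposite sides of the odious/evil bipartition.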

G is the union of nested hypercube graphs: the first 2^d nonnegative integers form a hypercube graph of degree d. For example, here’s the subgraph induced by the first 32 naturals:

What about if we take the induced subgraph on only the vertices which are primes? For the primes below 32, we get the following graph:

It’s a connected graph! It remains connected if we repeat for the primes below 64:

What about for the primes below 128? Now it has an isolated vertex, 127, and an isolated edge:

When we go further and look at the primes below 512, the vertex 127 is no longer isolated; it has been connected into the main component via the prime 383:

Similarly, the edge between 89 and 73 becomes assimilated into the main component once we increase our horizon to the primes below 1024.

This raises the question: does every prime eventually become connected to the main component? Or, equivalently, when we form the countably infinite induced subgraph H of G whose vertices are all of the primes, is H connected?

Isolated vertices

It turns out that the answer is no: the prime 2131099 is an isolated vertex in H with no neighbours whatsoever! (It is plausibly the smallest such example.)

How does one even prove that 2131099 is an isolated vertex? In other words, how do we show that all of the Hamming-adjacent integers are composite? Firstly, note that the Hamming-adjacent integers smaller than 2131099 form a finite set, each obtained by subtracting one of the powers of 2 in its binary expansion:

33947, 2098331, 2130075, 2130971, 2131083, 2131091, 2131097, 2131098
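The list above can be regenerated mechanically, and each entry checked for a nontrivial factor. This is a small sketch using trial division; `smaller_neighbours` and `smallest_factor` are names introduced here for illustration only.

```python
def smaller_neighbours(n: int) -> list[int]:
    """Integers obtained from n by clearing one of its set bits."""
    return sorted(n - (1 << k) for k in range(n.bit_length()) if (n >> k) & 1)


def smallest_factor(n: int) -> int:
    """Smallest divisor of n greater than 1 (equals n exactly when n is prime)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


for m in smaller_neighbours(2131099):
    # A smallest factor strictly below m certifies that m is composite.
    print(m, smallest_factor(m))
```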

These can all be verified to be composite. But what about the infinitely many Hamming-adjacent integers that are larger than 2131099? Each of these is necessarily of the form $2131099 + 2^k$ for some value k. It transpires that every number of this form is divisible by at least one of the primes {3, 5, 7, 13, 17, 241}, called the covering set. In particular, we have:

  • k ≡ 1 (mod 2) implies divisible by 3;
  • k ≡ 0 (mod 4) implies divisible by 5;
  • k ≡ 1 (mod 3) implies divisible by 7;
  • k ≡ 2 (mod 12) implies divisible by 13;
  • k ≡ 2 (mod 8) implies divisible by 17;
  • k ≡ 6 (mod 24) implies divisible by 241;

and together these cover all residue classes modulo 24.
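Since 2 has multiplicative order dividing 24 modulo each prime in the covering set, it suffices to check the 24 residue classes of k once. The following sketch does this; the names `N`, `COVERING`, `PERIOD` and `covering_prime` are our own.

```python
N = 2131099
COVERING = [3, 5, 7, 13, 17, 241]
PERIOD = 24


def covering_prime(k: int):
    """Return a covering prime dividing N + 2^k, or None if there is none."""
    for p in COVERING:
        if (N + pow(2, k, p)) % p == 0:
            return p
    return None


# 2^24 is 1 modulo every covering prime, so divisibility depends only on k mod 24.
assert all(pow(2, PERIOD, p) == 1 for p in COVERING)

for k in range(PERIOD):
    print(k, covering_prime(k))
```

Because 2131099 + 2^k always exceeds 241, divisibility by a covering prime does imply compositeness here.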

We can go further and show that there are infinitely many such isolated vertices. In particular, every number of the following form:

n = 1885107369798300628981297352055 h + 3316923598096294713661

has a covering set of primes for all numbers of the form $n \pm 2^k$, as discovered by Christophe Clavier. Specifically, there are two covering sets of primes: one for the ‘+’ case and one for the ‘−’ case; the union of these two sets is just the set:

{3, 5, 7, 11, 13, 17, 19, 31, 37, 41, 73, 97, 109, 151, 241, 331, 673, 1321}

of prime divisors of the linear coefficient.
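The covering claim can be checked in the same periodic fashion. The sketch below takes the prime divisors of the linear coefficient (verified by multiplying them back together) and tests every residue of k over one full period, for both signs. Since n ≡ B modulo each of these primes regardless of h, checking B alone suffices. All names (`A`, `B`, `PRIMES`, `order_of_two`, `covered`) are ours, and `math.lcm` with several arguments assumes Python 3.9 or later.

```python
from math import lcm, prod

A = 1885107369798300628981297352055  # linear coefficient
B = 3316923598096294713661           # offset
PRIMES = [3, 5, 7, 11, 13, 17, 19, 31, 37, 41, 73, 97, 109, 151, 241, 331, 673, 1321]

# The listed primes multiply back to the linear coefficient.
assert prod(PRIMES) == A


def order_of_two(p: int) -> int:
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (x * 2) % p
        k += 1
    return k


# Divisibility of n + 2^k or n - 2^k by these primes is periodic in k with this period.
PERIOD = lcm(*(order_of_two(p) for p in PRIMES))


def covered(sign: int) -> bool:
    """True if, for every k, some listed prime divides B + sign * 2^k."""
    return all(
        any((B + sign * pow(2, k, p)) % p == 0 for p in PRIMES)
        for k in range(PERIOD)
    )


print(PERIOD, covered(+1), covered(-1))
```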

So, we just need to show that there are infinitely many primes of this form that are not Hamming-adjacent to any of the primes in the above set (a multiple of a covering prime is composite unless it is that prime itself). The latter condition is easy to enforce: for example, insist that n is congruent to 4095 mod 4096 (i.e. its binary expansion ends in twelve ‘1’s), so that every Hamming-adjacent integer is at least 2047, which exceeds 1321, the largest prime divisor of the linear coefficient. Then we are left with an arithmetic progression whose offset and common difference are coprime, and a theorem of Dirichlet states that there are infinitely many primes in such an arithmetic progression.
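The combined arithmetic progression can be written down explicitly with the Chinese remainder theorem. This is a sketch under the assumption that we impose exactly the congruence n ≡ 4095 (mod 4096) suggested above; the variable names are ours, and `pow(A, -1, M)` requires Python 3.8 or later.

```python
from math import gcd

A = 1885107369798300628981297352055  # linear coefficient
B = 3316923598096294713661           # offset
M = 4096

# Solve A*h + B ≡ M - 1 (mod M) for h; A is odd, hence invertible modulo M.
h0 = ((M - 1 - B) * pow(A, -1, M)) % M

offset = A * h0 + B  # smallest nonnegative member of the combined progression
step = A * M         # common difference of the combined progression

print(offset, step, gcd(offset, step))
```

Dirichlet's theorem applies to the progression offset + step * t precisely when the printed gcd is 1.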

A difficult conjecture

Given the existence of infinitely many isolated vertices, we postulate the following conjecture:

Conjecture: Other than the infinitely many isolated vertices, there is exactly one connected component in H.

What would a counterexample look like? The simplest would be a pair of Hamming-adjacent primes p > q, neither of which has any further neighbours. The usual method of covering sets would only work under the following set of conditions:

  1. p has a covering set P preventing any larger Hamming-adjacent primes;
  2. q has a covering set Q preventing any larger Hamming-adjacent primes, with the single exception of p, and this is only possible if p is itself a member of Q;
  3. The finitely many numbers smaller than and Hamming-adjacent to p all happen to be composite, with the single exception of q;
  4. The finitely many numbers smaller than and Hamming-adjacent to q all happen to be composite.

The second of these conditions is incredibly implausible, because the primes appearing in a covering set are usually much smaller than the dual Sierpinski numbers that arise from such a covering set. Here we are asking for the prime in the covering set to be larger!

Note that this conjecture is definitely unsolved, because it’s stronger than the conjecture that there are primes of every Hamming weight.
