29-year-old Conway conjecture settled

Ilkka Törmä and Ville Salo, a pair of researchers at the University of Turku in Finland, have found a finite configuration in Conway’s Game of Life such that, if it occurs within a universe at time T, it must have existed in that same position at time T−1 (and therefore, by induction, at time 0). Here is the configuration of live and dead cells, surrounded by an infinite background of grey “don’t care” cells:

The configuration was discovered by experimenting with finite patches of repeating ‘agar’ and using a SAT solver to check whether any of them possess this property. Similarly, one can use a SAT solver to verify that Törmä and Salo’s result is correct.

Since this configuration can be stabilised (by the addition of further live cells, shown in yellow) into a finite still-life, this demonstrates that not every still-life can be constructed by colliding gliders.

The first finite stabilisation was 374 cells, but this was promptly reduced to 334 cells by Danielle Conway and then to the 306-cell configuration above by Oscar Cunningham. Oscar moreover proved, again using SAT solvers, that this is the minimum-population stabilisation of the Törmä-Salo configuration.

Consequently, we have the following pair of bounds:

  • Every strict still-life with ≤ 20 cells can be synthesised by gliders.
  • There exists a strict still-life with 306 cells that cannot be synthesised.

More importantly, the Törmä-Salo result positively answers a question first posed by John Conway himself on 24th August 1992:

The things buildable by gliders (an idea I think first popularized
by Buckingham) are a nice class, mainly because they are provably of
infinite “age” (at least if you define them correctly). I’m sure we
should NOT believe, however, that everything of infinite age is so
buildable (even if most of us do). I expect that there is a still life
of such delicacy that in some essential sense it is its only ancestor –
though obviously that sense must allow for fading configurations outside
it, and probably allow for more.

This brings me to an interesting point – the false lessons experience
might teach us. Experience is a bad guide to large configurations – it
teaches us perhaps that there is no orphan, that almost all configurations
die down pretty soon – whereas almost all configurations ARE orphans, of
course, and PROBABLY almost all configurations grow infinitely, as you
asserted in your note, but I’m sure not meaning that it was provably true.

A non-constructible
Sorry – A non-(glider-)constructible configuration might be something
that’s almost an orphan, in that it can only arise from a similar
configuration at the previous time, which itself can only arise from … .

Indeed, is there a Godlike still-life, one that can only have existed
for all time (apart from things that don’t interfere with it)? I like
this one! I imagine it might be findable too, by a version of the searches
that found the old orphans (gardens-of-eden), but restricted to still-lifes.

Well, I’m going out to get a hot dog now, so will stop this. It was
originally intended to be only a very much shorter thank-you note, and so
was addressed only to you – please circulate it if you like. JHC

The construction also implies a solution to the generalised grandfather problem: a pattern which has an N-tick predecessor but not an (N+1)-tick predecessor. The diameter of such a pattern grows like Θ(sqrt(log(N))).

Previous results were known for small values of N (N=0 by Roger Banks, and N=1,2,3 by mtve). Recently Törmä and Salo settled the problem for all positive integers, but the diameter of the pattern implied by their proof grows like Θ(N). A few days later they discovered the pattern in this post, which implies the stronger (and indeed optimal up to a constant factor) result above.

In other GoL-related news:

  • David Raucci discovered the first oscillator of period 38. The remaining unsolved periods are 19, 34, and 41.
  • Darren Li has connected Charity Engine to Catagolue, providing approximately 2000 CPU cores of continuous effort and searching slightly more than 10^12 random initial configurations per day.
  • Nathaniel Johnston and Dave Greene have published a book on Conway’s Game of Life, featuring both the theoretical aspects and engineering that’s been accomplished in the half-century since its conception. Unfortunately it was released slightly too early to include the Törmä-Salo result or Raucci’s period-38 oscillator.
Posted in Uncategorized | 17 Comments

Training a random Gaussian generator

I’ve spent the last couple of months tackling the problem of designing an algorithm to rapidly generate high-quality normally-distributed pseudorandom numbers on a GPU. Whilst this may seem quite pedestrian, it turned out to be much more interesting than I’d anticipated: the journey involved Hermite polynomials, generating functions, badly approximable numbers, quasi-Newton methods, and nearest-neighbour spatial queries.

Traditional methods

There are a few well known methods for producing normally-distributed random numbers from uniformly-distributed inputs.

The simplest approach is to just generate a uniformly-distributed random variable and then apply the inverse CDF of the Gaussian distribution. Unfortunately, the inverse CDF of the Gaussian distribution is not easy to compute; it’s a transcendental function with no closed form, so evaluating it on a computer is difficult: you’d need to use a Taylor series or a Padé approximant with enough terms that the output distribution is sufficiently close (in terms of KL-divergence) to a true Gaussian distribution such that they cannot be easily distinguished.

The other problem with this approach is that the output is a deterministic function of a single uniform input. If the input is a random 32-bit integer, then this generator would only be capable of generating 2^32 different output values.

Even if the input is a random 64-bit integer, it would still have a major problem: any value that it can produce must occur with probability at least 1/2^64. If we generated 2^80 random outputs, then there would be far too many (~ 2^16) outside the interval [−9.5, 9.5] or far too few (exactly zero). On the other hand, a true Gaussian generator should produce an expected 2537 outputs with magnitude greater than 9.5.

A commonly used alternative is the Box-Müller method, which takes a pair of uniform random variables and emits a pair of independent standard Gaussians. It uses the fact that if (x, y) is a bivariate standard Gaussian, then the polar coordinates (r, θ) are independent and have simpler distributions: r² follows an exponential distribution and θ is uniformly distributed in [0, 2π].

Whilst it doesn’t involve anything as exotic as the inverse CDF of a Gaussian distribution, it still requires the computation of a logarithm and a square-root (to determine r) and trigonometric functions (to transform from polar coordinates to Cartesian coordinates), so it is also rather computationally expensive.

There are also algorithms based on rejection sampling, such as Marsaglia’s ziggurat algorithm, which are very performant on scalar architectures (such as CPUs) but less so on vector architectures (such as GPUs) because of the presence of conditional branching.

Algorithms based on summing independent random variables

If you were to neglect the cost of generating the uniform random numbers themselves, then an especially cheap (but low-quality) approximation to a standard Gaussian is obtained by taking 12 random U[0,1] numbers and computing their alternating sum:

A − B + C − D + E − F + G − H + I − J + K − L

This works relatively well, but not fantastically (you can visually distinguish between this probability density function (Irwin-Hall) and the counterpart for a true standard Gaussian):

Why is it such a good approximation? Simply put, each uniform random variable has a variance of 1/12, so summing them produces a unit-variance output, and the use of an alternating sum ensures that the output mean is 0. The central limit theorem tells us that summing lots of (finite-variance!) i.i.d. random variables together and rescaling will approximate a Gaussian, and 12 is reasonably close to ‘lots’.

We could improve the quality of the generator by using more uniform inputs, but in reality they’re not free: they have to be produced by a pseudorandom number generator such as PCG or xorwow. Another problem is that it’s far too thin-tailed: the kurtosis of the output is detectably less than 3 (platykurtic), and the number of outputs you need to sample to be sure of that only grows like the square of the number of independent uniforms that you’re summing: so you’d need to sum on the order of 2^20 independent uniforms per output if you want the generator to be able to produce 2^40 outputs before failing the test. Naturally, generating a million uniform random numbers to produce a single output is extremely costly, so the Irwin-Hall method is insufficient for our purposes.

Wallace’s method

Christopher Wallace proposed a stateful method of generating Gaussians, which admits efficient vectorised implementations such as this paper by Brent.

The basic idea is that if you have a set of n i.i.d. Gaussians, then applying an orthogonal matrix in SO(n) produces a different set of n i.i.d. Gaussians (which, of course, are not independent from the first set). Wallace’s generator maintains an internal ‘pool’ of n i.i.d. Gaussians and has the following state-update and output functions:

  • State update: apply a sparse random orthogonal matrix to the pool. This is often composed of a permutation and a block-diagonal matrix of Hadamard matrices. Either the permutation or the block-diagonal matrix of Hadamard matrices should be randomised (i.e. depend on bits emitted by a pseudorandom number generator) to ensure that the orthogonal matrix is unpredictable.
  • Output function: project down by taking a proper subset of the n Gaussians in the pool. Scale all of these by (the same) random χ²-distributed correction term to correct for the fact that the orthogonal matrix preserves the sum of squared Gaussians in the pool.

To overcome the sparsity of the state-update function and obtain superior mixing properties, one could repeat the state-update function several times between each consecutive output of the generator.

A new algorithm

The key ingredient of Wallace’s generator is the random orthogonal linear transformation that’s applied across the warp to update the state. I then had an idea: what if we use a Wallace-like orthogonal linear transformation not as a state-update function, but as an output function?

As with approaches such as the ziggurat method and Box-Muller, we presuppose that we have a good pseudorandom generator of uniform 32-bit integers, such as Melissa O’Neill’s PCG, and concentrate on the problem of converting this uniformly-distributed input into a normally-distributed output.

In particular, we propose the following architecture:

  • each of the 32 threads in the warp generates a uniform pseudorandom 32-bit integer;
  • each thread uses some of these input bits to sample two independent weak approximations to a Gaussian, for a total of 64 random variables;
  • a random orthogonal linear transformation is applied across the warp to this set of 64 random variables (viewed as a vector in R^64);
  • we project down to R^32 by linearly combining the two values in each thread to produce one double-precision floating-point value per thread;
  • a small uniform random perturbation is added to the output.

The weak approximations are loaded from a lookup table (using 8 bits of input randomness to load the absolute value from a 256-element lookup table, and a 9th bit to determine the sign). It is necessary to use this lookup table for the weak approximations rather than, say, a uniform distribution, because then our distribution would again be too platykurtic. An alternative could be to use Box-Muller to generate the input distributions, which would be high quality but expensive to compute.

The purpose of the orthogonal linear transformation is twofold: firstly, by the central limit theorem, the marginal distribution of the output is much closer to a Gaussian than the ‘weak approximations’ that we’ve linearly combined; secondly, it aggregates randomness from all 32 of the threads in the warp, so that each output is dependent on 1024 random input bits instead of just 32.

Each of the outputs can each be considered to be a weighted sum of independent random variables, similar to the Irwin-Hall approach where 12 uniform distributions are summed. The orthogonal linear transformation extracts multiple such weighted sums in parallel from the same set of input random variables; the results are non-independent by construction, but the orthogonality of the matrix means that they’re still (linearly) uncorrelated.

Because our approach is stateless (we sample fresh random variables and then apply an orthogonal transformation), we don’t require the annoying χ²-distributed correction term present in Wallace’s generator.

We make a few decisions for efficiency:

  • Everything before the final linear projection is done in 32-bit integer arithmetic, with the final linear combination happening in full double-precision floating-point arithmetic.
  • The lookup table resides in CUDA shared memory.
  • For perfect bank conflict avoidance, thread n only accesses elements of the shared memory array congruent to n (mod 16).
  • As such, the shared memory array requires 4096 entries (16 KB), consisting of 16 interleaved lookup tables of 256 entries each.
  • There is no reason for these 16 lookup tables to have identical contents, meaning that different residue classes (mod 16) of threads can load different distributions. It transpires that this flexibility does indeed help.

Our random orthogonal linear transformation is a product of sparse matrices, so as to limit the amount of communication between threads. Rather than describe it verbally, it is easiest to draw a dataflow schematic of what happens in each thread. Each of the yellow diamonds is a ‘conditional negate’ operation, which either performs the identity operation or negates its input with probability ½.

This can be viewed as consisting of 5 copies of the following layer, stacked vertically:

(Technically, if you stacked two of these layers together verbatim, then the ‘a‘ register would have two successive yellow diamonds between the two layers. However, a pair of successive probability-½ negations is functionally equivalent to a single probability-½ negation, so we can ‘fuse’ the two diamonds into one.)

What does this layer do? It uses 3 random bits to uniformly randomly select one of the eight 2×2 Hadamard matrices enumerated below and multiplies it by the input vector (a, b):

These are, ignoring the scale factor of sqrt(2), precisely the orthogonal transformations where each coordinate of the output depends equally on each coordinate of the input; that is to say, they are ‘maximally mixing’. Omitting the scale factor of sqrt(2) means that this is implementable just using addition/subtraction of integers, and is exact (no rounding errors).

The whole warp performs an FFT-style butterfly network of independent random 2×2 Hadamard matrices; by the end of that process, the ‘a‘ register of a thread contains an equal mix of the 32 input variables from that thread’s half-warp; the ‘b‘ register contains an equal mix of the 32 input variables from the other half-warp.

Linear combination

Finally, the values in the two registers are linearly combined (in 64-bit floating point, where one register contributes 5/9 of the variance and the other contributes 4/9 of the variance) and an additional low-weight uniform random variable c is added for additional smoothing.

Why are the variance proportions 4/9 and 5/9? They were chosen because the coefficients in the linear combination, sqrt(2)/3 and sqrt(5)/3, are in the ratio 1 : (φ − ½), which is difficult to approximate with rational numbers whilst still keeping the contributions roughly equal in magnitude.

If the ratio were rational, say in the ratio p : q with small integer coefficients, then the output would be (ap + bq)/f, so the output could only take on discrete integer multiples of f. As such, the output distribution could be distinguished from a Gaussian by looking at the Fourier transform of the distribution — it would have a massive peak at the frequency f and multiples thereof.

The final uniform random variable c is added with a low weight at the very end to further smooth out any high-frequency effects. This is especially important when the output is close to 0, because the absolute precision of an IEEE floating-point number is higher in the neighbourhood of 0; this final perturbation helps to randomise the low-order bits of the mantissa of the output.

How not to populate the lookup tables

My initial idea for populating the 4096-element lookup table was to take the values of the inverse CDF of the Gaussian distribution sampled at 4096 equally spaced points in the interval [½, 1]. This, after all, seemed like a natural way to produce a ‘discrete approximation’ to the Gaussian distribution. This produced the following initial attempt (each row is one of the 16 interleaved distributions).

This produced underwhelming results: after generating 2^38 Gaussians using this method and computing the empirical nth moment (mean of X^n) for the first 8 positive integers n, the higher even moments already fail a z-test (we know what the true moments are for a Gaussian, as well as their standard errors).

[----------] 1 test from Gaussian
[ RUN ] Gaussian.Moments
274877906944 trials completed.
moment 1: discrepancy = -2.79812e-06; value = -2.79812e-06
moment 2: discrepancy = -2.41684e-06; value = 0.999998
moment 3: discrepancy = -1.46413e-05; value = -1.46413e-05
moment 4: discrepancy = -0.000106436; value = 2.99989
moment 5: discrepancy = -5.0672e-05; value = -5.0672e-05
moment 6: discrepancy = -0.00155006; value = 14.9984
moment 7: discrepancy = 0.00010665; value = 0.00010665
moment 8: discrepancy = -0.0224637; value = 104.978
[ OK ] Gaussian.Moments (322261 ms)
[----------] 1 test from Gaussian (322261 ms total)

Clearly, we need better ideas.

Optimising the lookup table

The key idea is to note that sampling those 2^38 random Gaussians was actually unnecessary, because we can analytically compute the moments of the distribution produced by our generator. For this initial attempt, we can see that the moment test which fails first is the 8th moment, where the expected time to reach a 4-sigma statistic is 9.1 × 10^10.

However, this realisation that we can analytically compute the expected time to fail the moment tests is our salvation: given a loss function that’s a differentiable function of the 4096 numbers in the lookup table, we can use gradient descent to tweak those numbers to optimise the loss function! We’ll initialise the optimisation with our poor initial attempt and let gradient descent do the rest.

What loss function do we choose? Intuitively, we want the reciprocal of the minimum expected time to fail any of the moment tests. It is, however, important to note that the moments are just expectations E[X^n] of monomials for different powers n, and it may well be the case that there’s a polynomial p such that E[p(X)] provides a more effective test — that is to say, a linear combination of moments is more effective than any individual moment.

Up until now, we’ve been looking at moments, which are the expectations of the elements of the monomial basis {1, X, X^2, X^3, …}. It turns out to be much more elegant to use a different basis, namely the Hermite polynomials:

  • He_0(X) = 1
  • He_1(X) = X
  • He_2(X) = X^2 − 1
  • He_3(X) = X^3 − 3X
  • He_4(X) = X^4 − 6X^2 + 3

In particular, they are orthogonal polynomials with respect to a standard Gaussian: the expected value E[He_m(X) He_n(X)] is zero when m and n are different, and is n! otherwise. Other than the 0th Hermite polynomial, these all have zero mean, and the covariance matrix Σ is the diagonal matrix whose entries are factorials.

The word ‘moments‘ exists to describe the expected values of powers of a random variable; for brevity, we’ll analogously use ‘hermites‘ (yes, as a plural common noun!) to refer to the expected values of Hermite polynomials of a random variable.

That provides a very convincing choice of loss function: we take log(x^T P x), where P = inv(Σ) is the (conveniently diagonal!) precision matrix of the vector of the first n non-constant even Hermite polynomials under a standard Gaussian assumption, and x is the vector of the first n non-constant even hermites of our approximation of a Gaussian distribution. The inner product x^T P is proportional to the reciprocal of the worst time to failure of any polynomial test of degree <= 2n.

It turned out to be more numerically viable to optimise the loss function for the simple convolution of the 16 base distributions, even though the actual output of our generator involves more convolutions (the output depends on 64 random variables, 4 from each of the 16 base distributions). These extra convolutions only make the generator better, not worse, so you can view the loss function we’re using as a conservative bound on the generator’s true quality.

We consider a slightly generalised loss function, log(x^T D x) where D is an arbitrary positive-definite diagonal matrix which specifies the relative importance of the different Hermite polynomials.

I used PyTorch as the automatic differentiation framework, so that I only need to implement the forward function and not care about differentiating it myself. PyTorch is chiefly used for training neural networks, where the loss function is too complicated to be symbolically differentiated; instead, it uses ‘reverse-mode’ automatic differentiation (backpropagation).

In order to implement this loss function in PyTorch, it’s necessary to be able to do two things:

  • Compute the hermites of a discrete distribution;
  • Express the hermites of a linear combination of independent random variables in terms of the hermites of those random variables.

The first of these is easy with the recurrence relation for Hermite polynomials. The second is more involved, and ultimately required discovering what appears to be a new result:

Proposition: Let a, b, x, y \in \mathbb{R} with a^2 + b^2 = 1. Then we have:

He_n(ax + by) = \sum\limits_{k=0}^n \binom{n}{k} a^k b^{n-k} He_k(x) He_{n-k}(y)

(Observe that when a = b = \frac{1}{2} \sqrt{2}, this proposition is equivalent to G. Colomer’s first identity mentioned on the MathWorld page. [EDIT: Thanks to Blinkerspawn for pointing out the typo here; I’d originally written sqrt(2) instead of sqrt(2)/2])

Proof: By expanding the binomial coefficient in terms of factorials and pulling out the factor of n!, this is seen to be equivalent to saying that the sequence {He_n(ax + by) / n!} is the convolution of the sequences {a^n He_n(x) / n!} and {b^n He_n(y) / n!}. This is equivalent to saying that the exponential generating function of {He_n(ax + by)} is the product of the exponential generating functions of {a^n He_n(x)} and {b^n He_n(y)}. This generating function is already known, so we can equivalently express this as:

e^{(ax + by)t - \frac{1}{2} t^2} = e^{xat - \frac{1}{2}(at)^2} e^{ybt - \frac{1}{2}(bt)^2}

which is indeed true when a^2 + b^2 = 1. QED.

Using this proposition, we can express the hermites of the output distribution (and therefore our loss function) in terms of the hermites of the base distributions, which are in turn obtained from the raw values in the lookup table by means of the recurrence relation.

Then there’s the choice of optimisation algorithm to use. Neural networks often use either stochastic gradient descent or (more recently) Adam, which are great algorithms when we only have an approximate estimate of the loss function (based on a batch of training data). In our situation, though, we can compute the loss function exactly without any noise (floating-point issues notwithstanding).

As such, we instead use the BFGS algorithm as implemented in Reuben Feinman’s torchmin package. BFGS is a quasi-Newton method, maintaining an estimate of the inverse of the Hessian matrix at the current point (based on the history of gradients observed at previously visited points) and using that to determine the next step to take. It mimics true second-order optimisation algorithms such as Newton’s method, but avoids computing the true inverse Hessian at each step (which is very expensive).

Often people use a limited memory variant called L-BFGS, which is helpful if you don’t have enough memory to store a full Hessian matrix. In our case, there are only 4096 parameters, and a 4096×4096 matrix only occupies a very manageable 128 megabytes, so we opted for the original BFGS algorithm instead.

Local search

After this optimisation had converged (requiring occasional small random perturbations along the way to escape saddle points), there was a major problem: the entries in the lookup table are required to be integers, but the optimisation was performed over the reals!

To convert to integers, we multiply by a scale factor of 2^24 and then round to the nearest integer lattice point. This rounding can be quite damaging to the final loss function, so we then perform a local search on the vertices of the 4096-dimensional unit hypercube centred on the unrounded scaled vector, moving to the best vertex within a particular radius of the current vertex until we reach a local optimum.

Our local search algorithm is as follows:

  1. begin with a starting point (the nearest integer lattice point to the BFGS solution);
  2. set r = 1;
  3. for each of the 16 base distributions, try changing r coordinates. If the best improvement (across all 16 (256 choose r) possibilities) reduces log(x^T D x) by at least 0.001, then move to that new point and jump to step 2;
  4. if no sufficient improvement was found, increase r (up to a maximum of 5) and jump to step 3;
  5. if no improvement was found at a local search of depth 5, terminate.

Since x^T D x is a positive-definite quadratic function of x, which is a linear function of each base distribution (when the other 15 distributions are held fixed), we can avoid evaluating (256 choose r) possibilities by instead performing a ball-tree-assisted nearest-neighbour Euclidean search with (256 choose floor(r/2)) query points and (256 choose ceil(r/2)) index points. We use the off-the-shelf implementation in scikit-learn. This gives an approximately quadratic speedup over brute-force, and makes the local search run relatively quickly (for each iteration, the search over all 16 base distributions takes a total of approximately 2 seconds for r = 3, 15 seconds for r = 4, and 90 seconds for r = 5).

Applying BFGS followed by local search to our initial inverse-CDF-based attempt, we arrive at an unusual set of distributions which yield a considerably better approximation to a Gaussian when convolved:

Choosing the coefficients of a, b, and c

Suppose that we have already populated the lookup tables as detailed in the previous section. Then we can determine the three coefficients by imposing the following constraints:

  • the ratio of the coefficients of a and b are sqrt(5) : 2;
  • the variance of the output distribution is 1;
  • the kurtosis of the output distribution is 3;

where each constraint is understood to mean ‘as close as possible to within floating-point error’.

Since the uniform perturbation is platykurtic (kurtosis less than 3), the random variables a and b need to be slightly leptokurtic (kurtosis greater than 3) in order for this system of constraints to admit a solution. If we can arrange for that to happen, then the output distribution is symmetric (so all odd-order moments are zero) and the 2nd and 4th moments are also correct.

Consequently, when performing the local search mentioned in the previous section, we added a barrier penalty to the loss function to force the convolution of the 16 distributions to be leptokurtic.

The parameters obtained in this manner result in a generator much better than our original attempt: it takes 1.6 × 10^30 outputs in order to fail a moment test at the 4-sigma level, up from 9.1 × 10^10 (the analogous number for the unoptimised distribution).

CUDA implementation

Here is the source code for the generator, minus the contents of the lookup table and the coefficients of a, b, and c in the linear combination. This matches the schematic detailed above, but is more complete as it shows how the random lookups, conditional negation, and final uniform random perturbation are obtained from the 32 bits of input randomness.

#include "stdint.h"
#define _DI_ __attribute__((always_inline)) __device__ inline
#define shuffle_xor(x, y) __shfl_xor_sync(0xffffffffu, (x), (y))
#define COND_NEGATE(i, x) if (entropy & (1u << (i))) { x = -x; }
#define HMIX(i) { int32_t s = a + b; a -= b; b = shuffle_xor(s, i); }

// load the parameters for the random Gaussian generator
#include "parameters.h"

/**
 * Creates a double-precision Gaussian random variable with
 * specified mean and standard deviation. The 'entropy' argument
 * should be a uniform random uint32.
 */ 
_DI_ double warpGaussian(uint32_t entropy, const int32_t* smem_icdf,
    double mu=0.0, double sigma=1.0) {

    // bank conflict avoidance strategy:
    int laneId = threadIdx.x & 15;

    // use 16 bits of entropy to retrieve two random variables
    // (so the entire warp has 64 such random variables).
    // ....bbbbbbbb........aaaaaaaa....
    int32_t a = smem_icdf[(entropy & 4080) | laneId];
    int32_t b = smem_icdf[((entropy >> 16) & 4080) | laneId];

    // perform the first three layers:
    COND_NEGATE(19, a) COND_NEGATE(18, b)
    HMIX(1)
    COND_NEGATE(17, a) COND_NEGATE(16, b)
    HMIX(2)
    COND_NEGATE(15, a) COND_NEGATE(14, b)
    HMIX(4)
    COND_NEGATE(13, a) COND_NEGATE(12, b)

    // create a uniform random odd int32 (centred on zero).
    int32_t c = (entropy ^ b) | 1;

    // use the lowest 4 bits of entropy (those that contributed the
    // least to the uniform variable c) for the last two layers:
    HMIX(8)
    COND_NEGATE(3, a) COND_NEGATE(2, b)
    HMIX(16)
    COND_NEGATE(0, a) COND_NEGATE(1, b)

    // at this point, a is a {+1,-1}-weighted sum of the 32 values
    // from this half-warp; b is a {+1,-1}-weighted sum of the 32
    // values from the other half-warp.

    double result = mu;

    // our output is the combination of two high-variance Gaussian
    // components and one low-variance uniform component:
    {
        double a_scale = sigma * ordinary_params[0];
        double b_scale = sigma * ordinary_params[1];
        double c_scale = sigma * ordinary_params[2];
        result += a * a_scale;
        result += b * b_scale;
        result += c * c_scale;
    }

    return result;
}

The only ‘surprise’ here should be the way that the uniform random variable c is obtained: it’s the result of XORing the 32 bits of input randomness with the register b (which has just been freshly ‘shuffled in’ from other threads in the warp, so is independent, and because its distribution is symmetric it remains independent even when we conditionally negate it) and setting the last bit to 1.

Both of these Boolean operations will be fused together in practice, as the CUDA compiler combines them into a ‘LOP3’ instruction, which applies an arbitrary 3-input 1-output bitwise Boolean function specified by a truth table in the form of an 8-bit immediate operand inside the compiled machine code.

Why set the least significant bit to 1? Otherwise, it would be negatively biased (with a mean of −0.5) because the representable 32-bit signed integers are:

{−2^31, −(2^31 − 1), …, −2, −1, 0, 1, 2, … 2^31 − 1}

and the first of these values (bit pattern 0x80000000) ruins the symmetry of the set of representable integers. By setting the least significant bit to 1, we instead have a uniform distribution over the following set of odd integers:

{−(2^31 − 1), …, −5, −3, −1, 1, 3, 5, … 2^31 − 1}

which is symmetrical about 0. We can alternatively view this random variable c as the convolution of 31 Bernoulli random variables with supports:

{−1, 1}, {−2, 2}, {−4, 4}, {−8, 8}, …, {−2^30, 2^30}

which is helpful for analytically computing its moments. (The moments of a small finite distribution are easy to compute, and we can compute the moments of a convolution of two distributions knowing the moments of the original distributions. Because we have to use this idea anyway for the random variables a and b, we may as well use it again for c instead of having to analytically sum a polynomial-valued series.)

We have ignored the subtlety that a and c are not necessarily independent. (b is independent from each of a and c because it originates from the other half-warp.) However, a is conditionally negated according to the least significant bit of entropy, which does not feature in the computation of c (because it was masked out when we bitwise-OR’d the variable with 1), so it follows that a and c have zero linear correlation and also that the output distribution is still symmetrically distributed. Any nonlinear dependence between a and c can only therefore manifest in the 4th moment (and beyond) of the output distribution, where its effect will be negligible because the scale factor applied to c is so small.

Issues with small floats

Let’s assume for the moment that we’re generating standard Gaussians, for which μ = 0 and σ = 1. A slight problem with the generator is that the scale factors of a, b, and c are all multiples of 2^−95, which means that the output is necessarily a multiple of 2^−95.

This only poses an issue for very small outputs: for example, if an output is in the interval [−2^−43, 2^−43], then the lowest bit of the mantissa is forced to be zero. Once we’ve collected (say) 20 outputs in this range, which takes an expected 2.2 × 10^14 samples, then we can be pretty certain that the generator is flawed.

This can be remedied by expressing the scale factor for c as the sum of two double-precision floats instead of as one double-precision float. Then, the last part of our implementation will resemble this:

double result = mu;

{
double a_scale = sigma * ordinary_params[0];
double b_scale = sigma * ordinary_params[1];
double c_scale_hi = sigma * ordinary_params[2];
double c_scale_lo = sigma * ordinary_params[3];
result += a * a_scale;
result += b * b_scale;
result += c * c_scale_hi;
result += c * c_scale_lo;
}

return result;

In doing so, the minimum ‘quantization’ is reduced from 2^−95 to 2^−150, which means that it will take 7.9 × 10^30 outputs to detect the flaw, comparable in magnitude to the moment tests.

It should be stressed that the additional fused multiply-add will slightly slow down the speed of the generator, and it is unlikely to matter for most practical applications of Gaussian random numbers (such as Monte Carlo simulations) where the presence of floating-point rounding errors in downstream calculations are likely of greater effect than the reduced relative precision in the neighbourhood of zero.

Testing with PractRand

Although we have analytically determined that the generator won’t fail any of the moment tests within 10^30 outputs, suggesting that the tail behaviour is correct, there could be other unforeseen problems in the generator. For example, had the coefficients of a, b, and c in the final linear combination been small integer multiples of some common divisor, there would be high-frequency issues with the generated distribution even though the tail behaviour could still be correct.

Consequently, I decided to test this using the PractRand tests following the instructions detailed here. In addition to the suggested patch by Daniel Lemire, it was necessary to change the return type in the declaration of show_checkpoint from ‘double’ to ‘void’, as the function doesn’t actually return anything and was segfaulting when I tried to run it.

PractRand tests that the input is uniformly distributed, whereas the output of the Gaussian generator is supposed to be normally distributed. To test the correctness of the output distribution, it is therefore necessary to apply the CDF of a standard Gaussian distribution to transform it into a supposedly-uniform random variable:

double gaussian = output_host_gauss[i];
double uniform = std::erf(gaussian * 0.7071067811865476);
res[i] = (uint32_t) ((1.0 + uniform) * 2147483648.0);

The error function, erf, produces a uniform random variable in the open interval (−1, 1) when given an input whose probability density function is proportional to exp(x²). Since the probability density function of a standard Gaussian is exp(½x²), it is necessary to divide the standard Gaussian by sqrt(2) before feeding it into the error function. Originally I forgot this factor of sqrt(2), and the PractRand tests reassuringly failed (by a huge margin, with p-values less than 10^−1000) at the first checkpoint.

Comparison of the error function, the affine-transformed error function, and the CDF of a standard Gaussian

Then, to convert this into a uniformly random uint32, we apply an affine transformation that maps (−1, 1) to (0, 2^32) and then apply the floor function to the result (implicitly, by casting to an unsigned 32-bit integer).

After running PractRand for approximately four days, the following output was obtained:

$ test/gpu/gpu_gaussian | PractRand/RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0xc7ea1bec
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 256 megabytes (2^28 bytes), time= 2.7 seconds
no anomalies in 124 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 512 megabytes (2^29 bytes), time= 5.6 seconds
no anomalies in 132 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 1 gigabyte (2^30 bytes), time= 11.5 seconds
no anomalies in 141 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 2 gigabytes (2^31 bytes), time= 22.7 seconds
Test Name Raw Processed Evaluation
[Low1/32]FPF-14+6/16:(4,14-2) R= +6.3 p = 2.7e-5 unusual 
...and 147 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 4 gigabytes (2^32 bytes), time= 45.0 seconds
no anomalies in 156 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 8 gigabytes (2^33 bytes), time= 89.9 seconds
Test Name Raw Processed Evaluation
Gap-16:B R= +4.0 p = 2.5e-3 unusual 
...and 164 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 16 gigabytes (2^34 bytes), time= 178 seconds
no anomalies in 172 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 32 gigabytes (2^35 bytes), time= 353 seconds
no anomalies in 180 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 64 gigabytes (2^36 bytes), time= 707 seconds
no anomalies in 189 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 128 gigabytes (2^37 bytes), time= 1406 seconds
Test Name Raw Processed Evaluation
[Low1/32]DC6-9x1Bytes-1 R= -4.2 p =1-6.2e-3 unusual 
...and 195 test result(s) without anomalies

rng=RNG_stdin32, seed=0xc7ea1bec
length= 256 gigabytes (2^38 bytes), time= 2785 seconds
no anomalies in 204 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 512 gigabytes (2^39 bytes), time= 5625 seconds
no anomalies in 213 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 1 terabyte (2^40 bytes), time= 11165 seconds
no anomalies in 220 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 2 terabytes (2^41 bytes), time= 22407 seconds
no anomalies in 228 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 4 terabytes (2^42 bytes), time= 45386 seconds
no anomalies in 237 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 8 terabytes (2^43 bytes), time= 90285 seconds
no anomalies in 244 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 16 terabytes (2^44 bytes), time= 213575 seconds
no anomalies in 252 test result(s)

rng=RNG_stdin32, seed=0xc7ea1bec
length= 32 terabytes (2^45 bytes), time= 395812 seconds
no anomalies in 260 test result(s)

There are no test failures here, nor anything suspicious. Occasionally certain tests are encountered with p-values around 0.0001 or 0.9999 and marked ‘unusual’, but they are not too surprising since thousands of tests are performed in total: these ‘unusual’ results are merely green jelly beans.

Tail tests and measuring the speed of the generator

The GPU can generate random Gaussians much more quickly than they can be copied over the PCI-express cable and analysed by PractRand on the CPU. Consequently, we consider a modified tail test, where we copy only the Gaussians whose absolute value exceeds 4. We expect the convolutions to smooth out the behaviour of the distribution in the bulk, so the tails are worth testing more thoroughly.

Only one out of 15787 Gaussians are expected to fall in these tails, so we can better match the relative throughput of the Gaussian generation on the GPU and PractRand analysis on the CPU.

We account for this when mapping to a uniform distribution:

std::vector<uint32_t> retrieve() {
    pcie();
    run_kernel();

    std::vector<uint32_t> res(output_host_count[0]);

    for (size_t i = 0; i < res.size(); i++) {
        double gaussian = output_host_gauss[i];
        double uniform = std::erf(gaussian * 0.7071067811865476);

        if (uniform > 0) {
            uniform -= 0.9999366575163338;
        } else {
            uniform += 0.9999366575163338;
        }

        // map to interval (0, 2)
        uniform = uniform * 15787.192767323968 + 1.0;

        res[i] = (uint32_t) (uniform * 2147483648.0);
    }

    return res;
}

Running this on a Volta V100 produces the following output:

$ test/gpu/gpu_gaussian_tail | PractRand/RNG_test stdin32
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x6f2e57d4
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 128 megabytes (2^27 bytes), time= 4.2 seconds
no anomalies in 117 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 256 megabytes (2^28 bytes), time= 9.1 seconds
no anomalies in 124 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 512 megabytes (2^29 bytes), time= 17.8 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-1,T) R= -6.5 p =1-2.9e-3 unusual 
...and 131 test result(s) without anomalies

rng=RNG_stdin32, seed=0x6f2e57d4
length= 1 gigabyte (2^30 bytes), time= 35.7 seconds
no anomalies in 141 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 2 gigabytes (2^31 bytes), time= 70.9 seconds
no anomalies in 148 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 4 gigabytes (2^32 bytes), time= 140 seconds
no anomalies in 156 test result(s)

In particular, it takes 140 seconds to produce 4 gigabytes of output, or equivalently 2^30 outputs (they’re converted to uint32s before piping into PractRand). However, these 2^30 outputs were subsampled from approximately 15787 times as many Gaussians (because we did the 4-sigma tail threshold), so the GPU is actually generating 121 billion Gaussians per second.

121 billion double-precision floats together occupy 968 GB, so it’s producing Gaussian random numbers faster than the Volta can even load/store to global memory (900 GB/s). In other words, it’s faster to run this (in whatever Monte Carlo simulation kernel is consuming the random Gaussians) than it is to load precomputed random Gaussians from the GPU’s global memory.

The GPU is running at full utilisation and PractRand is only showing a long-term average of 37% CPU usage, so (thanks to the tail thresholding) we’re no longer limited by the speed of PractRand. Let’s check up on the PractRand results:

rng=RNG_stdin32, seed=0x6f2e57d4
length= 8 gigabytes (2^33 bytes), time= 281 seconds
no anomalies in 165 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 16 gigabytes (2^34 bytes), time= 562 seconds
no anomalies in 172 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 32 gigabytes (2^35 bytes), time= 1116 seconds
no anomalies in 180 test result(s)

So far this has consumed about 20 minutes on an AWS p3.2xlarge instance, so roughly $1 of compute to generate a total of 136 trillion Gaussians (7 femtodollars per Gaussian!) and test the approximately 8 billion of those that land in the 4-sigma tails.

Consequently, to run PractRand to its default limit of 2^45 bytes on this tail test would take a fortnight and cost roughly $1000, so I’m loath to run it all the way to completion. Let’s instead stop at two terabytes, which takes nearly 20 hours ($60 on AWS):

rng=RNG_stdin32, seed=0x6f2e57d4
length= 64 gigabytes (2^36 bytes), time= 2238 seconds
no anomalies in 189 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 128 gigabytes (2^37 bytes), time= 4468 seconds
no anomalies in 196 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 256 gigabytes (2^38 bytes), time= 8899 seconds
no anomalies in 204 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 512 gigabytes (2^39 bytes), time= 17848 seconds
no anomalies in 213 test result(s)

rng=RNG_stdin32, seed=0x6f2e57d4
length= 1 terabyte (2^40 bytes), time= 35668 seconds
Test Name Raw Processed Evaluation
[Low1/32]Gap-16:A R= -3.9 p =1-2.5e-3 unusual 
...and 219 test result(s) without anomalies

rng=RNG_stdin32, seed=0x6f2e57d4
length= 2 terabytes (2^41 bytes), time= 71144 seconds
no anomalies in 228 test result(s)

Conclusions and acknowledgements

Between these empirical tests and theoretical analysis, it appears that this is a suitable method of generating non-cryptographic pseudorandom normally distributed random variables on a GPU, particularly suited for computationally intensive scientific and industrial applications such as Monte Carlo simulations.

It should not be used for cryptographic applications: there has been no attempted cryptanalysis of the output function, and based on the simplicity of the construction it is likely to be vulnerable to cryptanalytic attacks. Without the linear projection at the end, it would be especially trivial to invert; given the final a and b registers of all threads in a warp, it’s possible to determine (with about 2^32 effort) the random Hadamard transformation, the values loaded from the lookup table, and therefore the lower 28 bits of the input entropy.

If you need cryptographic random Gaussians (for RLWE, for instance) then you should use Dwarakanath and Galbraith’s paper instead.

Finally, many thanks go to Rajan Troll for relevant discussions over the last two months that immensely helped with developing this random Gaussian generator.

Posted in Uncategorized | Leave a comment

Involutions on a finite set

Suppose that R is the (assumed to be finite) set of solutions to a certain problem, and you’re interested in determining the parity of |R|. The following proof strategy works surprisingly often, namely in at least two different scenarios, and leads to very elegant proofs:

  • Embed R in a larger set S equipped with an involution f : S → S, such that R is precisely the set of fixed points of f;
  • Define a second involution g : S → S, such that the set T of fixed points of g is easily seen to have the correct parity (usually by having 0 or 1 elements).

Then we have |R| = |S| = |T| (mod 2).

We’ll discuss two examples of this proof technique.

Example I: Zagier’s one-sentence proof

To prove Fermat’s theorem that every prime of the form 4k + 1 is expressible as the sum of two squares, Zagier famously provided the following ‘one-sentence proof’ discussed here:

Here, g is the complicated piecewise involution defined above, and f is the involution which interchanges the second and third coordinates. The set R is the set of ways to write p = x^2 + 4y^2, and the set U is the singleton set containing the fixed point of the complicated involution g.

Observe that this proof is constructive: you can use these involutions f and g to construct a solution to the original problem! Specifically, if you start at the (unique) element of U and alternately apply f and g, the other end of the path is an element of R:

Christian Elsholtz similarly noticed this construction here albeit with the Greek letters α and β to refer to the two involutions.

At first, I though that this was a trick rather than a technique. In the words of Donald Knuth:

A trick is a clever idea that can be used once, while a technique is a mature trick that can be used at least twice.

That was until recently when I saw the following.

Example II: Hamiltonian cycles on cubic graphs

The problem is to show that, given a 3-regular graph G with a distinguished edge x, that there exists an even number of Hamiltonian cycles passing through x. Using this proof approach, we let:

  • S be the set of proper 3-edge-colourings (with colours red, green, and blue) of the graph G such that x is assigned the colour blue;
  • g be the involution that swaps red and green;
  • f be the involution that swaps blue and green on every blue-green component other than the one containing the edge x.

The fixed points of f are those where the green and blue edges form a single (Hamiltonian) cycle. There are no fixed points of g, so U is the empty set. It follows, therefore, that |S| and |R| are also even.

Again, this proof is constructive: by alternately applying f and g, this gives you an explicit self-inverse way to transform any Hamiltonian cycle through x into a different Hamiltonian cycle through x:

This is a natural bijection, in that it doesn’t involve making any arbitrary choices such as labelling the vertices of G. Even better, it’s acceptable for constructivists since we can provide an explicit upper bound on the length of the path of alternating applications of f and g: it’s no longer than the size of S, which is in turn upper-bounded by 3^(3n/2); there are 3n/2 different edges, each of which is either red, green, or blue.

Anyway, does this technique have a name? If not, I’d propose ‘Zagiery’ after the discoverer of the first proof. And are there any other applications besides the two that we’ve looked at?

Posted in Uncategorized | 7 Comments

Hamming backups: a 2-of-3 variant of SeedXOR

EDIT (2021-10-14): I’ve written a reference implementation of the Hamming backup idea introduced in this article.

SeedXOR is an approach for splitting a Bitcoin wallet seed, adhering to the BIP39 standard, into N ‘parts’ (each the same size as the original) which are each a valid BIP39 seed. It is used by the COLDCARD wallet and preferred over more complicated Shamir secret sharing schemes on the basis that:

  • the SeedXOR calculations are simple to do, such that they can be performed with pencil and paper without the aid of a computer;
  • each of the N parts of a SeedXOR backup is itself a valid BIP39 seed, so they can contain ‘decoy wallets’ and thereby steganographically hide the fact that they’re even part of a SeedXOR backup at all!

On the other hand, Shamir secret sharing enables ‘M-of-N’ backups, where only M parts are necessary to reconstruct the original seed (instead of all N parts).

Here we present a ‘2-of-3’ variant of SeedXOR which retains many of the desirable properties of the original SeedXOR:

  • the seed X is used to construct three ‘parts’ (each the same size as the original seed, as in SeedXOR), called A, B, and C;
  • as in SeedXOR, each of these three parts is indistinguishable from a standard BIP39 seed (giving rise to plausible deniability);
  • this is backwards-compatible with SeedXOR, in that if you have all three parts, then you can reconstruct X by simply taking A ⊕ B ⊕ C (and recomputing the checksum);

However, unlike a regular SeedXOR backup, this modified variant allows for full recovery of the original seed even when one of the three parts is missing:

  • given any two of the three parts, the seed X can be recovered.

This new variant is called a ‘Hamming backup‘ because the construction is based on an extended Hamming code. Generating a Hamming backup and recovering a seed from a Hamming backup can be done using only the XOR operation, as with SeedXOR. It’s slightly more complicated than SeedXOR in its construction (which is inevitable, because this implements a 2-of-3 scheme instead of an N-of-N scheme), but still possible to do with pencil and paper using only the XOR operation.

In particular, we firstly regard the 256-bit seed X as the union of two 128-bit values, X1 and X2, and similarly regard the parts A, B, and C as the unions A1+A2, B1+B2, C1+C2. We’ll explain later how to split a BIP39 seed in this manner.

The linear (bitwise XOR) relations that we impose between the different 128-bit values are summarised in the following diagram. (The analogous diagram for ordinary SeedXOR would be a single circle containing N+1 values, namely the seed X together with the N parts A, B, …)

This diagram has three circles, each of which has 4 values inside (in the case of the top-right circle, these are A1, A2, X1, C2) and 4 values outside (in this case, B1, B2, C1, X2). If four values are either all inside the same circle or all outside the same circle, then we impose the relation that they all must bitwise-XOR to zero. For example:

  • A1 ⊕ A2 ⊕ X1 ⊕ C2 = 0
  • B1 ⊕ B2 ⊕ C1 ⊕ X2 = 0

and analogously for the other two circles:

  • B1 ⊕ B2 ⊕ X1 ⊕ A2 = 0
  • C1 ⊕ C2 ⊕ A1 ⊕ X2 = 0
  • C1 ⊕ C2 ⊕ X1 ⊕ B2 = 0
  • A1 ⊕ A2 ⊕ B1 ⊕ X2 = 0

Equivalently, for each of these quartets, any one of the four values is given by the bitwise XOR of the other three.

Further linear relations can be derived from combining existing relations; for example, X1 = A1 ⊕ B1 ⊕ C1 and X2 = A2 ⊕ B2 ⊕ C2. Indeed, there is a group-like closure property: if you take any 3 of the 8 values in the diagram and bitwise-XOR them together, the result of the calculation is one of the 8 original values.

We’ll now discuss how these relations enable the seed X to be recovered from any two of the three parts {A,B,C} of a Hamming backup. Then we’ll discuss how to actually produce a Hamming backup {A,B,C} from an existing seed X together with a source of secure randomness (specifically, by generating A randomly and then deterministically deriving B and C from A and X).

Recovering a seed from a Hamming backup

The relations have the property that, given parts A and B, we can recover the seed X as follows:

  • X1 = A2 ⊕ B1 ⊕ B2 (they’re all inside the left circle)
  • X2 = A1 ⊕ A2 ⊕ B1 (they’re all outside the lower-right circle)

This can be done in three applications of XOR, rather than four, by precomputing the common subexpression A2 ⊕ B1. As such, this is only 50% slower than recovering a seed from a 2-of-2 SeedXOR backup, rather than twice as slow.

Moreover, the whole diagram has order-3 cyclic symmetry, so you can do the same thing with parts B and C, or with parts C and A. That is to say, you can recover the original seed with any two of the three parts.

Important note: when using two parts to generate the seed X, it is important that the two parts are used in the correct order (swapping the roles of A and B, for instance, would result in us generating the remaining part C instead of the desired secret X). If we made this mistake (and recovered C instead of X), then X can be derived by straightforwardly XORing together A, B, and C.

Constructing a Hamming backup

How does one construct a Hamming backup of an existing seed X? The first part, A, should be securely uniformly randomly generated. This will ensure that no information about X is leaked to an attacker who possesses only one of the parts {A, B, C}. (Like SeedXOR, one-time pads, and Shamir secret sharing, this is an information-theoretically secure construction, rather than a construction that depends upon the computational intractability of some problem.)

The remaining parts of the backup can be derived by moving anticlockwise around the diagram, obtaining each value from the XORing the previous two (in cyclic order around the diagram) together with one of the values from the seed X. Specifically, we compute:

  • B1 = A1 ⊕ A2 ⊕ X2;
  • B2 = A2 ⊕ B1 ⊕ X1;
  • C1 = B1 ⊕ B2 ⊕ X2;
  • C2 = B2 ⊕ C1 ⊕ X1;
  • A1 = C1 ⊕ C2 ⊕ X2;
  • A2 = C2 ⊕ A1 ⊕ X1;

The last two lines here are redundant — they compute the values A1 and A2 with which we already started — but are useful as a validity check in case you made errors whilst computing B1, B2, C1, and C2. An independent check is to load {A,B,C} into a COLDCARD hardware wallet, pretending that it’s a regular 3-of-3 SeedXOR. If the Hamming backup has been created correctly, this should load the original seed X = A ⊕ B ⊕ C.

BIP39-friendly way to split X into X1 and X2

If we only cared about the machine implementation of Hamming backups, then it would be straightforward to split the 256-bit value X (i.e. the entropy part of the seed, with no checksum) into an initial 128-bit segment and a final 128-bit segment.

However, part of the reason for favouring SeedXOR and Hamming backups over more complicated Shamir schemes is the ability to perform the calculations relatively quickly using pencil and paper without the aid of a computer.

The BIP39 standard appends an 8-bit checksum to the 256-bit seed (to create a 264-bit value) and then splits it into 24 words of 11 bits:

  • word 1: [11 seed bits]
  • word 2: [11 seed bits]
  • word 3: [11 seed bits]
  • […]
  • word 23: [11 seed bits]
  • word 24: [3 seed bits + 8 checksum bits]

These are each represented either as English words (from a 2048-element wordlist) or as 3-digit hexadecimal numbers where the first digit encodes 3 bits and the other two digits each encode 4 bits, making up the total of 11.

The hexadecimal representation is used when computing SeedXOR manually; a 16×16 lookup table is provided for applying the bitwise XOR operation to hexadecimal digits.

So that Hamming backups can be conveniently computed using the same mechanisms used by ordinary SeedXOR, we propose splitting the 24-word seed X into the 128-bit components X1, X2 as follows:

  • X1 contains words 1–11 followed by the first two hex digits of word 12;
  • X2 contains words 13–23 followed by the first hex digit of word 24 and the third hex digit of word 12.

On paper, we would write the hexadecimal digits of X1 directly above those of X2, like so:

Observe that only the first hex digit of word 24 is included (since the other two are checksum digits). Also, the first two digits of word 12 contribute to X1, whereas the third digit contributes to X2.

Crucially, the hex digits in the seed have been arranged into the two rows in such a way that each 4-bit digit in the X1 row appears above a 4-bit digit in the X2 row, and each 3-bit digit in the X1 row appears above a 3-bit digit in the X2 row. This means that performing bitwise XOR operations between different rows doesn’t ever yield an invalid value (such as causing the first hexadecimal digit in a word to be 9, which is outside the valid range 0..7).

Relationship with Shamir secret sharing

Hamming backups are isomorphic to a special case of Shamir secret sharing, where the degree of the polynomial is 1 (so it’s a 2-of-N scheme) and the construction operates over the field GF(2^2) of four elements. The three parts {A,B,C} encode the value of the polynomial evaluated at the three nonzero elements of the field, {1,ω,ω+1}, and the seed X encodes the value of the polynomial evaluated at 0.

Since we’re only dealing with a degree-1 polynomial, this construction is linear (which is why we were able to implement it exclusively using XOR). Shamir schemes based on higher-degree polynomials are not amenable to this technique, so do not give rise to arbitrary M-of-N analogues of Hamming backups.

The fact that Hamming backups only offer a 2-of-3 scheme is a limitation compared with the arbitrary M-of-N schemes possible with general Shamir secret sharing. It may be possible to implement M-of-N schemes using more sophisticated binary error-correcting codes, but doing so entails further sacrificing the simplicity of the original SeedXOR scheme.

Posted in Uncategorized | 1 Comment

An efficient prime for number-theoretic transforms

My new favourite prime is 18446744069414584321.

It is given by p = \Phi_6(x) = x^2 - x + 1, where x = 2^{32}. This means that, in the finite field \mathbb{F}_p, 2^32 functions as a primitive 6th root of unity, and therefore 2 is a primitive 192nd root of unity. It turns out that this field possesses multiple properties which make it well-suited to performing number-theoretic transforms (the finite field analogue of the Fast Fourier Transform).

Arithmetic

Firstly, note that arithmetic is especially convenient in this field. An element of the field can be represented as an unsigned 64-bit integer, since p is slightly less than 2^64. We can also efficiently reduce modulo p without involving multiplication or division. In particular, if we have a non-negative integer n less than 2^159, then we can write it in the form:

n = A 2^{96} + B 2^{64} + C

where A is a 63-bit integer, B is a 32-bit integer, and C is a 64-bit integer. Since 2^96 is congruent to −1 modulo p, we can then rewrite this as:

B 2^{64} + (C - A)

If A happened to be larger than C, and therefore the result of the subtraction underflowed, then we can correct for this by adding p to the result. Now we have a 96-bit integer, and wish to reduce it further to a 64-bit integer less than p. To do this, we note that 2^64 is congruent to 2^32 − 1, so we can multiply B by 2^32 using a binary shift and a subtraction, and then add it to the result. We might encounter an overflow, but we can correct for that by subtracting p.

In C/C++, the algorithm is as follows:

This involves no multiplication or division instructions. Indeed, we can take a look at the compiled assembly code by using the Godbolt Compiler Explorer:

Observe that, using LLVM as the compiler, the resulting code is branchless (despite the word ‘if’ appearing twice in the source code) and all of the instructions are particularly cheap.

Now, why would we end up with a 159-bit integer in the first place and want to reduce it? There are two occasions where this subroutine is useful:

  • To perform a multiplication modulo p, we use machine instructions to multiply the two 64-bit operands to give a 128-bit result.
  • Left-shifting a 64-bit integer by up to 95 bits gives a 159-bit result.

The latter allows us to cheaply multiply an element of our field by any power of 2. In particular, if we want to multiply by 2^(96a + b), then do the following:

  • if a is odd, then subtract the input from p (in-place), relying on the fact that 2^96 is congruent to −1;
  • shift-left by b bits (to give a 159-bit result) and reduce using the subroutine.

That is to say, multiplying by any 192nd root of unity is ‘fast’, bypassing the need to perform a general multiplication. (This is particularly useful for GPUs, where there isn’t a 64-bit multiplier in hardware, so multiplication expands to many invocations of the hardware multiplier.)

Other roots of unity

As well as possessing these 192nd roots of unity, the field also contains roots of unity of any order dividing p − 1. This factorises as follows:

p - 1 = 2^{32} \times 3 \times 5 \times 17 \times 257 \times 65537

where those odd prime factors are the five known Fermat primes. We can perform a Fourier transform of any length that divides p − 1, since all of the requisite roots of unity exist, but this will be especially efficient if we only involve the smaller prime factors.

Most of these other roots of unity are ‘slow’, meaning that there is no obvious way to multiply by them without performing a general multiplication. There is, however, a nice observation by Schönhage: \sqrt{2} = \zeta + \zeta^7, where ζ is a primitive 8th root of unity, so we can efficiently multiply by the square-root of 2 (which is, of course, a primitive 384th root of unity).

A length-N FFT involves the Nth roots of unity, so we would want to use FFTs of length dividing 192 (or 384 using the Schönhage trick) whenever possible. This suggests the following algorithm for convolving two sequences:

  • choose an FFT length N that divides 3 \times 2^{32} and is large enough to accommodate the result of the convolution;
  • write N as a product of (at most 5) numbers dividing 384. For example, in the largest case we could write N = 128 × 128 × 128 × 96 × 64;
  • use this decomposition as the basis of a mixed-radix Cooley-Tukey FFT. That is to say, we apply:
    • N/128 FFTs of length 128;
    • N/128 FFTs of length 128;
    • N/128 FFTs of length 128;
    • N/96 FFTs of length 96;
    • N/64 FFTs of length 64.

Since there are at most 5 rounds of FFTs, we only need to apply arbitrary twiddle-factors (which can be precomputed and stored in a table) 4 times, i.e. between successive rounds of FFTs. Compare this to an FFT over the complex numbers, where there are only 4 ‘fast’ roots of unity (±1 and ±i) and therefore irrational twiddle factors must be applied much more often.

Fürer’s algorithm

Martin Fürer’s fast integer multiplication algorithm uses this idea recursively, expressing the integer multiplication problem as a convolution over a suitably chosen ring with plenty of fast roots of unity. In particular, he writes:

We will achieve the desired speed-up by working over a ring with many “easy” powers of ω. Hence, the new faster integer multiplication algorithm is based on two key ideas:

  • An FFT version is used with the property that most occurring roots of unity are of low order.

  • The computation is done over a ring where multiplications with many low order roots of unity are very simple and can be implemented as a kind of cyclic shifts. At the same time, this ring also contains high order roots of unity.

The field \mathbb{F}_p for p = 2^{64} - 2^{32} + 1 is a fixed-size ring with these desirable properties. It supports multiplication of integers up to 3 \times 2^{35} bits (12 GB) by writing them in ‘balanced base 2^16’ (where each ‘digit’ can be between −32768 and 32767). Each of the two integers will occupy at most 3 \times 2^{31} digits, so the result can be computed by a convolution of length 3 \times 2^{32}. The maximum absolute value of a coefficient that could occur as a result of such a convolution is 3 \times 2^{61} < \frac{1}{2} p, so the coefficient can be recovered exactly if the convolution is performed over \mathbb{F}_p by reducing each coefficient into the range [−p/2, +p/2].

Of course, Fürer’s paper was interested in the asymptotics of integer multiplication, so it needs to work for arbitrarily large integers (not just integers of size at most 12 GB). The remaining idea was therefore to apply this idea recursively: to split a length-N integer into chunks of size O(log N), and use Fürer’s algorithm itself to handle those chunks (by splitting into chunks of size O(log log N), and so forth, until reaching a ‘base case’ where the integers are sufficiently small that FFT-based methods are unhelpful).

The number of layers of recursion is therefore the number of times you need to iteratively take the logarithm of N before log log log … log log N < K, where K is the base-case cutoff point. This is called the iterated logarithm, and denoted \log^{\star} N.

Each layer of recursion in Fürer’s algorithm costs a constant factor F more than the previous layer, so the overall complexity is O(N \log N F^{\log^{\star} N}). It was the asymptotically fastest multiplication algorithm between its invention in 2007 (taking the crown from Schönhage-Strassen) and 2016 (when an O(N log N) algorithm was discovered).

Practical integer multiplication

In practice, large integer multiplication tends to use either Schönhage-Strassen (in the case of the GNU Multiprecision Library), or floating-point FFTs, or a bunch of number-theoretic transforms over different primes with a final application of the Chinese Remainder Theorem (in the case of Alexander Yee’s y-cruncher). The primes used by these number-theoretic transforms don’t support any ‘fast’ roots of unity, though; all of the twiddle factors are applied using multiplications.

This suggests that a number-theoretic transform over the field of 18446744069414584321 elements may indeed be competitive for the sizes of integers it supports (e.g. up to 12 GB). For larger integers, we can use the Schönhage-Strassen algorithm with this prime field as a base case (one layer of Schönhage-Strassen being more than enough to support multiplication of integers that can be stored in practice).

Posted in Uncategorized | Leave a comment

Hamming cube of primes

Given two nonnegative integers, m and n, we say that they are Hamming-adjacent if and only if their binary expansions differ in exactly one digit. For example, the numbers 42 and 58 are Hamming-adjacent because their binary expansions 101010 and 111010 differ in a single position.

If m and n are Hamming-adjacent, then their absolute difference |n − m| is necessarily a power of 2. The converse is not true, though; 24 and 32 have a difference of 8, but their binary expansions 011000 and 100000 differ in three positions.

We can form a countably infinite graph G on the set of all nonnegative integers by connecting two vertices if and only if they’re Hamming-adjacent. G is a bipartite graph: two integers can only be Hamming-adjacent if one is odious and the other is evil.

G is the union of nested hypercube graphs: the first 2^d nonnegative integers form a hypercube graph of degree d. For example, here’s the subgraph induced by the first 32 naturals:

What about if we take the induced subgraph on only the vertices which are primes? For the primes below 32, we get the following graph:

It’s a connected graph! It remains connected if we repeat for the primes below 64:

What about for the primes below 128? Now it has an isolated vertex, 127, and an isolated edge:

When we go further and look at the primes below 512, the vertex 127 is no longer isolated; it has been connected into the main component via the prime 383:

Similarly, the edge between 89 and 73 becomes assimilated into the main component once we increase our horizon to the primes below 1024.

This raises the question: does every prime eventually become connected to the main component? Or, equivalently, when we form the countably infinite induced subgraph H of G whose vertices are all of the primes, is H connected?

Isolated vertices

It turns out that the answer is no: the prime 2131099 is an isolated vertex in H with no neighbours whatsoever! (It is plausibly the smallest such example.)

How does one even prove that 2131099 is an isolated vertex? In other words, how do we show that all of the Hamming-adjacent integers are composite? Firstly, note that the Hamming-adjacent integers smaller than itself are a finite set obtained by subtracting one of the powers of 2 in its binary expansion:

33947, 2098331, 2130075, 2130971, 2131083, 2131091, 2131097, 2131098

These can all be verified to be composite. But what about the infinitely many Hamming-adjacent integers that are larger than itself? These are necessarily of the form $2131099 + 2^k$ for some value k. It transpires that every element in this set must be divisible by at least one of the primes {3, 5, 7, 13, 17, 241}, called the covering set. In particular, we have:

  • k = 1 (mod 2) implies divisible by 3;
  • k = 0 (mod 4) implies divisible by 5;
  • k = 1 (mod 3) implies divisible by 7;
  • k = 2 (mod 12) implies divisible by 13;
  • k = 2 (mod 8) implies divisible by 17;
  • k = 6 (mod 24) implies divisible by 241;

and together these cover all residue classes modulo 24.

We can go further and show that there are infinitely many such isolated vertices. In particular, every number of the following form:

n = 1885107369798300628981297352055 h + 3316923598096294713661

has a covering set of primes for all numbers of the form n \pm 2^k, as discovered by Christophe Clavier. Specifically, there are two covering sets of primes: one for the ‘+’ case and one for the ‘−’ case; the union of these two sets is just the set:

{3,5,7,11,13,17,19,31,37,41,73,97,109,151,241,331,673,1321}

of prime divisors of the linear coefficient.

So, we just need to show that there are infinitely many primes of this form that are not Hamming-adjacent to any of the primes in the above set. The latter condition is easy to enforce — for example, by insisting that n is congruent to 4095 mod 4096 (i.e. its binary expansion ends in twelve ‘1’s). Then we are left with an arithmetic progression whose offset and common difference are coprime, and a theorem of Dirichlet states that there are infinitely many primes in such an arithmetic progression.

A difficult conjecture

Given the existence of infinitely many isolated vertices, we postulate the following conjecture:

Conjecture: Other than the infinitely many isolated vertices, there is exactly one connected component in H.

What would a counterexample look like? The smallest such counterexample would be a pair of Hamming-adjacent primes with p > q and no further neighbours. The usual method of covering sets would only work under the following set of conditions:

  1. p has a covering set P preventing any larger Hamming-adjacent primes;
  2. q has a covering set Q preventing any larger Hamming-adjacent primes, with the single exception of p, and this is only possible if p is itself a member of Q;
  3. The finitely many numbers smaller than and Hamming-adjacent to p all happen to be composite, with the single exception of q;
  4. The finitely many numbers smaller than and Hamming-adjacent to q all happen to be composite.

The second of these conditions is incredibly implausible, because the primes appearing in a covering set are usually much smaller than the dual Sierpinski numbers that arise from such a covering set. Here we are asking for the prime in the covering set to be larger!

Note that this conjecture is definitely unsolved, because it’s stronger than the conjecture that there are primes of every Hamming weight.

Posted in Uncategorized | 2 Comments

Determinacy

I’d like to take this opportunity to highly recommend Oscar Cunningham’s blog. One of the posts, entitled A Better Representation of Real Numbers, describes an elegant order-preserving* bijection between the nonnegative reals [0, ∞) and ‘Baire space‘, \mathbb{N}^{\mathbb{N}}, the space of infinite sequences of nonnegative integers.

*where the order on the Baire space is defined by lexicographical order.

Ultimately, it is built up recursively from just two ideas:

  1. There is an order-preserving homeomorphism between [0, 1) and [0, ∞), namely xx/(1 − x).
  2. [0, ∞) can be partitioned into countably many copies of [0, 1), stacked end-to-end, one for each natural number.

Baire space and [0, ∞) both admit natural topologies (the product topology and the usual topology, respectively) and this order-preserving bijection is continuous in one direction: from Baire space to the nonnegative reals.

If we allow the first element of the integer sequence to be negative, then this gives an order-preserving bijection between the entirety of \mathbb{R} and the space \mathbb{Z} \times \mathbb{N}^{\mathbb{N}}. Rather similarly to the better known continued fraction representation, rationals are mapped to eventually-zero sequences and quadratic irrationals to eventually-periodic sequences.

Gale-Stewart games

Given a subset X of Baire space, there is an associated two-player game (the Gale-Stewart game G(X)) where players alternately exclaim nonnegative integers:

  • Player A plays a nonnegative integer z0;
  • Player B plays a nonnegative integer z1;
  • Player A plays a nonnegative integer z2;
  • Player B plays a nonnegative integer z3;

and so on, ad infinitum. Player A wins if the sequence (z0, z1, z2, z3, …) belongs to the set X; Player B wins otherwise.

A game G(X) and the corresponding set X are described as determined if one of the players has a winning strategy. To be precise, a strategy is a function from finite initial segments of play to the nonnegative integers, specifying which move one makes after seeing a particular sequence of moves.

It is somewhat unintuitive that non-determined games can exist: after all, it’s impossible for a game to end in a draw, so one may naively conclude that ‘under perfect play’ the game will be won by A or by B. However, ‘perfect play’ is not well defined for these infinite games, and it’s possible to have an undetermined game with the following properties:

  • for every strategy of player A, there exists a player B strategy that beats it;
  • for every strategy of player B, there exists a player A strategy that beats it;

and these are perfectly consistent statements. In ZFC, it’s possible to build an undetermined game using transfinite induction. The idea is to iterate over the tuples of the form (player, strategy, initial_segment_of_play) and to ‘foil’ that strategy by fixing the game outcome to be a loss for that player if that player follows the strategy and the other player plays some other strategy. This is possible because the other player can force the game to hit any of |\mathbb{R}| different outcomes, and thus far in the transfinite induction we’ve fixed strictly fewer than that many points in the Baire space, so there’s always a game outcome that we haven’t yet fixed and can freely set to be a win for the other player. (At the end of this transfinite induction, we likely only have a partially-specified game, but that’s fine; we just set the remaining outcomes arbitrarily, e.g. to be wins for Player B.)

Surprisingly, it’s consistent with ZF (sans choice) that every Gale-Stewart game is determined. This is called the axiom of determinacy. It’s false in ZFC, for the reasons stated above, but weaker versions of this axiom can be proved in ZFC. For instance:

Every open subset of Baire space is determined

and, more generally:

Every Borel subset of Baire space is determined

Consequences of large cardinals

Assuming certain large cardinal axioms, it is possible to show that various generalisations of Borel sets are also determined.

One example of a large cardinal axiom is the existence of a measurable cardinal, an uncountable cardinal S such that there exists a two-valued measure \mu : \mathbb{P}(S) \rightarrow \{0, 1\} satisfying:

  • μ(A) = 0 if and only if μ(S \ A) = 1;
  • if μ(A) = 0 and B is a subset of A, then μ(B) = 0;
  • singletons are small, i.e. μ({p}) = 0 for all points p in S.
  • the union of a collection of strictly fewer than |S| sets of measure 0 again has measure 0.

Without the ‘uncountable’ constraint, \aleph_0 would be measurable because this is equivalent to the definition of a non-principal ultrafilter.

If such a measurable cardinal exists, then it follows that every analytic set (continuous image of a Borel set) is determined, as is every coanalytic set (complement of an analytic set). A proof of this is here; I first encountered it in a 2015 graduate course on Descriptive Set Theory by Adrian Mathias.

Analytic and coanalytic sets together form the 1st level of the projective hierarchy; the 2nd level consists of projections of coanalytic sets and complements thereof; and so forth. (The Borel sets can be viewed as the 0th level of the projective hierarchy.)

If there are n Woodin cardinals with a measurable cardinal above them, then every set in the (n+1)th level of the projective hierarchy is determined. If there are infinitely many Woodin cardinals, then this holds for every n, and therefore every projective set is determined. This was proved in a paper by Martin and Steel.

Finally, if there’s a measurable cardinal above this infinite sequence of Woodin cardinals, then L(\mathbb{R}) (the constructible closure of the reals) satisfies the axiom of determinacy.

Consequences of determinacy

Determinacy has been formulated in terms of infinite games, but it is also of interest in real analysis because it implies various regularity properties about sets of reals. In particular, if a pointclass (family of sets) \mathcal{A} is closed under continuous preimages, such as those we have just considered, then every set in \mathcal{A} being determined also implies that they:

For example, if the axiom of determinacy holds in L(\mathbb{R}), then all three of these regularity properties hold for every set in \mathbb{P}(\mathbb{R}) \cap L(\mathbb{R}).

In Strong Axioms of Infinity and the Search for V (definitely worth reading), W. Hugh Woodin makes a strong case in favour of projective determinacy: projective sets are precisely the relations that can be formulated in second-order arithmetic, and assuming projective determinacy allows one to settle most natural questions in second-order arithmetic.

Posted in Uncategorized | Leave a comment

One-way permutations

One-way permutations are fascinating functions that possess a paradoxical pair of properties. They are efficiently computable functions φ from [n] := {0, 1, …, n−1} to itself that are:

  • Mathematically invertible, meaning that φ is a bijection;
  • Cryptographically uninvertible, meaning that it’s computationally infeasible in general to compute x given φ(x).

We’ll discuss a couple of constructions of one-way permutations. The first construction is simpler to understand, but is less secure (for a given size of n) or less efficient (for a given security level).

Construction I: using modular arithmetic

Let p be a large Sophie Germain prime; that is to say, a prime such that q = 2p + 1 is also prime. Then the field \mathbb{F}_q has a cyclic multiplicative group of 2p elements, and we can find a generator g for this multiplicative group.

Letting n = q − 1 = 2p, we can define a one-way permutation from [n] to itself as follows:

φ(x) := g^x − 1

The function g^x is a bijection from [n] to the set of nonzero elements modulo q, and the offset of − 1 maps this back to the interval [n]. It is easy to compute in the forward direction, by repeated squaring, but is intractible to invert when n is sufficiently large. The problem of inverting modular exponentiation is the discrete logarithm problem, and the largest safe prime for which this problem has been broken is 768 bits long.

To achieve a security level of 128 bits, NIST recommends using a 3072-bit value of n. The reason that this is so large is because the discrete logarithm problem is susceptible to index calculus attacks.

Attempting to use elliptic curves

Recall that an elliptic curve is a nonsingular cubic algebraic curve. By using an appropriate change of variables, it can be written in the Weierstrass normal form as the equation y² = x³ + ax + b, where the nonsingularity condition is equivalent to the constraint that the discriminant 4a³ + 27b² is nonzero.

Elliptic curves admit a group operation. Specifically, to add two points A and B, do the following:

  • draw a line through A and B, and let it intersect the curve again at C;
  • draw a line through C and the point at infinity, and let it intersect the curve again at another point (A+B).

This operation is manifestly commutative, has an identity (the point at infinity), and admits inverses. The operation is also associative; as mentioned before, this can be proved using Cayley-Bacharach.

In cryptography, it’s customary to work with elliptic curves over finite fields such as \mathbb{F}_p. A theorem of Hasse states that the number |E| of points on the elliptic curve E is approximately equal to p, differing by O(sqrt(p)). Particularly useful is when the group of points on the curve is cyclic (a sufficient condition is for |E| to be squarefree), as then we can find a generator point G and formulate an analogue of the discrete logarithm problem.

For a given group size, the elliptic curve discrete logarithm problem is more difficult than the ordinary discrete logarithm problem, since for general elliptic curves there are no known algorithms faster than the Pollard rho algorithm. Note that some curves are much less secure than others; Daniel J. Bernstein has a website recommending how to choose a good curve.

Instead of the group size of 2^3072 recommended for NIST to provide 128-bit security for the ordinary discrete logarithm problem, it suffices to have an elliptic curve with group size approximately 2^256. The security of the digital signatures in the Bitcoin network are based on the intractability of solving the discrete logarithm problem on a particular elliptic curve (called secp256k1) of approximately this size.

Unfortunately, there is no easy way to bijectively map the points of this elliptic curve to an interval of the correct size. It was possible with the previous group, Construction I, thanks to the fact that the multiplicative subgroup of \mathbb{F}_q is exactly the interval {1, 2, 3, …, q − 1}, and that can be mapped to the standard interval {0, 1, 2, …, q − 2} by subtracting 1. But the same thing cannot be done for elliptic curves because the usual way of representing points (x, y) as the pair of the x-coordinate and the sign of y is not bijective: roughly half of the numbers in {0, 1, …, p − 1, ∞} cannot occur as valid values of x because the corresponding values of x³ + ax + b are quadratic nonresidues (so cannot equal y²).

There is a class of elliptic curves for which we can efficiently biject the points with an interval, namely supersingular elliptic curves, but unfortunately this just ends up as a disguised version of Construction I: the elliptic curve discrete logarithm problem for these curves is reducible to discrete logarithm modulo p.

Construction II: twists to the rescue!

If d is a quadratic nonresidue, then the following pair of elliptic curves:

  • E: y² = x³ + ax + b;
  • E‘: y² = d(x³ + ax + b);

are known as quadratic twists of each other. The values of x which lie on E‘ precisely fill in the ‘holes’ formed by the values of x which don’t lie on E, because d(x³ + ax + b) is a quadratic residue precisely when x³ + ax + b is not! (Some extra care is needed to handle the points where y² is 0 or ∞.)

Specifically, we have:

|E| + |E‘| = 2p + 2

and, importantly, we have an explicit bijection between [2p + 2] and the union of the two elliptic curves. As such, if E and E‘ are both cyclic groups and have intractable discrete logarithm problems, then we can manufacture a one-way permutation on [2p + 2] as I describe in this StackExchange answer. This construction appears to have been discovered in 1991 by Burton S. Kaliski Jr.

Avoiding small cycles

Let’s suppose we want to use this one-way bijection inside a pseudorandom number generator, similar to Blum-Blum-Shub but without the existence of a trapdoor (the factorisation of M = pq). It is entirely plausible that φ contains fixed points or short cycles, which would result in periodic output from the pseudorandom number generator.

In particular, if we apply Construction II to a pair of curves with p = 2^255 − 19, we obtain a one-way permutation on [2^256 − 36]. For convenience, we can pad this to a permutation on [2^256] by extending it with the identity function on the remaining 36 points. This permutation has at least 36 fixed points, and plausibly a few more than that.

We’ll extend this to a one-way permutation on [2^384] as follows:

Split the input into a 256-bit ‘upper part’ and a 128-bit ‘lower part’. Apply the one-way permutation to the upper part, and apply a period-2^128 permutation (such as an appropriate affine function modulo 2^128) to the lower part. Then XOR the lower part into the upper part, in-place, to ‘perturb’ it.

By looking at the lower part in isolation, we know that this permutation does not have any cycles of length shorter than 2^128. Moreover, the interaction from the lower part to the upper part is designed to ensure that every orbit will benefit from the hardness of the elliptic curve discrete logarithm problem; without it, the 36 trivial fixed points of the permutation on [2^256] would each give rise to a weak random number generator (a 128-bit LCG).

To use this as an actual cryptographic random number generator, you’d need to choose a suitable output function — for instance, taking a RIPEMD-160 hash* of the 384-bit internal state — and then be careful** to implement the whole thing using constant-time instructions to avoid timing attacks. (On most CPUs, for example, the multiplication instruction lacks this property.)

*practical cryptographic random number generators such as ChaCha20 use less computationally intensive output functions, but in this case we don’t care because it still pales in comparison to the amount of computation in the state update function.

**harder than it seems, because the compiler may optimise your constant-time code into faster variable-time code. Unless you absolutely need the property that its previous internal states cannot be deduced from its current internal state, then I’d recommend using a well-tested implementation of ChaCha20 instead of this 384-bit PRNG.

Posted in Uncategorized | Leave a comment

Cyclotomic fields

The nth cyclotomic field is the field \mathbb{Q}[\zeta] generated by a primitive nth root of unity, ζ. It is an example of a number field, consisting of algebraic numbers, and its dimension is φ(n) when regarded as a vector space over \mathbb{Q}. If n is odd, then the nth cyclotomic field is identical to the 2nth cyclotomic field, so we’ll henceforth assume without loss of generality that n is even.

The intersection of \mathbb{Q}[\zeta] with the ring of algebraic integers is exactly the ring \mathbb{Z}[\zeta] generated by ζ. The first few examples of rings of this form are:

  • n = 2: \zeta = -1, so \mathbb{Z}[\zeta] is just the ring of ordinary integers \mathbb{Z}.
  • n = 4: \zeta = i, so \mathbb{Z}[\zeta] is the ring of Gaussian integers (red, below-left).
  • n = 6: \mathbb{Z}[\zeta] is the ring of Eisenstein integers (blue, below-right).

These three rings admit unique factorisation, meaning that every nonzero element can be uniquely expressed as a product of primes, up to reordering and unit migration. This turns out to be true for precisely the following thirty values of n:

2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
32, 34, 36, 38, 40, 42, 44, 48, 50, 54, 60, 66, 70, 84, 90

The corresponding values of φ(n) are:

1, 2, 2, 4, 4, 4, 6, 8, 6, 8, 10, 8, 12, 12, 8,
16, 16, 12, 18, 16, 12, 20, 16, 20, 18, 16, 20, 24, 24, 24

Note that, in particular, the maximum dimension is φ(n) = 24, attained only by the three largest values of n (70, 84, and 90). These rings, viewed as additive groups, are therefore free Abelian groups of rank 24.

And this is where it gets exciting: David Madore noticed that the existence of order-70 and order-84 elements in the Conway group Co0 give rise to highly rotationally symmetric projections of the Leech lattice into the complex plane. Here’s the order-84 projection applied to the first shell:

Projection of the first shell of the Leech lattice into the plane with order-84 rotational symmetry, by David Madore

And here’s the order-70 counterpart:

Projection of the first shell of the Leech lattice into the plane with order-70 rotational symmetry, by David Madore

By suitably scaling and rotating either of these projections, we can arrange for the image to contain the respective nth roots of unity. If we view the resulting projection as a map from \mathbb{Q}^{24} to \mathbb{Q}[\zeta], we can see that it is surjective (because its image contains the roots of unity, and these span the field) and therefore bijective (because the dimensions match).

In particular, that means that the projection remains injective when restricted to the Leech lattice, and its image must be a rank-24 Abelian subgroup of \mathbb{Q}[\zeta] which is closed under multiplication by roots of unity. Given that it has a finite integral basis, we can further scale/rotate the projection such that the images are all algebraic integers and therefore form an ideal in \mathbb{Z}[\zeta].

Thanks to these rings having class number 1, every ideal is principal, so this ideal must simply be a scaled/rotated copy of \mathbb{Z}[\zeta] itself! That is to say, when n is 70 or 84, we can identify the points in the Leech lattice with the algebraic integer elements of the cyclotomic field!

Working in the opposite direction, we can construct the Leech lattice starting from one of these cyclotomic fields by defining an inner product on its algebraic integer elements:

\langle u, v \rangle = tr(\alpha u v^{\star})

where α is a particular element of the ring (different choices yield different lattices), and tr is the trace operation (sum of all Galois conjugates).

There are similar constructions of the Leech lattice for some other values of n, where the class number is greater than 1 (and therefore unique factorisation does not apply), but the constructions could potentially require using a proper ideal rather than the whole ring of algebraic integer elements. For instance, Craig discovered a construction for n = 78, and Eva Bayer-Fluckiger gave a more thorough exploration.

The cases of n = 70 and n = 84 are arguably the most interesting, because they endow the Leech lattice with a compatible ring structure where unique factorisation holds: the nonzero lattice points form a multiplicative monoid isomorphic to the direct sum of a cyclic group of order n (the group of roots of unity) and the free commutative monoid on countably many generators.

Of these, the n = 84 ring is my favourite: it contains the Gaussian, Eisenstein, and Kleinian integers as subrings, and indeed is the smallest cyclotomic ring to do so.

Posted in Uncategorized | Leave a comment

Urbit for mathematicians

The three months since the last cp4space article have largely been spent learning about an interesting project called Urbit. My attention was drawn to its existence during a talk by Riva-Melissa Tez (formerly of Intel) on the importance of continued improvements to the computational efficiency of hardware.

A decent amount has been written on the Internet about Urbit, but it tends to either be non-technical, explaining the importance of the project, or technical but written in highly unconventional project-specific jargon. Instead, we’ll depart from either of these by examining the novel internals of Urbit that are most likely to be of interest to mathematicians and theoretical computer scientists.

Before we can do so, however, it is worth asking what Urbit is. This is often a source of confusion, because it is an ecosystem of multiple interacting components designed to mesh well together. If you were to metaphorically run 2-means on the set of Urbit components, they would cleanly separate into the operating system (Urbit OS) and identity system (Urbit ID).

The most interesting salient components of Urbit OS are:

  • Arvo: a deterministic operating system kernel implemented in a pure functional programming language (Hoon).
  • Nock: a very low-level functional programming language. It’s roughly halfway between a minimalistic Lisp dialect and the SKI combinator calculus. In the Urbit ecosystem, Nock is the machine code for the virtual machine in which the Urbit OS runs.
  • Vere: a software implementation (written in C) of the Nock virtual machine, capable of running the VM inside a Unix environment.
  • Hoon: a higher-level functional programming language in which the Urbit OS is directly written. This ultimately compiles down to Nock.

The identity system, Urbit ID, is a hierarchical system of integer addresses (analogous to IPv4 addresses) called Azimuth points. Azimuth points are securely associated to public keys (either mutably or immutably), and communication between machines running Urbit is built upon this public key infrastructure.

Azimuth points and their hierarchy

As mentioned, Azimuth points are non-negative integers. There are five such levels:

  • Points in [0, 2^8 – 1] are called galaxies.
  • Points in [2^8, 2^16 – 1] are called stars.
  • Points in [2^16, 2^32 – 1] are called planets.
  • Points in [2^32, 2^64 – 1] are called moons.
  • Points in [2^64, 2^128 – 1] are called comets.

These can cumulatively be represented by the datatypes uint8, uint16, uint32, uint64, and uint128, respectively. These datatypes possess standard ‘restriction maps’ familiar from using a programming language such as C:

uint64 → uint32 → uint16 → uint8

where, for instance, a 32-bit integer is coerced into a 16-bit integer by taking the lower half (or, equivalently, reducing modulo 2^16). There are two elegant ways to impose a ring structure on each of these 2^n-bit datatypes:

  • Integers modulo 2^(2^n), which form a ring;
  • Nimbers less than 2^(2^n), which form a finite field.

The first choice is the familiar one (in the C programming language, the operators {+, *, -} coincide with modular arithmetic in Z/(2^2^n)Z). In the second choice, ring addition is bitwise XOR, and ring multiplication is slightly more complicated to define.

For each of these two choices, the aforementioned restriction maps are ring homomorphisms.

Now, how do these restriction maps relate to Urbit? They impose a partial order on the set of Azimuth points, where xy if x is the image of y under a restriction map. The Hasse diagram of this partial order is a forest of 256 trees, each of which has a galaxy at its root. This Hasse diagram determines, for each Azimuth point that’s not a galaxy or comet, a unique parent of that Azimuth point. Most planets, for instance, have stars as parents.

The parenthood relationship for comets is different: the parent of a comet is its image under the restriction map uint128 → uint16, which must be one of five stars willing to host comets. Comets are direct hashes of public keys (so, unlike planets, can’t be transferred between different owners).

Planets, stars, and galaxies are tradable as non-fungible tokens on the Ethereum blockchain. This gives a modifiable decentralised mapping from each Azimuth point (32 bits and easily memorable) to an Ethereum address (160 bits and cryptographically secure). As such, it’s possible to trustlessly verify the owner of an Azimuth point and therefore set up a secure communication channel with that Azimuth point; all communications between machines running Urbit OS are established in this manner.

Names of Azimuth points

The integer representing an Azimuth point is converted into a human-readable name (called an ‘@p’) through the following process:

  1. For planets and moons (not for stars and galaxies), a Feistel cipher is applied to the lowest 32 bits of the integer. This is a reversible permutation designed to obfuscate the relationship between the name of a planet and its parent star, reflecting the fact that planets are permitted to move freely between stars.
  2. The bytes of the resulting (enciphered) integer are each converted to a three-letter syllable, using a separate lookup table for even-indexed bytes (suffixes) and odd-indexed bytes (prefixes). Each syllable consists of a generalised vowel flanked by two consonants.
  3. The syllables are written in big-endian, with a hyphen between each 16-bit word and a tilde at the beginning.

An example of a 32-bit planet name is ~sorreg-namtyv, which belongs to the departed founder of Urbit.

A more comprehensive description of the process, including the two 256-element syllable lookup tables, is given in this StackOverflow answer.

In addition to the pronounceable ‘@p’ identifier, there is a visual encoding called a sigil. This stores each byte of the (enciphered!) 32-bit integer in a separate quadrant of a 2-by-2 arrangement, again using a pair of 256-element lookup tables. The sigil for the planet ~firmer-hatful is shown here:

sigil for ~firmer-hatful

Gavin Atkinson describes the design decisions involved in creating this system of sigils. Planets with circular sigils and/or meaningful @p names (e.g. ~timber-walled) tend to be in greater demand, and therefore more expensive, due to the combination of higher scarcity and aesthetic appeal.

The lookup tables mapping bytes to quadrant tiles were populated manually (by drawing 512 tile designs), so the Urbit sigils can be thought of as partially procedurally generated. [A fully procedurally-generated system may also have worked, such as the system in this Wolfram blog post from 2009, but this is unlikely to change now.] This is, nonetheless, highly consistent with the @p names; the syllable lookup tables were also manually populated.

Nock and its emulator

We proceed to turn our gaze from Urbit ID to Urbit OS.

Nock is a low-level pure functional programming language described here. Code and data are simply binary trees (called nouns) whose leaves are nonnegative integers. The Nock interpreter is a unary function * whose domain and codomain are both the space of nouns; its semantics are defined recursively in terms of itself and other unary functions.

My favourite of these functions is the tree-addressing operator, /, which is defined as follows:

/[1 a]              a
/[2 a b]            a
/[3 a b]            b
/[(a + a) b]        /[2 /[a b]]
/[(a + a + 1) b]    /[3 /[a b]]
/a                  /a

In particular, if we let L(x) and R(x) refer to the left and right subtrees of a tree x, then we have:

  • /[1 x] = x
  • /[2 x] = L(x)
  • /[3 x] = R(x)
  • /[4 x] = L(L(x))
  • /[5 x] = R(L(x))
  • /[6 x] = L(R(x))
  • /[7 x] = R(R(x))

and so forth. In general, /[n x] can be evaluated by:

  • expressing n in binary (for example, 1729 is “11011000001”);
  • removing the leading ‘1’ (in this example, giving “1011000001”);
  • replacing each ‘0’ with ‘L’ and ‘1’ with ‘R’ (in this example, “RLRRLLLLLR”);
  • taking x and sequentially applying each of the characters to it (in this example, giving R(L(L(L(L(L(R(R(L(R(x)))))))))), noting that the outermost function is the least significant bit).

If we enumerate the nodes in an infinite binary tree according to the natural number required to address the subtree rooted at that node using the / operator, this agrees with a breadth-first traversal of the tree. This is the same enumeration (the salmon spiral below) that relates the Calkin-Wilf sequence to the Calkin-Wilf tree, shown in this illustration by David Eppstein:

The operator / is useful enough that Lisp distributions typically implement the special-cases of this function for all 2 ≤ n ≤ 15, referring to them as {car, cdr, …, cddddr}.

Nock, if evaluated naively, would be incredibly inefficient: decrementing an integer n takes time proportional to n. The emulator therefore uses a clever idea called jets: runtime optimisations where the emulator pattern-matches various patterns (such as the decrementation function applied to an integer) and calls a native C function to decrement the integer directly.

The Urbit project has succeeded in implementing an entire operating system kernel as a function in this language, and it is performant in practice owing to the use of jets. The Hoon compiler (which compiles programs written in the higher-level language Hoon into the low-level Nock language) generates code that will be successfully pattern-matched by these jets.

With jets, it’s possible to go even lower-level than Nock. Urbit doesn’t do this because there’s not really much point in doing so, except to give a formal definition of Nock in terms of an even simpler language:

Aside: implementing Nock in the SK combinator calculus

If we’re opting for absolute simplicity, why not go the whole way? It’s possible to implement Nock in the untyped lambda calculus or in the SK combinator calculus.

It would be tempting to try the following method of encoding nouns:

  • The nonnegative integers at the leaves of noun trees can be represented using Church numerals: n is the higher-order function which maps its argument f to its n-fold functional composition, f^n.
  • The ‘cons cells’ [a b] can be represented as Church pairs.

However, unfortunately there isn’t a way to query whether a noun is an integer or a pair. As such, we’ll need to use a slightly more complicated encoding of Nock nouns where we ‘tag’ each noun with its type:

  • encode_noun([a b]) := pair(0, pair(encode_noun(a), encode_noun(b)))
  • encode_noun(n) := pair(1, church_numeral(n))

The unary functions used to define the Nock interpreter can now be themselves implemented as combinators, up to and including the Nock interpreter * itself.

For example, the question-mark operator can be implemented by taking the first element of the Church pair (which is the Church numeral 0 or 1) and then ‘boxing’ it as a Nock noun instead of a ‘bare’ Church numeral:

  • ?(x) := pair(1, first(x))

where the ‘pair’ and ‘first’ functions are defined in the Wikipedia article.

The incrementation operator needs to unbox the integer, take its successor, and then rebox it, like so:

  • +(x) := pair(1, succ(second(x)))

Technically, we’ve cheated here, because according to the Nock definition, +(x) needs to also be defined on ordered pairs as well as integers. In particular, we need to use first(x) to extract a Church numeral specifying whether x is an ordered pair or an integer and branch on that: if the Church numeral is 1, then we can just return pair(1, succ(second(x))); if the Church numeral is 0, then we need to return +(x). The self-reference can be removed from this recursive definition by means of a fixed-point combinator.

The remaining operators are all recursive, so they’ll similarly require the fixed-point combinator in their implementations.

It would be a fun exercise to attempt to write a jet-propelled emulator for the SK combinator calculus, including optimisations such as internally representing Church numerals as integers (with a library such as GMP to allow arbitrarily large integers) and using jets to call the GMP implementations of arithmetic operations when appropriate.

By using the ‘compress identical subtrees’ idea, familiar from Hashlife and libraries for manipulating binary decision diagrams, we avoid excessive encoding overhead as well as enabling constant-time recognition of (standard implementations of!) functions such as the decrement operator.

Posted in Uncategorized | 3 Comments