Elections…Trust?…Verify!

class: center, middle, inverse, title-slide

# Elections…Trust?…Verify!
## Statistical audits of election results
### Damjan Vukcevic
### SSA Vic / University of Melbourne public lecture
### 30 March 2022

---

# Collaborators

## USA

- Ron Rivest
- Philip Stark

## Melbourne

- Michelle Blom
- Peter Stuckey
- Vanessa Teague

## Students

- Floyd Everest
- Calvin Huang

---

# Overview

1. Why?
2. How?
3. More complex elections
4. Australian elections

---
class: inverse, center, middle

# Why audit?

---

# Elections

- Foundational for democracy
- Can be challenging to run effectively

## Elections in Australia

- Inclusive
- Efficiently run
- Highly trusted

(...relatively speaking)

---

# Two key ingredients

1. Paper ballots
2. Scrutineering

Old fashioned?  Inefficient?

Pressure to:

- Automate, speed up, reduce costs
- Replace people with computers

---

# USA horror stories

[Serious Error in Diebold Voting Software Caused Lost Ballots in California County](https://www.wired.com/2008/12/unique-election/) (2008)

[Wisconsin Election Surprise: David Prosser Gains 7,500 Votes After ‘Human Error’ In Waukesha County](https://www.huffingtonpost.com.au/2011/04/07/david-prosser-wisconsin-supreme-court_n_846431.html) (2011)

[Russians hacked 2 Florida voting systems](https://www.politico.com/states/florida/story/2019/05/14/russians-hacked-2-florida-voting-systems-fbi-and-desantis-refuse-to-release-details-1015772) (2019)

## Move to electronic systems

Rapid computerisation of elections since 2000 presidential election ('hanging chads')

Abandoned many previous verification processes

---

# NSW iVote failures

iVote: electronic voting system used in NSW

NSW local council elections, Nov/Dec 2021:

- System went down for several hours
- Many voters were unable to vote

Also previously:

* Encryption flaws discovered by Vanessa and colleagues  
  (Thomas Haines, Sarah Jamie Lewis, Olivier Pereira)

---

# Australian Senate elections

Ballot papers are handled electronically (for some parts of the process):

- Scanned
- Digitised
- Counted (digitally)

What could go wrong?

- Machines are proprietary and closed-source
- Scanning and counting not fully open to scrutiny(?)

---

# Don't trust your scanner

[Xerox bug](https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning):

.pull-left[
.center[

![](https://www.dkriesel.com/_media/blog/2013/kostreg-vorher.png)

]
]

.pull-right[
.center[

![](https://www.dkriesel.com/_media/blog/2013/kostreg-nachher.png)

]
]

Other examples:

- Hole-punch filling
- Image manipulation attacks

In the Appendix:

- Senate errors experiment

---

# Recommendations and changes

Jan 2018:  
Australian National Audit Office [recommended a 'statistically valid audit'](https://www.anao.gov.au/work/performance-audit/aec-procurement-services-conduct-2016-federal-election)

Nov 2021:  
New legislation passed (Assurance of Senate Counting), requiring:

* A statistical check of the accuracy of the digital data
* Not a 'full' audit

(more details at the end)

---

# Reasons to conduct an audit

Detect inadvertent errors

Detect deliberate manipulation

Increase confidence in reported result

Increase **trust** in our electoral processes

---
class: inverse, center, middle

# Introduction to election audits

---

# Types of audits

* Process/compliance audits

* Results audits

- **Ballot-polling** audits

- Comparison audits

- Hybrid audits

---

# Ballot-polling audits

Like a standard poll

We 'survey' the ballots

We need:

* A trustworthy paper trail (i.e. paper ballots)
* A list/record of all ballots
* Ability to retrieve specific ballots

Typically sample **without** replacement

...but can approximate by assuming sampling with replacement

---
class: clear, center, middle

.font180[Let's do it...]

---
class: clear, center, middle

![](images/alice-bob.png)

[Alice and Bob 'retired'](https://www.telegraph.co.uk/news/2021/09/04/woke-wars-edinburghs-computer-scientists-banned-using-alice/)

---
class: clear, center, middle

.font180[Let's do it...]

Albo vs Scott

---

# Inference for sample proportion

True vote proportion for a particular candidate (reported winner):
`$$p = \Pr(X = \text{Albo})$$`

Sample (assume iid, and no invalid votes):
`$$X_1, X_2, \ldots, X_n \in \{\text{Albo}, \text{Scott}\}$$`

Sample tally:
`$$Y = \text{votes for Albo} = \sum_{i=1}^j \mathrm{1}(X_i = \text{Albo})$$`

Sample proportion:
`$$\hat{p} = \frac{Y}{n}$$`

Sampling distribution:

`$$Y \sim \mathrm{Binomial}(n, p)$$`

---

# Binomial distribution (n = 20)

![](index_files/figure-html/binomial-pmf1-1.png)

---

# Binomial distribution (n = 100)

![](index_files/figure-html/binomial-pmf2-1.png)

---

# Estimation

Suppose we got `$y = 13$` out of `$n = 20$` sampled ballots.
`$$\hat{p} = \frac{13}{20} = 0.65$$`

`$$\text{95% confidence interval: } (0.41, 0.84)$$`

What if we got `$y = 65$` out of `$n = 100$`?

`$$\hat{p} = \frac{65}{100} = 0.65$$`
`$$\text{95% confidence interval: } (0.55, 0.74)$$`

---

# Traditional post-election audits

Sample a **fixed** number (proportion) of ballots

## Example

Californian law 1965:

1% of ballots to be checked by hand

...and what next?

---

# Outcomes

## Possible decisions

* 🏆 **Certify** the reported election result

* ⌛ Request a **recount** of the votes

## Outcomes

|                            | Certify  | Recount  |
|----------------------------|:--------------------:|:--------------------:|
| Reported result is wrong   | 👎    | ✅ |
| Reported result is correct | ✅ | 👎    |

---

# Formalisation as a hypothesis test

|                                        | Reject `$H_0$` (certify) | Do not reject `$H_0$` (recount) |
|----------------------------------------|:----------------------:|:-----------------------:|
| `$H_0\colon$` Reported result is wrong   | Type 1 error (👎) |   ✅  |
| `$H_1\colon$` Reported result is correct |  ✅  | Type 2 error (👎)  |

In our example:
`$$\begin{cases}
H_0\colon p \leqslant 0.5, & \text{Albo did not win} \\
H_1\colon p  >        0.5, & \text{Albo won}
\end{cases}$$`

Error rates:
`$$\begin{cases}
\alpha = \Pr(\text{Type 1 error}) = \text{Miscertification rate} \\
\beta  = \Pr(\text{Type 2 error}) = \text{Unnecessary recount rate}
\end{cases}$$`

---

# Tradition is inefficient

## Landslide election

Very small `$\alpha$` and `$\beta$`

`$\Rightarrow$` Could have used a smaller sample

## Close election

Large `$\alpha$` or `$\beta$`

`$\Rightarrow$` May need a larger sample

## A smarter approach?

Want to sample **as few ballots as possible**,
until we have sufficient evidence

---

# Sequential hypothesis test

Collect only **enough** samples to make a decision

---

# Sequential hypothesis test

.pull-left[

## Statistics

Let `$X = 1$` be a vote for the reported winner (Albo), `$X = 0$` otherwise.

Running tally:
`$$Y_j = \sum_{i=1}^j X_i$$`

Sample proportions:
`$$\hat{p}_j = \frac{Y_j}{n}$$`

]

.pull-right[

## Procedure

1. Sample some ballots

2. Calculate test statistic

3. Evaluate stopping rule, and do one of:

a) Accept `$H_0$`

b) Accept `$H_1$`

c) Insufficient evidence,  
       keep sampling (go to step 1)

]

---

# Sequential probability ratio test (SPRT)

Developed by Wald (1945)

`$$\begin{cases}
H_0\colon p = p_0 \\
H_1\colon p = p_1
\end{cases}$$`

Test statistic (likelihood ratio):
`$$S_n = \frac{\Pr(X_1, \ldots, X_n | H_1)}
             {\Pr(X_1, \ldots, X_n | H_0)}
    = \frac{p_1^{Y_n} (1 - p_1)^{n - Y_n}}
           {p_0^{Y_n} (1 - p_0)^{n - Y_n}}$$`

Stopping rule:
`$$\begin{cases}
S_n \leqslant \frac{\beta}{1 - \alpha} & \Rightarrow \text{accept } H_0 \\
S_n \geqslant \frac{1 - \beta}{\alpha} & \Rightarrow \text{accept } H_1 \\
\text{otherwise} & \Rightarrow \text{keep sampling}
\end{cases}$$`

This controls type 1 error, `$\alpha$`, and type 2 error, `$\beta$`

Is **most efficient** (smallest mean sample size) for the given hypotheses

---

# Example

## 2008 USA presidential election

Winner's vote proportion = 61% (Obama)

Expected sample size = 97 (0.0007% of votes)

(assuming a hypothetical, single-electorate election)

---

# Risk-limiting audits (RLAs)

A procedure that does one of:

1. Certify a reported election outcome (if strong evidence in favour)
2. Escalates to a full manual count (otherwise), which reveals the true outcome

Key properties:

* Never changes correct outcomes
* Corrects wrong outcomes 'most' of the time
  (modifiable guaranteed lower bound)

Limiting risk:

* **Risk** = type 1 error
* RLAs limit the probability of accepting an incorrect outcome
  (i.e. control type 1 error)

The sample size required is variable

---

# BRAVO

An RLA by [Lindeman, Stark & Yates (2012)](https://www.usenix.org/conference/evtwote12/workshop-program/presentation/lindeman)

'Ballot-polling Risk-Limiting Audits to Verify Outcomes'

For 2-candidate contests, equivalent to SPRT with:

- `$p_0 = \frac{1}{2}$`
- `$p_1 =$` reported proportion (possibly adjusted)
- `$\beta = 0$`

Never accept `$H_0$` (reported outcome is incorrect),
until we have seen **all** of the ballots.

We can stop sampling at any time and initiate a recount.

---

# BRAVO

---
class: inverse, center, middle

# More complex elections

---

# 'First past the post'

![](index_files/figure-html/plurality-example-barplot-1.png)

---

# Parameter space with `$k$` candidates

Parameter space:
`$(p_1, p_2, \ldots, p_k)$`, with `$p_1 + p_2 + \cdots + p_k = 1$`

Assume candidate 1 is the reported winner

Hypotheses:
`$$\begin{cases}
H_0\colon \; p_1 \leqslant p_j, & \text{for some } j > 1 \\
H_1\colon \; p_1 >         p_j, & \text{for all } j > 1
\end{cases}$$`

---

# Parameter space with 3 candidates

.center[

![](index_files/figure-html/3cand-ternary-plot-1.png)

]

---

# Pairwise comparisons

Consider each (winner, loser) pair

Need to test:
`$$\begin{cases}
H_0\colon p_\text{winner} \leqslant p_\text{loser} \\
H_1\colon p_\text{winner} > p_\text{loser}
\end{cases}$$`

Test all such pairs, until **every** `$H_0$` is rejected

Can use the **same** sample for all tests

No 'multiple testing' penalty!

---

# Risk limit with multiple comparisons

Certify only if *every* `$H_0$` is rejected

Type 1 error (risk) is bounded by the error for a single comparison

For any `$j$`:
`$$\Pr(\text{reject } H_{01},
    \text{reject } H_{02}, \ldots,
    \text{reject } H_{0m})
\leqslant
\Pr(\text{reject } H_{0j})$$`

---

# More general elections

`$k =$` number of **ballot types**

Parameter space:
`$$\{(p_1, p_2, \ldots, p_k) : p_1 + p_2 + \cdots + p_k = 1\}$$`

Voting rule ('social choice function'):
`$$g(p_1, p_2, \ldots, p_k) = \{\text{winning candidates}\}$$`

Hypotheses for auditing:
`$$\begin{cases}
H_0 = \{\theta : g(\theta) \neq \{\text{reported winners}\}\}, \\
H_1 = \{\theta : g(\theta)   =  \{\text{reported winners}\}\}.
\end{cases}$$`

Some challenges:

- Parameter space can become very high-dimensional
- Some voting rules can be difficult to analyse

---

# SHANGRLA

[Stark (2020)](https://doi.org/10.1007/978-3-030-54455-3_23), with
[software](https://github.com/pbstark/SHANGRLA/) contributions from several
authors

'Sets of Half-Average Nulls Generate Risk-Limiting Audits'

General framework for RLAs

Decouple into separate problems:

1. Voting rules `$\rightarrow$` 'assertions' & 'assorters'
2. Statistical tests `$\rightarrow$` canonical form (univariate, nonnegative)

More details in the Appendix

---

# Testing assertions

Tests always of the same form

`$$\begin{cases}
H_0\colon \text{assorter mean} \leqslant 1/2 \\
H_1\colon \text{assorter mean} > 1/2
\end{cases}$$`

Need a distribution-free test for the mean of a non-negative random variable.

---

# Martingales

Non-negative (super-)martingales

Sequence of random variables `$Z_j$`, `$j = 1, 2, \ldots$`, such that

- `$Z_j \geqslant 0$`

- `$\mathbb{E}(Z_j) < \infty$`

- `$\mathbb{E}(Z_{j+1} \mid Z_1, \ldots, Z_j) = (\leqslant) Z_j$`

Examples of martingales:

* A gambler's fortune in fair game
* An unbiased random walk (can be negative, or can stop at 0)

---

# Ville's inequality (1939)

If `$(Z_j)$` is a non-negative super-martingale, then for any `$\alpha \in (0, 1]$`
and all `$J \in {1, \ldots, N}$`,
`$$\Pr\left(\max_{1 \leqslant j \leqslant J} Z_j \geqslant \frac{1}{\alpha}\right)
\leqslant \alpha \, \mathbb{E}(Z_J)$$`

How do we use it?

* Construct a `$(Z_j)$` with mean 1
* Reject `$H_0$` when `$Z_j \geqslant 1 / \alpha$` for some `$j$`
* `$\Rightarrow$` risk `$\leqslant \alpha$`

---

# SHANGRLA martingales (old)

Kaplan–Markov
`$$S_n = \prod_{i = 1}^n \frac{X_i + \gamma}{0.5 + \gamma}$$`

Kaplan–Wald
`$$S_n = \prod_{i = 1}^n
      \left(\gamma \left[\frac{X_i}{0.5} - 1\right] + 1\right)$$`

KMart
`$$S_n = \int_0^1 \prod_{i = 1}^n
      \left(\gamma \left[\frac{X_i}{0.5} - 1\right] + 1\right) d\gamma$$`

---

# ALPHA

'Audit that Learns from Previously Hand-Audited Ballots'

`$$S_n = \prod_{i=1}^n
      \left(\frac{X_i}{\mu_i} \cdot \frac{\eta_i - \mu_i}{u_i - \mu_i} +
       \frac{u_i - \eta_i}{u_i - \mu_i}\right)$$`

See [Stark (2022)](https://arxiv.org/abs/2201.02707) for details

## Short course

Philip will present martingale-based methods in a short course on 12-14 April (dates TBC)

Look out for an announcement, or **ask me if interested!**

---

# Complications and extensions

## Real-world issues

- Invalid votes
- Missing ballots
- Contest not on ballot
- Stratified sampling

## Other types of election audits

- Comparison
- Batch comparison

(can be handled by SHANGRLA)

---

# RLAs in the 'wild' (in the USA)

![](https://www.stat.berkeley.edu/~stark/Seminars/Audit16/napa-rivest-stark.jpg)

---
class: inverse, center, middle

# Auditing Australian elections

---
class: clear, center, middle

![](images/irv-ballot-small.png)

---

# Australian voting systems

* **Instant-runoff voting** (IRV)  
  House of Representatives, state lower house elections

* **Single transferable vote** (STV)  
  Senate, state upper house elections

---

# STV example

![](images/stv-example.png)

---

# Auditing IRV

RAIRE: 'Risk-limiting Audits for Instant Runoff vote Elections'

See [Blom et al. (2019)](https://arxiv.org/abs/1903.08804)

Uses the SHANGRLA framework and a branch-and-bound algorithm

(Some details in the Appendix)

IRV RLA pilot in San Francisco, 2019

---

# Auditing STV?

Three strategies:

- Use SHANGRLA, by deriving assertions for STV
- Develop a Bayesian audit and derive a risk-limit
- Derive martingales specifically tailored for an IRV/STV election

Some progress:

- [2-seat STV](https://arxiv.org/abs/2112.09921)
* Dirichlet-tree models

---

# Dirichlet-tree

![](images/dirichlet-tree.png)

---

# Dirichlet-tree for IRV ballots

![](images/dirichlet-tree-irv.png)

---

# Recent legislation

Nov 2021: **Assurance of Senate Counting** Bill

“The Electoral Commissioner must arrange for statistically significant samples
of ballot papers to be checked throughout the scrutiny of votes for the
election to assure that the electronic data used in counting the votes reflects
the data recorded on the ballot papers.”

“The Electoral Commissioner must ensure that…at least 5,000 ballot papers in
total are checked” [for a typical election]

Some details vague:

- What is **'statistically significant'**?

- Which errors will be checked?  
  (What about errors that don't influence the count?)

---
class: inverse, center, middle

# Join us!

---

# Positions available

ARC project:
*[In for the count: Maximising trust and reliability in Australian elections](https://dataportal.arc.gov.au/NCGP/Web/Grant/Grant/DP220101012)*

**Postdoc position**  
2 years, full-time  
[Applications close on 7 Apr](https://jobs.unimelb.edu.au/en/job/908247/postdoctoral-research-fellow)

**PhD project**  
Generous top-up scholarship: up to $10k p.a.  
[Applications welcome](https://apps.eng.unimelb.edu.au/research-projects/index.php?r=site/webView&id=820)

---
class: clear, center, middle

.font180[Questions?]

---
class: inverse, center, middle

# Appendix

---

# Senate errors experiment

Victorian Senate 2019

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Mismatches </th>
   <th style="text-align:right;"> Count </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>

.center[

**Rough** estimate of per-digit error: (0.03%, 0.38%)

]

---

# Simulating errors

Tasmanian Senate 2016

![](images/senate-errors-simulation.png)

---

# Example applications of sequential tests

- Quality monitoring in manufacturing
- Variable-length computer tests of human subjects
- A/B testing for websites/online services
- Drug/vaccine safety surveillance
- Group sequential testing in clinical trials
- 'Alpha spending' functions
- Meta-analyses that are (expected to be) periodically updated
- Election auditing

---

# Seeking optimality

By definition, RLAs:

- Bound the risk (type 1 error `$\leqslant \alpha$`)
- Never accept `$H_0$` (type 2 error `$= 0$`)

**Efficiency** is measured by the required sample size

Since this is random, we can summarise in various ways:

- Mean
- 90th percentile

---

# SHANGRLA elements

- Assertions

* Mathematical statements about all ballots in a contest

- Assorters

* Mathematical functions on ballots: `$b \mapsto h(b) \geqslant 0$`
    * Correspond to assertions
    * Assertion is true `$\Leftrightarrow$` assorter mean `$> 1/2$`

- Statistical tests are all of the same form:
`$$\begin{cases}
H_0\colon \text{assorter mean} \leqslant 1/2 \\
H_1\colon \text{assorter mean} > 1/2
\end{cases}$$`

---

# Example: Albo vs Scott

* 2-candidate contest
* Reported winner: Albo

Assertion  
`$\quad p_\text{Albo} > p_\text{Scott}$`

Assorter

- Albo `$\mapsto 1$`
- Scott `$\mapsto 0$`
- Invalid vote `$\mapsto 0.5$`

---

# Example: Hesse local elections

![](images/hesse-ballot.png)

---

# Example: Hesse local elections

Party-list proportional representation contest

- Can vote for **multiple** candidates
- Can give **multiple** votes for each candidate

Need **pairwise difference** assertions:

`$$p_A > p_B + d$$`

An assorter for this:

`$$h(b) = \frac{b_A - b_B - d \cdot b_T + m_\mathcal{L} \cdot (1 + d)}
              {2 m_\mathcal{L} \cdot (1 + d)}$$`

See [Blom et al. (2021)](https://doi.org/10.1007/978-3-030-86942-7_4) for details

---

# RAIRE assertions

Two types of assertions:

1. Candidate `$i$` has more first-place ranks than candidate `$j$` has total 'mentions'.

2. After a set of candidates `$E$` have been eliminated from consideration,
candidate `$i$` is ranked higher than candidate `$j$` on more ballots than vice
versa.

Branch-and-bound algorithm to optimise the set of assertions.

RAIRE assertions provide only *sufficient* conditions, but not *necessary*
conditions.

---

# RLAs in the 'wild' (in the USA)

Endorsed by:

- National Academies of Sciences, Engineering, and Medicine
- American Statistical Association
- PCEA, LWV, CC, VV,...

Statewide use in:

- Colorado since 2017
- Alaska, Kansas, Wyoming since 2020

~60 pilot audits across 17 states

Legislated in 11 states

---

# Auditing via Bayesian inference

Back to Albo vs Scott...
`$$p = \Pr(X = \text{Albo})$$`

Specify a **prior distribution** on `$p$`, e.g. uniform
`$$f(p) = 1, \quad p \in [0,1]$$`

Data + prior `$\rightarrow$` **posterior distribution** on `$p$`

'Upset probability': `$$\Pr(H_0 \mid X_1,\ldots,X_n)$$`

Certify election if upset probability is low enough
`$$\Pr(H_0 \mid X_1, \ldots, X_n) < v$$`

---

# Bayes factor

Can reformulate in terms of the **Bayes factor** (BF)

`$$S_n = \frac{\Pr(X_1, \ldots, X_n \mid H_1)}
           {\Pr(X_1, \ldots, X_n \mid H_0)}
    = \frac{\int_{0.5}^1 p^{Y_n} (1 - p)^{n - Y_n} f(p) dp}
           {\int_0^{0.5} p^{Y_n} (1 - p)^{n - Y_n} f(p) dp}$$`

Certify election if `$S_n > h$`

(Note similarity to the SPRT)

---

# Correspondence with the SPRT

Choose the following prior:
`$$f(p) = \begin{cases}
\frac{1}{2}, & \text{if } p = p_0 \\
\frac{1}{2}, & \text{if } p = p_1 \\
          0, & \text{otherwise}
\end{cases}$$`

`$\Rightarrow$` same as the SPRT

Also, can express some Bayesian quantities as martingales

Provides a unified perspective

---

# Some recent work

For ballot-polling audits of 2-candidate contests with no invalid votes:

* Benchmarked several methods
* Bayesian audits are risk-limiting

See [Huang et al. (2020)](https://arxiv.org/abs/2008.08536) for details

---

# Performance comparison example

---

# Properties, not dichotomies

'Bayesian' and 'risk-limiting' are not mutually exclusive

But...
* Upset probability `$\neq$` risk limit

---

# Bayesian audits for complex elections

Just follow the 'recipe'?

* Specify the model

* Certify election if upset probability is low enough
`$$\Pr(H_0 \mid X_1, \ldots, X_n) < v$$`

Some challenges:

* Calculations can get involved
* Need to choose a sensible prior
* Not necessarily risk-limiting

---

# Open questions

Which voting rules cannot be encoded as assertions?

For which voting rules is there:

- a **sufficient** set of assertions?
- a **necessary and sufficient** set of assertions?

Are all sets of necessary and sufficient conditions equally expensive to audit?

Are any voting rules intrinsically harder to audit than others?

Are there optimal tests for testing SHANGRLA assertions, for various types of
elections?

Are Bayesian audits risk-limiting for more complex voting rules?

If so, when can we easily calculate the maximum risk and calibrate the audit?

How does a 'direct' inference compare (in terms of efficiency) to
assertion-based approaches?