zk-Git: Proving authorship of Git contributions using zk-SNARKs


This post is a summary of the zkGit report. The project was supported by a grant from the Ethereum Foundation, to which the authors would like to express their gratitude.

Let me start this story at the beginning, then spoil the end, and finish with the fascinating middle. This is the tale of three zk-SNARK engineers trying to answer the following question: is it possible to prove Git contributions to open-source projects while maintaining anonymity? You see…

The beginning: statements about Git contributions.

Open-source contributions can be many things: they make excellent CV material, are a testament to a developer’s skills and dedication, and can make the difference between admission and rejection in selection processes of all sorts. At times, they are even linked to direct monetary compensation, as in the case of cryptocurrency airdrops. Our society, however, is learning the hard way that personal information is precious and that giving it away may have subtler consequences down the line than one could anticipate. This applies to open-source work as well: someone’s exact GitHub contributions might reveal political inclinations, personal connections, employment history and timing, and much more. But is keeping contribution data private actually incompatible with claiming open-source work?

That is, in broad terms, the question that kickstarted this project. To put it differently: can someone prove statements about their Git contributions without revealing what those exactly are? Example statements one may be interested in convincing another party of are:

  • I have contributed at least 5 commits to this specific repository (but I will not reveal my identity or which commits those are).
  • I have contributed to at least one of these three repositories (but I will not reveal which one).
  • I had contributed at least two commits to this specific repository by 4/2/2020.

Those statements are rather vague - much more so than it might initially seem. The first step was therefore to delineate specific boundaries for the protocol we were to build. Especially relevant are the following:

  1. The user (in the sequel, the prover) will only be able to prove contributions if these are signed. The cornerstone of the proof will be the knowledge of the private key that was used to sign the commit in question. As a reminder, whenever a developer creates a commit, they can optionally sign it using a private key known only to them. That secret key comes paired with a public key known to everybody, which can be used to confirm that the signature was produced by the holder of the associated private one.

  2. The party verifying those proofs (in the sequel, the verifier) should only need to store a small amount of Git-related information beforehand. Ideally, the only piece of data should be the “head” of the (master or other branches of the) relevant Git repositories. Recall that each commit in a Git repository is uniquely identified by a SHA1 (technically, SHA1-DC) hash called an index, and the head of a branch is simply the index of its most recent commit.

  3. The verifier should not need to interact with the repository in real-time when checking a proof: the minimal stored data described in point 2 should suffice.

The reasons for these constraints range from technical all the way to usability-related. Although we will not delve deep into the motivation, some words on requirement 2 are particularly relevant. Note that a natural way to prove and verify contribution statements of the types described earlier would be for the verifier to hold the signatures of all commits in all repositories they wanted to learn statements about. However, this is an unreasonable expectation in highly data-constrained environments such as blockchains. This is a setting where we envisioned contribution-proving systems potentially being deployed, and one where storing even small amounts of data has non-negligible economic costs. This justifies aiming for a proof system where the verifier only needs to know a fixed SHA1 hash for each repository of interest, frozen at a chosen point in time. The blockchain use case also illustrates the need for constraint 3: interactions of smart contracts with APIs are very problematic.
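To make the commit index from requirement 2 concrete, here is a minimal Python sketch of how Git derives an object identifier: it hashes a short header (object type, body length, a zero byte) followed by the object body. The commit contents below (names, emails, timestamps, parent indices) are hypothetical, and plain SHA-1 from `hashlib` stands in for SHA1-DC, which is expected to agree with it on ordinary inputs.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 identifier Git assigns to an object of the given type."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(parent: str, message: str) -> bytes:
    """Build the body of an unsigned commit object (all field values are made up)."""
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        f"parent {parent}\n"
        "author Alice <alice@example.com> 1580774400 +0000\n"
        "committer Alice <alice@example.com> 1580774400 +0000\n"
        "\n"
        f"{message}\n"
    ).encode("utf-8")

# Sanity check against Git's identifier for the empty blob.
EMPTY_BLOB_ID = git_object_id("blob", b"")

# Two commits that differ only in their parent index get different indices.
id_a = git_object_id("commit", commit_body("1" * 40, "Fix typo"))
id_b = git_object_id("commit", commit_body("2" * 40, "Fix typo"))

print(EMPTY_BLOB_ID)
print(id_a != id_b)
```

Because the parent index is part of the hashed body, the head of a branch commits to the entire history behind it, which is why storing a single hash per repository can be enough for the verifier.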

The end: an impossibility result

The anticlimactic conclusion of this story is that, as we understood a few weeks into the design, a protocol with the characteristics outlined above (and more thoroughly covered in the report) is simply unachievable. At least, this is the case for a protocol meant to prove statements about any GitHub repository - and GitHub is the de facto standard platform for open-source projects. Although the reasons and case analysis are complex, a reasonable attempt to succinctly summarise the issue is the following:

Signature information is frequently lost when contributions are merged into open-source repositories.

Let us briefly expand on that statement. Suppose Alice wants to contribute to the canonical/multipass GitHub repository. She forks it, creates a new branch on her local machine, makes a game-changing commit and opens a pull request to the original repository. A maintainer reviews it and happily merges the contribution into the upstream main branch. Unfortunately, in the resulting commit object, Alice appears as neither the author nor the committer. These are two fields stipulated by the Git standard, and in particular it is the committer that signs the commit. In the case of GitHub, depending on the specific repository, this committer can either be the maintainer or GitHub itself (with a special identity and public key). Alice’s username and email will indeed be mentioned in the message field of the commit object, but this mention is conventional and not a distinguished Git field.

Of course, Alice’s merit is not lost: if one explores the open-source GitHub repository, her username will indeed appear in the website’s interface. It could also, for instance, be verified using the GitHub API. However, the fact that she contributed is gone on the Git-protocol level (i.e. in the commit objects). Unfortunately, the idea of using API calls to check her contribution directly clashes with the protocol’s privacy goals, and it is furthermore incompatible with constraints 2 and 3 from the previous section.
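The situation described above can be illustrated with a small Python sketch that parses a raw commit object into its header fields and message. The commit text is entirely hypothetical (names, emails, hashes and the signature are placeholders), and the parser is a simplification that keeps only the last value of a repeated header such as `parent`.

```python
def parse_commit(raw: str) -> tuple[dict[str, str], str]:
    """Split a raw commit object into its header fields and its message."""
    head, _, message = raw.partition("\n\n")
    headers: dict[str, str] = {}
    key = None
    for line in head.split("\n"):
        if line.startswith(" ") and key is not None:
            # Continuation line of a multi-line header such as gpgsig.
            headers[key] += "\n" + line[1:]
        else:
            key, _, value = line.partition(" ")
            headers[key] = value
    return headers, message

# Hypothetical commit created by a platform when merging Alice's pull request.
RAW_COMMIT = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "parent 1111111111111111111111111111111111111111\n"
    "author Maintainer <maintainer@example.com> 1580774400 +0000\n"
    "committer Platform <noreply@example.com> 1580774400 +0000\n"
    "gpgsig -----BEGIN PGP SIGNATURE-----\n"
    " (placeholder signature bytes)\n"
    " -----END PGP SIGNATURE-----\n"
    "\n"
    "Add game-changing feature\n"
    "\n"
    "Co-authored-by: Alice <alice@example.com>\n"
)

headers, message = parse_commit(RAW_COMMIT)
print(sorted(headers))          # the distinguished fields of the commit object
print("Alice" in headers["author"], "Alice" in headers["committer"])
print("Alice" in message)       # Alice only appears in the free-form message
```

In this sketch the signature covers a commit whose distinguished fields never mention Alice, so no key of hers is tied to the object that ends up in the history.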

The middle: a hash-chain protocol

Now that we understand the incompatibility between our constraints and the mode of operation of Git platforms such as GitHub, we can reformulate the task at hand as: are there reasonable assumptions under which a protocol satisfying our restrictions can exist? And, if so, what could such a protocol look like? The answer to the first question is indeed positive. We propose an elegant zk-SNARK that achieves those goals assuming certain properties of the target Git repository. Note that the description below is only a superficial outline: a detailed description can be found in the report linked at the top of the page.

The designed protocol succeeds (at least) if the repository of interest satisfies either of the following two conditions:

  1. When a user makes a contribution, they are actually the committer of the commit containing it.

  2. Contributions are merged using traditional merge commits - even if these are committed by someone other than the actual contributor.

Let us focus on the case where the first assumption holds. The central cryptographic tool of the scheme is the so-called Incrementally Verifiable Computation (IVC), which allows a prover to convince a verifier about the output of executing a certain function $F$ iteratively a chosen number of times $n$. Each iteration receives as input a state $s$ and a witness $\omega$ and outputs a new state $s'$, which is fed to $F$ in the next iteration - together with a new witness $\omega'$. An IVC proof convinces the verifier about the initial state $s_0$, the final iterated output $s_n$ and the number $n$ of iterations - without revealing the witnesses $\omega_i$ or (possibly) the intermediate states $s_1, \ldots, s_{n - 1}$.
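In symbols, and assuming the witnesses are indexed from $0$ (a convention chosen here for illustration), the computation covered by an IVC proof is the recurrence

$$
s_{i + 1} = F(s_i, \omega_i), \qquad i = 0, \ldots, n - 1,
$$

where the verifier sees only the triple $(s_0, s_n, n)$, while the witnesses $\omega_0, \ldots, \omega_{n - 1}$ stay with the prover.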

Suppose we construct a function $F$ satisfying the following:

  • Its input state $s$ is a commit index $\gamma$, that is, the SHA1 identifier of a commit.
  • Its witness $\omega$ consists of a set $\delta$ of partial commit data (object hashes, author, committer, message, etc.) minus the parent index, together with a private signing key $\kappa$.
  • $F$ signs the full commit data $(\gamma, \delta)$ with the private key, obtaining a signature $\sigma$; and then computes the SHA1 hash of $(\gamma, \delta, \sigma)$, outputting the state $s' = {\rm SHA1}(\gamma, \delta, \sigma)$.
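A plain, non-zero-knowledge Python sketch of such a function $F$ may help fix ideas. Two simplifications are assumptions of this sketch and not part of the protocol: `toy_sign` uses an HMAC as a stand-in for a deterministic signature scheme (it is not a public-key signature), and the commit data is serialised by simple concatenation instead of Git's actual commit format.

```python
import hashlib
import hmac

def toy_sign(key: bytes, payload: bytes) -> bytes:
    """Stand-in for a deterministic signature. NOT a real public-key scheme."""
    return hmac.new(key, payload, hashlib.sha256).digest()

def F(gamma: bytes, delta: bytes, kappa: bytes) -> bytes:
    """Sign (gamma, delta) with kappa, then hash (gamma, delta, sigma)."""
    sigma = toy_sign(kappa, gamma + delta)
    return hashlib.sha1(gamma + delta + sigma).digest()

# Hypothetical values: a parent index, partial commit data and two keys.
parent_index = bytes(20)
delta = b"tree ...; author ...; committer ...; message ..."
right_key, wrong_key = b"committer-secret", b"someone-else"

head = F(parent_index, delta, right_key)
print(F(parent_index, delta, right_key) == head)  # same key reproduces the index
print(F(parent_index, delta, wrong_key) == head)  # a different key does not
```

In the actual scheme this computation would be expressed as a circuit and proved inside a zk-SNARK, so that the key never leaves the prover.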

The last step above precisely recreates the way in which the Git protocol defines a commit’s index. The input commit index $\gamma$, which is hashed as part of the commit data, is precisely the index of the current commit’s parent. The first part of zkGit’s magic arises from the following two facts:

  • By the collision-resistance properties of cryptographic hash functions, the only way to obtain a publicly known commit index is by hashing the exact same data that originally produced that index - including the correct signature.
  • The only way to produce the correct signature when signing the original commit data (as guaranteed by the previous point) is with the same private key that was originally used.

Taken together, this implies that if the verifier successfully checks a proof that $F$ outputs the repository head they already know, then they can be confident that the prover genuinely knows the private key used to sign that commit. But what if the contribution is not the last commit in the repository, or the prover would like to prove authorship of more than one commit? It is here that IVC comes in and the second part of the trick unfolds:

  • First, augment the state $s = \gamma$ from the previous design into a pair $s = (\gamma, \tau)$, where $\tau \in \mathbb{N}$ acts as a “signature counter”.
  • Then, augment the witness $\omega = (\delta, \kappa)$ from before into a tuple $\omega = (\delta, \kappa, \pi, \beta)$, where $\pi$ is a “provided signature” and $\beta \in \{0, 1\}$ is a “selector bit”.
  • Finally, modify $F$ such that, if $\beta = 0$, the signature fed into the hashing algorithm is directly $\pi$; and, if $\beta = 1$, the hashed signature is instead produced using the provided key $\kappa$ as in the previous design. The $\tau'$ in the output state $s'$ is computed as $\tau' = \tau + \beta$.

With these new definitions, the prover can perform an iteration of $F$ for each commit in the Git history of interest, setting

  • $\beta = 1$ and $\kappa$ equal to their private key if they are the author of said commit.
  • $\beta = 0$ and $\pi$ equal to the (publicly known) signature if they are not.

IVC will link these iterations together thanks to the fundamental relation that each commit’s index (the $\gamma'$ contained in the output state $s'$ of each execution of $F$) appears as the parent index of the next commit (the $\gamma$ contained in the input state $s$ of the next execution of $F$). It is because of this connection that we have termed this approach “hash chain”.
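The chained construction can be sketched in plain Python as follows. This runs $F$ in the clear, whereas the real protocol would prove the same computation with IVC. As assumptions of the sketch, `toy_sign` is an HMAC standing in for a deterministic signature scheme, commit data is serialised by concatenation, and the keys, commit contents and starting index are hypothetical.

```python
import hashlib
import hmac
from typing import NamedTuple

class State(NamedTuple):
    gamma: bytes  # commit index
    tau: int      # signature counter

class Witness(NamedTuple):
    delta: bytes  # partial commit data (without the parent index)
    kappa: bytes  # private key, used only when beta == 1
    pi: bytes     # provided signature, used only when beta == 0
    beta: int     # selector bit

def toy_sign(key: bytes, payload: bytes) -> bytes:
    """Stand-in for a deterministic signature. NOT a real public-key scheme."""
    return hmac.new(key, payload, hashlib.sha256).digest()

def F(s: State, w: Witness) -> State:
    """One iteration: pick the signature according to beta, hash, bump the counter."""
    if w.beta not in (0, 1):
        raise ValueError("beta must be 0 or 1")
    sigma = toy_sign(w.kappa, s.gamma + w.delta) if w.beta == 1 else w.pi
    return State(hashlib.sha1(s.gamma + w.delta + sigma).digest(), s.tau + w.beta)

def run_chain(s0: State, witnesses: list[Witness]) -> State:
    s = s0
    for w in witnesses:
        s = F(s, w)
    return s

ALICE, BOB = b"alice-secret", b"bob-secret"
ROOT = bytes(20)  # hypothetical index preceding the first commit of interest

# Public history: three commits signed by Alice, Bob and Alice respectively.
history, gamma = [], ROOT
for delta, signer in [(b"c1", ALICE), (b"c2", BOB), (b"c3", ALICE)]:
    sigma = toy_sign(signer, gamma + delta)
    history.append((delta, sigma, signer))
    gamma = hashlib.sha1(gamma + delta + sigma).digest()
head = gamma  # what the verifier stores

# Alice re-signs her own commits and passes Bob's public signature through.
honest = [Witness(d, ALICE, b"", 1) if k == ALICE else Witness(d, b"", sig, 0)
          for d, sig, k in history]
final = run_chain(State(ROOT, 0), honest)

# Claiming Bob's commit as well breaks the chain: the head no longer matches.
cheating = [Witness(d, ALICE, b"", 1) for d, _, _ in history]
bad_final = run_chain(State(ROOT, 0), cheating)

print(final.gamma == head, final.tau)
print(bad_final.gamma == head)
```

The counter only increases on iterations where the signature was recomputed from a key, so reaching the known head with a counter increase of 2 requires the right key for two of the commits.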

Suppose the verifier receives an IVC proof as above with final state $s_n = (\gamma_n, \tau_n)$ and verifies its correctness. Then they can be confident that the prover supplied the actual private key used to sign $\tau_n - \tau_0$ of the commits in a (or in practical terms, by collision-resistance of the hash function: the unique) commit chain ending in index $\gamma_n$. All that remains is to ensure that $\gamma_n$ matches the repository head initially stored by the verifier, thus completing the verification of the proof of authorship. This concludes the outline of the protocol.

The above explanation leaves a few questions unanswered: is producing the original signature the correct way to prove knowledge of the private key? How should this scheme be adapted to repositories where assumption 1 from the beginning of this section does not hold, but assumption 2 does? How collision-resistant is SHA1-DC and how does this translate into the security of the scheme? All of this and much more is covered in detail in the full report linked above. As for potential improvements to or implementations of the protocol, only the future will tell!

Author: Antonio Mejías Gil