I am on day 3 – or is it 4? I’ve lost count – of a quarantine since coming down with mild flu-like symptoms on Tuesday. For about half of that time I have been asleep, but the other half has seen a parade of infuriating posts on social media, many of them from people who should know better. The theme of the moment seems to be that anyone who has installed matplotlib and knows how to call numpy.exp() is getting the crayons out and making predictions by extrapolating trends from very little information. It is dismaying indeed to see people who design parton showers for a living point at a scaling law and cry “we’re doomed” without asking what its region of validity is. And if you think this comment is aimed specifically at you, you are wrong: everyone is doing this right now, and I know a lot of people who design parton showers.
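To make the numpy.exp() jibe concrete, here is a toy sketch – all numbers invented, no real epidemiological data – of the failure mode. The “cases” actually follow a saturating logistic curve, but we fit a pure exponential to the first ten days, where growth still looks exponential, and then extrapolate:

```python
import numpy as np

# Invented ground truth: a logistic curve that saturates at 10,000.
t = np.arange(0, 30)
true = 10000.0 / (1.0 + np.exp(-0.4 * (t - 15)))

# Fit a pure exponential a*exp(b*t) to the first 10 days only,
# via a straight-line fit in log space.
early = slice(0, 10)
b, log_a = np.polyfit(t[early], np.log(true[early]), 1)
extrapolated = np.exp(log_a + b * t)

# Inside the fitted window the exponential describes the data well...
print("day 9 ratio (fit/true):", extrapolated[9] / true[9])
# ...but 20 days outside it, the extrapolation dwarfs the true value,
# which can never exceed the 10,000 cap.
print("day 29:", extrapolated[29], "vs true", true[29])
```

The fit is genuinely good in its region of validity; the absurdity only appears when you leave it, which is exactly the point.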
In response to this I would like to recount a cautionary tale that was a formative experience for me as a post-doc around ten years ago. I had not long joined the ATLAS experiment and was given the job of interfacing the then-new C++ version of Pythia 8 to the experiment’s software stack and testing it. Pythia is one of the main workhorses used for simulating a wide range of proton collisions, and version 8 was eventually to replace an older version 6 that was written in Fortran. Don’t ask about version 7.
After what must have been a year of careful testing and discussion with ATLAS’ Monte Carlo simulation group, we had come to the view that we could roll Pythia 8 out into production. Simulating collisions is an expensive business – it can take 20-30 minutes per collision – and we needed to generate samples of tens, if not hundreds, of millions of them. These expensive simulations would eventually be used in things like the analyses that discovered the Higgs boson.
One part of the simulation is pile-up, which I’ve talked about here. On top of every simulated collision, a bunch of lower-energy collisions is superimposed. Although most people don’t care very much about those additional interactions, it is necessary to simulate them because they add energy and particles to the observed data. If you don’t do that right then you risk screwing up, say, the energy estimate of a photon from a Higgs candidate event, which people very much do care about. At some point around 2011 we had done enough testing that we thought we could start using Pythia 8 to simulate pile-up, which means it gets overlaid on pretty much every simulated sample.
And… it was a disaster, and for a while we didn’t know why. We had diligently tested the simulation against quite a range of data, and it had performed competitively with the old model that it replaced. But when used in the new samples it produced quite a different distribution of particles and energy. Important calibration observables, like the sum of the energy deposited in the detector or the Z boson momentum, were all screwed up. People were angry and sad and it was our fault. There is nothing quite like the feeling that you may have personally been responsible for buggering up the simulations on a billion-dollar science experiment for the next year.
What had happened was that although we had tested against data, we had not accounted for the detector itself – mainly because simulating the detector was prohibitively expensive and we did not realise how important its effect was in this case. Because the proton beams must run through the middle of the detector, it cannot detect every particle that is emitted: some escape unseen through the holes where the beam line enters and exits. The detector is basically a cylinder, and particles emitted along its axis are never seen. Moreover, since the angles get tighter towards each end of the cylinder, the efficiency with which particles are detected gets worse the closer to the beam line they are emitted.
This meant that although we had tested and tuned the new model against data – both the rate at which it emitted particles and the angle at which they were emitted relative to the beam line – we had failed to account for the fraction of particles, and of entire events, that went completely unobserved. There is a model component called diffraction that is a big contributor to Pythia’s total interaction rate, but since it mainly produces particles close to the beam line, it adds little to the observed particle emissions in the central part of the detector. The total interaction rate is definitely of interest, but what matters for performing an analysis is the part of it that you can actually observe. The shape of the new model’s angular distribution was wrong, so too many particles were produced in the region outside the detector that we could never see. The older model had been ok because of a lucky conspiracy between its total interaction rate and the shape of its angular distribution.
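A toy Monte Carlo makes the “lucky conspiracy” point visible. The angular distributions below are invented for illustration – they are not the real Pythia models – and the |η| < 2.5 cut is only a rough stand-in for the central detector acceptance. Two models with different angular shapes can be tuned to give the same observed particle count while implying very different total rates:

```python
import numpy as np

rng = np.random.default_rng(0)
ACCEPTANCE = 2.5  # rough central-detector cut on pseudorapidity |eta|

# Two made-up "models" for the pseudorapidity of emitted particles:
# one narrow (mostly central), one broad (more escaping near the beam line).
models = {
    "narrow": rng.normal(0.0, 2.0, 1_000_000),
    "broad": rng.normal(0.0, 3.5, 1_000_000),
}

# Tune each model's total rate so the *observed* (central) count matches
# the same made-up measurement of 1000 particles per event.
target_observed = 1000.0
totals = {}
for name, eta in models.items():
    frac_seen = (np.abs(eta) < ACCEPTANCE).mean()
    totals[name] = target_observed / frac_seen
    print(name, "fraction seen:", round(frac_seen, 3),
          "implied total rate:", round(totals[name]))
```

Both models describe the visible data equally well, yet the broad model implies a substantially larger total rate – so pairing the wrong total rate with the wrong angular shape can, by accident, still reproduce what you see.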
We temporarily reverted to the old model while this was being understood, and in the end discovered that a scaling factor was needed to account for the unseen part of the cross section. New analysis data on the angular dependence of energy emission also became available (link), and we could use it to constrain the shape of the distribution. I still maintain that paper is one of ATLAS’ most under-appreciated, since without it there would be no good constraint on the pile-up modelling that is used everywhere.
This whole experience, as painful as it was, taught me one of the most valuable lessons I have ever learned: a model is only as useful as the data against which you’ve tested it. If you apply a model outside the region where it has been tested you might get lucky for a while, but you will eventually end up looking like an idiot. What’s worse, it is all too easy to use a model in an invalid region without realising that is what you are doing. ATLAS learned from this experience and put in place extremely thorough validation of every model release, and it is also an experience I have brought with me to my new job at Uizard.
The moral of the story should be fairly obvious, but to state it explicitly: even when you have copious good data, as we did at ATLAS, and you carefully check your models against that full range of data, it is possible to miss an important distribution or correlation and end up accidentally over-extrapolating. This kind of thing made me nervous even in the relatively low-stakes world of academic physics. Anyone doing this modelling for a pandemic should be absolutely terrified: not only is the data sparse and terrible, with a huge unobserved region, but if their models are wrong – and believed – then they risk causing a global depression for no good reason. Nobody in their right mind should volunteer to just knock one up for a few clicks.