covid-19 modeling, Youyang Gu, machine learning, data science


“It became clear that we’re not going to reach herd immunity in 2021, at least definitely not across the whole country,” he says. “And I think it’s important, especially if you’re trying to instill confidence, that we make sensible paths to when we can go back to normal. We shouldn’t be pegging that on an unrealistic goal like reaching herd immunity. I’m still cautiously optimistic that my original forecast in February, for a return to normal in the summer, will be valid.”

In early March, he packed up shop entirely—he figured he’d made what contribution he could. “I wanted to step back and let the other modelers and experts do their work,” he says. “I don’t want to muddle the space.”

He’s still keeping an eye on the data, doing research and analysis—on the variants, the vaccine rollout, and the fourth wave. “If I see anything that’s particularly troubling or worrisome that I think people aren’t talking about, I’ll definitely post it,” he says. But for the time being he is focusing on other projects, such as “YOLO Stocks,” a stock ticker analytics platform. His main pandemic work is as a member of the World Health Organization’s technical advisory group on covid-19 mortality assessment, where he shares his outsider’s expertise.

“I’ve definitely learned a lot this past year,” Gu says. “It was very eye-opening.”

Lesson #1: Focus on fundamentals

“From the data science perspective, my models have shown the importance of simplicity, which is often undervalued,” says Gu. His death forecasting model was simple in not only its design—the SEIR component with a machine-learning layer—but also its very pared-down, “bottom-up” approach regarding input data. Bottom-up means “start from the bare-bones minimum and add complexity as needed,” he says. “My model only uses past deaths to predict future deaths. It doesn’t use any other real data source.”
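To make the "past deaths in, future deaths out" idea concrete, here is a minimal sketch. It is not Gu's actual model. It pairs a toy discrete-time SEIR simulator with a plain grid search, which stands in for the machine-learning layer. Every name and parameter value here (population, incubation and infectious periods, death delay, the candidate range for R0) is an illustrative assumption, and the "past deaths" are synthetic.

```python
# Toy sketch only: all parameter values are illustrative assumptions,
# not values from Gu's model.

def simulate_daily_deaths(r0, days, population=1_000_000, initial_infected=10,
                          incubation_days=5.0, infectious_days=7.0,
                          ifr=0.01, death_delay_days=14):
    """Run a discrete-time SEIR model and return expected deaths per day."""
    beta = r0 / infectious_days      # transmission rate per day
    sigma = 1.0 / incubation_days    # exposed -> infectious rate
    gamma = 1.0 / infectious_days    # infectious -> removed rate

    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    new_infections = []
    for _ in range(days):
        exposed_today = beta * s * i / population
        infectious_today = sigma * e
        removed_today = gamma * i
        s -= exposed_today
        e += exposed_today - infectious_today
        i += infectious_today - removed_today
        r += removed_today
        new_infections.append(exposed_today)

    # Deaths are a fixed fraction (the IFR) of infections, shifted by a delay.
    deaths = [0.0] * days
    for day in range(death_delay_days, days):
        deaths[day] = ifr * new_infections[day - death_delay_days]
    return deaths


def fit_r0(past_deaths, candidates):
    """Pick the candidate R0 whose simulated deaths best match past deaths."""
    def squared_error(r0):
        simulated = simulate_daily_deaths(r0, len(past_deaths))
        return sum((a - b) ** 2 for a, b in zip(simulated, past_deaths))
    return min(candidates, key=squared_error)


# Synthetic "observed" deaths generated with a known R0, so the fit can be checked.
past_deaths = simulate_daily_deaths(2.5, days=60)
candidates = [k / 10 for k in range(10, 41)]   # R0 from 1.0 to 4.0
best_r0 = fit_r0(past_deaths, candidates)

# Forecast the next 30 days using only what was learned from past deaths.
forecast = simulate_daily_deaths(best_r0, days=90)[60:]
print(f"fitted R0: {best_r0}, forecast days: {len(forecast)}")
```

The only observed input is the series of past deaths. Everything else is either a fixed assumption or a parameter learned from that series, which is the bottom-up approach described above.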

Gu noticed that other models drew on an eclectic variety of data about cases, hospitalizations, testing, mobility, mask use, comorbidities, age distribution, demographics, pneumonia seasonality, annual pneumonia death rate, population density, air pollution, altitude, smoking data, self-reported contacts, airline passenger traffic, point of care, smart thermometers, Facebook posts, Google searches, and more.

“There is this belief that if you add more data to the model, or make it more sophisticated, then the model will do better,” he says. “But in real-world situations like the pandemic, where data is so noisy, you want to keep things as simple as possible.”

“I decided early on that past deaths are the best predictor of future deaths. It’s very simple: input, output. Adding more data sources will just make it more difficult to extract the signal from the noise.”

Lesson #2: Minimize assumptions

Gu considers that he had an advantage in approaching the problem with a blank slate. “My goal was to just follow the data on covid to learn about covid,” he says. “That’s one of the main benefits of an outsider’s perspective.”

But not being an epidemiologist, Gu also had to be sure that he wasn’t making incorrect or inaccurate assumptions. “My role is to design the model such that it can learn the assumptions for me,” he says.

“When new data comes along that goes against our beliefs, sometimes we tend to overlook that new data or ignore it, and that can cause repercussions down the road,” he notes. “I certainly found myself falling victim to that, and I know that lots of other people have as well.”

“So being aware of the potential bias that we have and recognizing it, and being able to adjust our priors—adjusting our beliefs if new data disproves them—is really important, especially in a fast-moving environment like what we’ve seen with covid.”

Lesson #3: Test the hypothesis

“What I’ve seen over the last few months is that anyone can make claims or manipulate data to fit the narrative of what they want to believe in,” Gu says. This highlights the importance of simply making testable hypotheses.

“For me, that is the whole basis of my projections and forecasts. I have a set of assumptions, and if those assumptions are true, then this is what we predict will happen in the future,” he says. “And if the assumptions end up being wrong, then of course we have to admit that the assumptions we make are not true and adjust accordingly. If you don’t make testable hypotheses, then there is no way to show whether you are actually right or wrong.”

Lesson #4: Learn from mistakes

“Not all the projections that I made were correct,” Gu says. In May 2020, he projected 180,000 deaths in the US by August. “That is much higher than we saw,” he recalls. His testable hypothesis proved incorrect—“and that forced me to adjust my assumptions.”

At the time, Gu was using a fixed infection fatality rate of approximately 1% as a constant in the SEIR simulator. When in the summer he lowered the infection fatality rate to about 0.4% (and later revised it to about 0.7%), his projections returned to a more realistic range.
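Simple arithmetic shows why the fixed rate matters so much. The sketch below assumes the infection fatality rate enters as a plain multiplier on infections, which is a common simplification and not necessarily how Gu's simulator applied it. The infection count is hypothetical, and `projected_deaths` is an illustrative helper.

```python
def projected_deaths(infections, ifr):
    """Expected deaths when a fixed fraction (the IFR) of infections is fatal."""
    return infections * ifr


hypothetical_infections = 1_000_000   # made-up figure for illustration only

# The three rates mentioned in the text: about 1%, 0.4%, and 0.7%.
for ifr in (0.01, 0.004, 0.007):
    deaths = projected_deaths(hypothetical_infections, ifr)
    print(f"IFR {ifr:.1%}: about {deaths:,.0f} projected deaths")
```

Under this assumption, moving from 1% to 0.4% cuts projected deaths by 60% for the same number of infections, so an overestimated constant inflates the entire forecast.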

Lesson #5: Engage critics

“Not everyone will agree with my ideas, and I welcome that,” says Gu, who used Twitter to post his projections and analysis. “I try to respond to people as much as I can, and defend my position, and debate with people. It forces you to think about what your assumptions are and why you think they are correct.”

“It goes back to confirmation bias,” he says. “If I am not able to properly defend my position, then is it really the right claim, and should I be making these claims? It helps me understand, by engaging with other people, how to think about these problems. When other people present evidence that counters my positions, I have to be able to acknowledge when I may be incorrect in some of my assumptions. And that has actually helped me tremendously in improving my model.”

Lesson #6: Exercise healthy skepticism

“I am now much more skeptical of science—and it’s not a bad thing,” Gu says. “I think it’s important to always question results, but in a healthy way. It’s a fine line. Because a lot of people just flat-out reject science, and that’s not the way to go about it either.”

“But I think it’s also important to not just blindly trust science,” he continues. “Scientists aren’t perfect.” It is appropriate, he says, if something doesn’t seem right, to ask questions and find explanations. “It’s important to have different perspectives. If there is anything we’ve learned over the past year, it’s that no one is 100% right all the time.”

“I can’t speak for all scientists, but my job is to cut through all the noise and get to the truth,” he says. “I’m not saying I’ve been perfect over this past year. I’ve been wrong many times. But I think we can all learn to approach science as a method of finding the truth, rather than the truth itself.”