The decisions that humans make can be extraordinarily costly. The wars in Iraq and Afghanistan were multi-trillion dollar decisions. If you can improve the accuracy of forecasting individual strategies by just a percentage point, that would be worth tens of billions of dollars. Yet society does not invest tens of billions of dollars in figuring out how to improve the accuracy of human judgment. That seems really odd.
That’s a quote from today’s interviewee, who has made his career helping the intelligence community predict the future better. In this interview, we discuss:
Which prediction methods perform the best?
How does IARPA create tech for American spies?
What technologies give democracies an advantage over autocracies?
Could the Internet have been designed better?
Our interviewee, Jason Matheny, championed research into human judgment and forecasting at the R&D lab for the intelligence community: the Intelligence Advanced Research Projects Activity, or IARPA, which he directed from 2015-2018.
As one bio notes, Matheny spent time in academia — Oxford University, Princeton University, and the Johns Hopkins Applied Physics Laboratory — and the startup world before joining IARPA. He developed new forecasting capabilities at IARPA’s “Office for Anticipating Surprise,” has spent time in the Biden White House, and now runs the RAND Corporation, a major think tank.
Jason, what makes the ARPA model unique?
The basic ARPA model is hiring entrepreneurial program managers who are given a budget and autonomy to pick “high-risk, high-reward” research problems. The program managers select researchers to fund, based either on their individual judgment or the judgment of a small selection panel. Then those researchers compete in parallel against a common set of metrics, usually with some disciplined approach to ending funding every year for teams that aren't succeeding.
The three core and unusual elements are the program managers, research tournaments, and defunding research that isn’t succeeding. They distinguish the ARPA model from conventional science and technology funding efforts.
What makes a good program manager? Would the ARPA model work without stellar, entrepreneurial program managers?
Program managers are recruited by office directors, typically on the basis of technical knowledge, an entrepreneurial mindset, creativity, risk tolerance, and predicted ability to succeed in pitching and executing a program.
The program manager candidates make a program pitch based on the Heilmeier questions [guiding principles for choosing ARPA programs]. They go through multiple rounds of review, first with the office director, who serves as a gatekeeper. There are other characteristics that matter for program managers; typically, they need to get a security clearance.
Program managers are given very little training or management. The main performance measure is whether you get your program approved. That process is like a cross between a dissertation defense and a pitch to a venture capitalist. The goal is ultimately to persuade this group of people that your idea could be transformative if successful.
Could a project be too risky or difficult to qualify?
Some projects are too risky. If the premise is, “If successful, this would be transformative, but in order to be successful I have to violate the following laws of physics,” or, “I have to solve a problem that Nobel laureates haven't managed to solve in the last 50 years,” that's going to be deemed too risky. There’s some Goldilocks zone of appropriate probabilities, probably under 50% and over 5%.
There are then problems that aren't risky enough, that are seen as “not ARPA-hard,” even if those problems, if solved, could be transformative.
Really, just from a policy point of view, what you would love is low-risk, high-reward research. You shouldn't disqualify a research project just because it turns out that it would be easier to solve. Ultimately, you should be optimizing based on the cost-benefit of a particular research investment.
ARPAs do not tend to go after transformative but easy problems, even if the reason they haven't been solved is because they're neglected or ignored.
Is that because ARPAs aim to fill a hole in the funding space that wouldn't otherwise be filled?
That's right. Across the portfolio of federal agencies that fund R&D, the comparative advantage for the ARPAs is to focus on higher-risk research. The point of having entrepreneurial program managers with relatively little oversight is to give them the ability to go after high-risk research.
When does an IARPA project happen in public?
IARPA developed a process that we called research technology protection: Before any program started, we would do a cost-benefit analysis of making the program secret or not. It included questions like, “Where are the researchers to solve this problem? Are they in academia?” In which case, most of them don't have security clearances, so we wouldn't be able to access them.
If the program is secret, “Does it require a mix of academics and contractors? Does it require crowdsourcing itself to solve? Do we want to pose the problem as a prize challenge, to give it to the world to solve?” You want to figure that out in advance and decide whether the costs of secrecy exceed the benefits or vice versa.
And it really depends. I thought of some problems as much better to keep secret. But then you actually do the math, and you find the benefits of secrecy aren't so significant, and we wouldn't be able to solve the problem in secret. With others, the opposite is true.
Can you give me an example of research that fits in the target area for an ARPA? And then research that would be transformative, but doesn't quite fit?
Stealth technology, for example, would be in the high-risk, high-reward space. That was a historic DARPA (and Air Force) investment. We had models for significantly reducing the radar cross section of an aircraft, but those models produced designs that we couldn't figure out how to fly. That was technically risky. But the impact on military operations, if successful, was going to be transformative. It would mean that you could have aircraft that could penetrate an adversary's air defense zones. That's a classic case of an ARPA-hard problem that benefited from having program managers who were constructive contrarians, despite being told, “It just won't be aerodynamically feasible to design an aircraft with such a small cross section.”
An example of a project that isn't ARPA-hard: solving a problem that the commercial sector is going to solve in a year anyway, like a better camera for a cell phone. It's not clear that it's super risky because advances are happening in these commercial cameras all the time. The counterfactual, of the world in which an ARPA doesn't invest in that, is that ultimately those advances get made anyway.
What might be another example in the other direction?
Faster-than-light travel. Maybe there are tachyons that are achieving faster-than-light speeds. But figuring out how to leverage that is probably beyond the risk tolerance of an ARPA program.
Will you outline what the Heilmeier questions are?
What are you trying to do?
How is it done today? What are the limits of current practice?
What is new in your approach? And why do you think it will be successful?
Who cares? What difference will it make?
What are the risks?
How much will it cost?
How long will it take? What are the midterm and final exams to check for success? (Basically: how do you evaluate whether you're succeeding or failing?)
Those questions are deceptively simple, but they help to guide one's approach to designing a program.
You highlighted the counterfactual thinking that goes into deciding what qualifies as a good ARPA project. Do the ARPAs have a framework for thinking through counterfactuals?
So I love the ARPA model. I think it's a great alternative to a lot of traditional science and technology funding. But there are also some blind spots. For instance, the Heilmeier questions don't have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going toward solving a particular problem, and their likelihoods of success.
We also tend not to think much about strategic move and countermove. Particularly in national security, where different actors are competing against one another in technology, we probably should think more about baking security and safety into the technologies we create. For example, are there ways of making a technology less prone to reverse engineering, to theft, to misuse? Because many things that the U.S. government funds get stolen, even when we classify them. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.
So you should think in advance: “Will I be better or worse off having developed this technology if it's stolen?” By making certain research programs known, you create a security dilemma: you prompt foreign intelligence services into thinking that you're planning to use some new offensive weapon. You need to be thoughtful about what information can be misinterpreted by others if they're aware of an investment. We also tend to overestimate our ability to keep secrets.
Why is that?
In part, when you're operating in a classified environment, you're so aware of the protections: the guards, the guns, the gates, the special rooms with air gapping and no electronic devices. The environment creates so many inconveniences to go in and out that you assign a high level of confidence to it. But there are so many historical instances of our protections failing that we should be more realistic about the base rate of things getting stolen. That has implications for the research that we pursue and how we pursue it; for example, you might develop a technology that would actually create an asymmetric disadvantage if it were used against you.
That strategic analysis isn't inherent in the Heilmeier questions, whether at DARPA or at the other ARPAs.
In your time leading IARPA, you made some adjustments to the Heilmeier questions.
We continued to use the Heilmeier questions, but we wanted to get at how competitors are going to respond to a technology being developed. So we developed some additional questions:
What's your estimate about how long it would take a major nation competitor to weaponize this technology after they learn about it? What's your estimate for a non-state terrorist group with resources like those of Al Qaeda in the first decade of the century?
If the technology is leaked, stolen, or copied, would we regret having developed it? What first mover advantage is likely to endure after a competitor follows?
How could the program be misinterpreted by foreign intelligence? Do you have any suggestions for reducing that risk?
Can we develop defensive capabilities before or alongside offensive ones?
Can the technology be made less prone to theft, replication, and mass production? What design features could create barriers to entry?
What red team activities could help answer these questions? Whose red team opinions would you particularly respect?
Red teaming is a really important check on the strategic value of a particular technology, and it's not done frequently enough. We typically start research programs in the ARPAs without having somebody assigned to thinking, “Imagine we're successful. What are you going to do in response to this technology that we've just created? How are you going to counter it?”
Why isn’t that a common mode of analysis?
We don't do enough red teaming in general. Sometimes it's just awkward, because you're trying to beat the stuff that you're investing a lot of money and time in building. You're not highly motivated to see the ways that it breaks. There are institutional and even personal incentives not to explore the ways in which your own investment is potentially vulnerable. Also, it takes time and investment, and that trades off against efforts that are seen as more within the “job jar” of an R&D agency.
But in fact, it’s one of our best ways of identifying vulnerabilities early and creating countermeasures. The ARPAs could say, “It’s not our job to figure out how to break our own technology.” But it's not clear any agency is better qualified. You need folks who are going to be sincere about trying to break it, who aren't friends with the program manager, who are sufficiently separated as honest brokers.
How do you institute honest broker red teams bureaucratically? I'm guessing that if we sit together at lunch every day, I may not make as good of a red team as someone who has a neutral or adversarial relationship with you.
Most ARPA programs will have testing and evaluation teams. Those roles are typically played by a federally funded research and development center (FFRDC) or a university-affiliated research center, which are able to leverage private-sector expertise but have reliable enough funding that they don't have to worry about irritating the government. They can just be trusted to tell the truth. They're also temperamentally selected to tell the truth even if it's unpopular. Places like RAND, IDA, the Center for Naval Analyses, and others are magnets for nerds who are honest to a fault.
The ARPAs then put the FFRDCs to work in evaluating whether the program is successful. They are typically not put to work on the other side, which is, “Now pretend that you're our enemy and are figuring out how to counter this technology, or figure out how you're going to put it to use once you’ve stolen it.”
I think the institutions already exist to carry out that work: they're just not typically asked to do it. Again, in the cyber realm, we have organizations that focus on red teaming and pentests, and there are competitions, like Defcon, in which that red team activity is celebrated.
We don't do that as systematically in national security. There's no permanent red team at the National Security Council (NSC) that tries to anticipate how China will respond to a particular US action. It might be incredibly valuable for that to exist.
It's funny you mentioned this bureaucratic weakness at thinking like our adversaries. This newsletter takes its name from a book called Informing Statecraft by Angelo Codevilla, a long-time intelligence community member and critic. His big critique of the Cold War-era intelligence community was that it was unable to put itself in the shoes of the KGB.
What’s the relationship between IARPA and the intelligence community like?
So IARPA is similar to DARPA: it has the same overall model of program managers, programs run like research tournaments, and a rigorous process for stopping funding for things that aren't working. The main difference is that our end user is the intelligence community, which is a collection of 18 agencies that collect, analyze, interpret, and protect information for US decision makers. The goal of IARPA is to fund advanced research that ultimately improves decision advantage.
The intelligence community is maybe one-tenth the size of the Defense Department. And because things don't need to be mass-produced (you don't need as many pieces of equipment for the intelligence community as for the DoD), the problems that are worked on can be one-offs. You might just need to build a single device, or five of a device.
Because there's often a greater premium on secrecy, there's more thinking about the security around technology efforts. In some cases, that includes preventing it from even being recognized as an area of research.
The technology base is also different from DARPA's, which draws on a lot of established defense contractors. The intelligence community will often have contractors that are smaller and less well known, for security reasons [for more on these reasons, see IARPA: A Modified DARPA Innovation Model]. All of IARPA's program managers require a TS/SCI clearance. That's not true of DARPA, where most of the program managers might have secret or top secret clearances.
Does the smaller purchasing base of the intelligence community limit what IARPA can usefully work on?
In some ways it can open it up because if you don't have to mass produce something, then the technical space of options offers more degrees of freedom.
The problem of collection is typically in collecting electrons or photons, so those end up being physics and electrical engineering problems. There are also chemistry problems: we're increasingly concerned about misuses of biology, so figuring out how to analyze chemical signatures of biological activity is important. And the analysis side certainly does involve things like statistics, computer science, and machine learning research.
But ultimately, decision making is predominantly human, and the psychology of cognitive biases and heuristics is really important. There's been more room for investing in human judgment research and cognitive psychology than at the other ARPAs. Then there's the problem of information protection, and there it's back to physics and engineering problems. On the protection side, it's a cat and mouse game of trying to figure out how you would counter all of the clever things that you're doing to collect information.
In your time at IARPA, you championed the focus on human judgment and tools to augment human forecasting. Tell us about that.
I became interested after reading the book Expert Political Judgment by Phil Tetlock, which is one of the more ambitious pieces of research on the accuracy of human judgment about policymaking. Part of the reason for working on those problems at IARPA was that they seemed neglected, relative to what the societal benefit could be for US decision making in national security and foreign policy.
Over a 20-year period, Tetlock evaluated the accuracy of forecasts from a bunch of different participants about world events and found that the accuracy wasn't substantially better than random chance, in many cases. The decisions that humans make can be extraordinarily costly. The wars in Iraq and Afghanistan were multi-trillion dollar decisions. If you can improve the accuracy of forecasting individual strategies by just a percentage point, that would be worth tens of billions of dollars.
Yet society does not invest tens of billions of dollars in figuring out how to improve the accuracy of human judgment. That seems really odd. You really can substantially improve human judgment through a few interventions that are pretty robust, over different cohorts of people, over different periods of time, over different kinds of analytic questions.
Did you encounter pushback to prioritizing human judgment projects at IARPA?
I was lucky that I had great bosses. They weren't completely convinced that the methods were necessarily going to work, but they recognized the value of methods that would work and were willing to take the risk. The Director of National Intelligence, Jim Clapper; Robert Cardillo as Deputy Director for Intelligence Integration; and Stephanie O'Sullivan as Principal Deputy Director: the three of them supported and protected this research.
We also took a particular approach: setting up a repeatable process to evaluate methods, rather than picking which horse was going to win in advance. That created an experimental test bed in which you could evaluate a bunch of different analytic methods for their accuracy. Some of them could be based on human judgment, some on machine learning, and some on a combination.
Then you could run them all in a forecasting tournament to see whether they're making accurate forecasts about real world events. The goal really was to create an observatory in which different kinds of methods could be compared to one another and tested for accuracy.
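[Editor's note: To make that concrete, here is a minimal sketch of how such a tournament might score competing methods, using the Brier score, a standard accuracy metric for probability forecasts. The method names, probabilities, and outcomes below are invented for illustration and are not IARPA data.]

```python
# Minimal sketch of tournament-style scoring: rank competing forecasting
# methods by mean Brier score on a common set of resolved questions.
# Method names, probabilities, and outcomes are purely illustrative.

def brier(prob: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (prob - outcome) ** 2

# forecasts[method][question_id] = probability assigned to "event happens"
forecasts = {
    "crowd_average":        {"q1": 0.70, "q2": 0.20, "q3": 0.60},
    "statistical_baseline": {"q1": 0.55, "q2": 0.40, "q3": 0.50},
    "single_expert":        {"q1": 0.90, "q2": 0.10, "q3": 0.30},
}
outcomes = {"q1": 1, "q2": 0, "q3": 1}  # how each question actually resolved

scores = {
    method: sum(brier(p, outcomes[q]) for q, p in qs.items()) / len(qs)
    for method, qs in forecasts.items()
}

# Lower mean Brier score = more accurate method.
for method, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{method:22s} mean Brier = {score:.3f}")
```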
In at least one of those tournaments, crowdsourced forecasts seemed to significantly outperform traditional statistical forecasts or experts. To what extent can that be leveraged by the intelligence community?
Here's the math intuition: if you think of a human judgment as (truth + random error + systematic bias), then when you take the average of a bunch of human judgments, you tend to cancel out the random error. And if those judgments are diverse, based on different sources of information and different mental models, that will also tend to cancel out the systematic bias, so your estimate gets closer to the truth.
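[Editor's note: As a toy illustration of that decomposition (not anything from an IARPA program), the simulation below models each judgment as truth plus random noise plus an individual bias. Averaging a diverse crowd drives the error down; a bias shared by everyone never cancels, no matter how large the crowd.]

```python
# Toy illustration of the (truth + random error + systematic bias) intuition.
# All numbers are made up for illustration.
import random

random.seed(0)
TRUTH = 0.6  # the "true" value we are trying to estimate

def one_judgment(shared_bias: float) -> float:
    individual_bias = random.gauss(0, 0.10)  # diverse biases, mean ~0
    noise = random.gauss(0, 0.15)            # random error
    return TRUTH + shared_bias + individual_bias + noise

for n in [1, 10, 100, 1000]:
    # Diverse crowd: individual biases cancel out along with the noise.
    diverse = sum(one_judgment(shared_bias=0.0) for _ in range(n)) / n
    # Homogeneous crowd: everyone shares the same +0.15 bias; it never cancels.
    shared = sum(one_judgment(shared_bias=0.15) for _ in range(n)) / n
    print(f"n={n:5d}  diverse-crowd error={abs(diverse - TRUTH):.3f}  "
          f"shared-bias error={abs(shared - TRUTH):.3f}")
```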
This goes all the way back to Francis Galton and his experiments of getting crowds to estimate the weight of an ox. The work that we did at IARPA in the ACE program and in forecasting tournaments that followed supported this notion.
There is a way of improving on that: finding subsets of people who are consistently more accurate than others, “superforecasters,” and then taking the average of their judgments. It's still, though, taking an average of relatively large numbers of judgments, as opposed to simply taking the super-duper-forecaster who's consistently #1. Methods of picking a single individual didn't perform well, even if you cherry-picked the single best individual for a particular field.
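[Editor's note: A minimal sketch of that aggregation logic, under assumptions of our own (the 40% cutoff and all scores are invented): rank forecasters by historical accuracy, then average the current forecasts of the top slice rather than betting on the single best individual.]

```python
# Sketch: aggregate the forecasts of the historically most accurate slice of a
# crowd, rather than relying on the single best individual. All values invented.

# (name, historical mean Brier score, current forecast for some question)
forecasters = [
    ("alice", 0.12, 0.70),
    ("bob",   0.15, 0.65),
    ("carol", 0.18, 0.80),
    ("dave",  0.30, 0.40),
    ("erin",  0.35, 0.95),
]

# Take the most accurate 40% (the "superforecasters") by historical Brier score.
ranked = sorted(forecasters, key=lambda f: f[1])
top = ranked[: max(1, int(0.4 * len(ranked)))]

aggregate = sum(f[2] for f in top) / len(top)
single_best = ranked[0][2]

print(f"superforecaster average: {aggregate:.2f}")    # pooled judgment
print(f"single best forecaster:  {single_best:.2f}")  # noisier, per the interview
```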
That's an important insight into analytic accuracy that not many institutions have adopted. That's peculiar, because it's a research finding that's been pretty robust across different subjects, and many institutions should be highly motivated to increase their accuracy.
So why don't we see these crowdsourcing mechanisms more widely used? Why don’t institutions regularly run internal research tournaments to find their own superforecasters?
I don't have good answers to that question. It's something that Robin Hanson has a pretty cynical hypothesis about: that the leaders of most institutions aren't primarily motivated by increasing accuracy, but are more motivated by protecting their own status or that of senior leaders. Crowdsourcing is inherently disruptive to leadership.
This was Codevilla's view about other failures of the intelligence community, that it's hard to resist bureaucratic capture because of these fundamental human challenges.
I think you're right. Institutions in general have reasons to be anxious about either threats to their credibility or their legitimacy. Having your homework graded is uncomfortable.
Almost everybody would prefer to grade their own homework. Senior leaders are typically in a BOGSAT [Bunch Of Guys Sitting Around A Table] process, where they're deliberating and reaching some conclusion. Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.
Has IARPA had success in pushing the intelligence community to do less BOGSAT and more leveraging of crowds, averages, and statistical insights?
It did. From about 2009 to 2019, we ran a crowdsourced analysis effort at IARPA that became a grassroots activity. Thousands of intelligence analysts across the agencies used crowdsourcing tools to make forecasts or analytic judgments.
But to my knowledge, there isn't an activity like that right now. There are activities like that in other intelligence communities. For example, the UK intelligence community has this effort called Cosmic Bazaar, which was prompted by the IARPA effort and includes crowdsourced judgments. Working level analysts love it. But unless you've got top cover and a dedicated budget that's protected, it's going to fade away, because managers aren't highly motivated to keep this activity going.
Our strongest proponents were the folks working on a hard analytic question who wanted to see what other analysts thought about it, and wanted to understand their reasoning. “I want to see whether the probability assigned to conflict in Kashmir is going up or down over the next month. If some analysts think it’s going up, what evidence are they looking at?” So you could turn it into a warning tool, like a heart rate monitor for the intelligence community to understand what’s driving the anxiety of analysts.
These kinds of crowdsourcing tools have to be contrarian. You have to give bragging rights to folks who are correct about something that is unpopular. You need a leaderboard of folks who are accurate when the majority is wrong. If it's just a popularity index, it's not going to be as analytically useful.
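[Editor's note: One way to implement that kind of leaderboard, sketched below with an invented scoring rule: forecasters earn credit only on questions where they were on the right side while the crowd consensus was on the wrong side.]

```python
# Sketch of a "contrarian credit" leaderboard. A forecaster scores a point only
# when the crowd consensus was on the wrong side of a resolved question and
# they were on the right side. The rule and data are invented for illustration.

# records[question] = (outcome, {forecaster: probability of "yes"})
records = {
    "q1": (1, {"alice": 0.70, "bob": 0.30, "carol": 0.30}),  # crowd leaned "no"; answer was "yes"
    "q2": (0, {"alice": 0.20, "bob": 0.60, "carol": 0.10}),  # crowd leaned "no"; answer was "no"
}

credit: dict[str, int] = {}
for outcome, probs in records.values():
    consensus = sum(probs.values()) / len(probs)
    crowd_wrong = (consensus >= 0.5) != bool(outcome)
    if not crowd_wrong:
        continue  # no contrarian credit when the majority already had it right
    for name, p in probs.items():
        if (p >= 0.5) == bool(outcome):  # individually on the right side
            credit[name] = credit.get(name, 0) + 1

for name, points in sorted(credit.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {points} contrarian call(s)")
```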
What kinds of technology might we want to avoid developing?
Cyberweapons are a case of technology where the United States might be more vulnerable than our competitors, because we have a bigger attack surface and a more open society. The same might be true for advances in biology. We have a strong taboo against the use of biological weapons, and a legal set of commitments against developing them. We had a unilateral moratorium on biological weapons before there was an international treaty on it, and not every country believes they're taboo. Some countries have active offensive biological weapons programs. If a biological attack against the United States required rapid vaccination or antiviral distribution to the entire population, we might see what we went through with COVID, which was low rates of vaccine acceptance. So bioweapons are another potential technology that we should be extremely cautious about developing, like cyberweapons.
Our degree of openness and interaction means that infectious agents, whether biological or digital, get transmitted quickly in our societies. We're decentralized and counter-authoritarian in ways that make the distribution of countermeasures quite difficult.
What might we especially want to develop, because it gives us an asymmetric advantage?
Democracies have an asymmetric advantage in developing and using encryption, or other privacy-enhancing technologies. Anything that involves some amount of dissent or loss of control, like large language models, would be an example. The regulation around language models in China has been driven by CCP anxiety over models spitting out anti-party rhetoric or providing historical references that are not approved.
Some areas of asymmetry are probably due to historical inertia, not an inherent democratic advantage. For example, the moats around semiconductor manufacturing are due to historical events: the United States had a lead in semiconductors, going back to the 1950s, and the design and tool-making firms grew up here. The manufacturing ended up moving to other democratic states that were close allies and partners.
By the way, this question is really understudied. People used to say the Internet would be an asymmetric advantage for democracy. Not necessarily! Policymakers and strategists should think about which future technologies will, at the margins, provide a greater advantage to democracies vs. autocracies. I'd love to see more work on that.
You mentioned the Internet, which everybody assumed was a boon to democracy. That turned out to be more complicated. How do you make good predictions when future effects are very hard to know?
I think we should treat it skeptically and make sure we’re thinking from both sides of the game board. I don't know if anybody at DARPA building the Internet thought, “Suppose I'm an authoritarian state. How am I going to use the Internet? Suppose I'm a revanchist state that wants to disrupt some democratic power. How am I going to launch disinformation attacks?”
For a question like, “What technologies provide an asymmetric advantage for democracies,” you should have somebody play the role of the autocrat, who thinks about how to use that technology to break you or to advance their own objectives.
That goes for all differential technology development. We have bets on which kinds of technologies are defense-dominant. We should have red teams that are assessing whether that defense dominance is real.
What would have been different about the development of the Internet if we’d known in advance that authoritarians would use it to spread disinformation and firewall it in their own regimes?
There have been a few papers written about how we could have designed the Internet better from scratch. Having packets that include headers that are not completely vulnerable to anonymization, obfuscation, or misattribution, for instance. Having some way of assigning credibility to some nodes vs. others. Having a legal framework that would precede the development of opportunities for regulatory capture by large entities.
I don't think we would have gotten it perfect, and I certainly don't think federal funders would have said, “Eh, pass. We're not going to develop this technology.” But we could have baked in certain defenses against disinformation in how we designed the Internet and its early regulatory structure.
The same thing is true of some new technologies. We should have been more thoughtful about DNA synthesizers. If each one has more destructive potential than a uranium enrichment facility, you probably want to figure out some way of creating baked-in security and safety. We didn't do that, and now anyone could recreate smallpox, or something worse, with commercially available DNA synthesizers for under $100,000. At the time that DNA synthesizers were being developed, we didn’t even ask the question, “Is there any way of putting security into this technology from the start?”
Now we're having this discussion with AI tools, about whether you can build alignment into models when they're being built. Can you do red teaming much earlier on? A positive example of red teaming is that most of the major AI labs are sincerely trying to figure out how their tools could be misused. I think we're going to see a lot of baked-in guardrails, like reinforcement learning from human feedback, rather than retrofitted safety features.
You founded CSET, a national security think tank. At IFP, my think tank, a team member noticed that we’re not in the habit of making PDFs, but the national security apparatus loves PDFs. What are the best kinds of outputs for a think tank to produce for the national security apparatus? Any tips and tricks?
It can vary, because the policy audience is so broad. Some of them want a one-pager. Others are willing to absorb a 200-page report, or their staff will.
In some ways, we had been operating under the RAND model of producing reports. We hired from RAND, and the style of the reports looked like RAND's. It's the same font. But there is a generational change going on: more policymakers grew up with phones, and expect apps and dashboards. They don't necessarily want a report; they want an annotated map, or to toggle through a bunch of scenarios by changing an assumption and understanding its dynamic effects on an outcome. We'll end up creating more dashboards, more tech observatories where you can dig into the data and run scenario analysis.
If you had full authority over embedding forecasting work in the US government, what would be top of your list?
Ensuring that a place like the National Security Council has its key questions on a prediction market or some other forecasting platform. Having worked there, I've seen the time pressure that folks are operating under: you don't have time to do research, or to read very much. You're operating under sleep deprivation and hunger on some of the most consequential decisions that the country faces.
Also, the intelligence that reaches the NSC has been whittled down, and you don't have time to ask for a coordinated analytic product, like a national intelligence estimate, because that takes months. It's very hard for intelligence briefers to collect information from enough analysts to indicate whether they could be making a catastrophic mistake.
So, an efficient way of crowdsourcing forecasts that are conditional ("If I pursue Action A, this is the likely consequence"): I would have loved access to a tool like that.
You’ve talked about the difficulty of sustaining an organization that rewards high-risk activities. How do you stop an organization from becoming bureaucratically captured?
A lot of it is just hiring great people. At almost every enterprise, 90% of the variance is determined by the people who are hired, retained, and promoted. The ARPAs work primarily because of unusual hiring authorities that allow them to optimize for sharp, creative, constructive contrarians who think differently about problems.
It helps to have Congress on your side, because Congress is ultimately appropriating funds for these agencies. A track record really helps. DARPA’s important role in stealth, GPS, and the Internet means it gets great congressional support. IARPA got some important wins, particularly in the classified domain, which helped.
It’s important to clearly communicate to leaders why you need a cell of misfits, and why they should allow them to operate in a way that is different from other parts of an organization. Scientists and engineers tend to have a stronger libertarian bent, and you want a place where eccentric scientists and engineers can work effectively. We need nerds, and institutions that attract nerds. That often means letting people wear shorts and flip flops and worrying less about the cultural conventions that are common in other parts of an organization.
In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can't figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it's probably up to us to solve it.
In fact, when problems are neglected, it's Bayesian evidence that the problem is likely to be more cost-effective to solve than problems that are popular. If a problem is popular, lots of people are already trying to solve it, and if it still hasn't been solved, it's probably really hard.
Does the fact that IARPA projects are often secret help attract the right talent? Is it a barrier to hiring people who want their name to be on public research papers?
There is a selection effect: The people who care most about getting their names on journal articles are not going to be attracted. The types of people who might come are folks who were academics and were really disappointed by the rat race for tenure, or the incentive systems within academic journals that prevented publication of important work.
They want to solve important problems for which there might not be much academic reward. Or they’re coming from industry and don’t just want to develop the next form of ad targeting that improves click-through rates by 5%. I was really amazed by the quality of people at IARPA. It has a really strong self-selection effect.
What surprised you about the policy world?
So much of the policy world is based on interpersonal trust. For many policies, it's going to be years before you see whether they were successful, and in a group effort the effect of an individual is difficult to isolate. So much of selection and advancement for personnel is just based on interpersonal trust. It's really important then to invest in trust and to try not to degrade it.
There are a few things that people who want to have an impact in policy and improve the world should probably assign more weight to than they would intuitively. And one of them is to be kind. Because if you're in a stressful policy environment, your first instinct might not be to spend time on being kind, but you should, because these people almost certainly could be much wealthier doing something else. There's already a selection effect, where the people doing this job tend to be people who have a level of altruism that's unusual.
Also, people really like being with kind people more than they do unkind people. Industry and academia tend to have a greater tolerance for brilliant jerks, in part because you can measure their contribution and say, well, John's a jerk, but he's a plus player. He's a 100x programmer or whatever. You can't really do that in the policy world.
I found much less evidence for that style of personality succeeding in the policy world compared to those other parts of society. The policy world is much more like West Wing or Madam Secretary than like House of Cards. It tends to be much more prosocial and compassionate.
Thanks for joining us, Jason.
All your questions have been awesome. I wish that your interviews had been available 20 years ago, because they would have been so helpful to me.
For a longer treatment of red teaming, see Micah Zenko’s book Red Team: How to Succeed By Thinking Like the Enemy. For an in-depth account of IARPA’s human forecasting work under Matheny, see IARPA’s New Director Wants You to Surprise Him, from 2015. Matheny has written about biodefense countermeasures he’d like to see in a co-authored paper, Incentives for Biodefense Countermeasure Development.
My intuition is that no research worth its guts would pass Heilmeier questions. That's something you should ask engineering projects, not researchers. (See https://omer.lingsite.org/blogpost-the-ills-of-academic-linguistics-part-2-the-grantification-of-everything/index.html for reasons why.)