Statecraft

Ten Thoughts on Government Data

Government data often underpins policy debates. Yet anyone who works with it knows how uniquely frustrating it can be. Relative to the private sector, government systems collect data in idiosyncratic ways: they prioritize continuity and legality over ease of use, and they anticipate a narrow set of users. As a result, these datasets can feel impenetrable.

In October 2024, I was trying to understand how international students enter the US workforce: where they move for work, how many of them use programs like Optional Practical Training, and whether they stay in the US after graduating. So, I opened up a dataset from the Department of Homeland Security’s Student and Exchange Visitor Information System (SEVIS). Today this data is available on the OPT Observatory; it’s the most granular public resource available to answer these questions. But it took me over a year to produce. The process of getting there taught me as much about government data as it did anything else.

Thanks to Shadrach Strehle and Jasper Placio for their support in producing this episode.

Below are ten lessons I’ve learned about handling government data:

  1. Administrative data has major gaps. It’s not just that we don’t collect things we should; it’s also that information a system like SEVIS should collect just isn’t in that system.1 While some data gaps result from human error, others are the product of data collection systems that are leaky, or that just don’t exist. We simply cannot know things one might assume we do, like which visa-holders are currently in the country, or the employer of every working international student, because the departure dates and employer addresses of working international students are only present a fraction of the time in SEVIS. The federal government doesn’t know these things either. Failing to adequately maintain records and leaving non-mandatory fields blank both result in inconsistent record-keeping. These gaps occur on every level as we decline to write down valuable information, neglect to write down everything we’re supposed to, and fail to hold on to everything we once wrote down. (A sketch after this list shows what checking for these gaps can look like in practice.)

  2. When something seems off, it often is. Government datasets often have a small number of users, sometimes just a handful of civil servants in this or that agency. This means that inaccuracies can persist unnoticed for a surprisingly long time. If you encounter what seems like a major error in government data, it’s less likely to be a failure of your understanding than you might expect. In 2024, the US undercounted the number of international students by 200,000. The error went unnoticed for months until one diligent user contacted the agency responsible.2 The frequency of and methodology for data collection also change periodically, which leads to results that are technically correct but unintuitive and potentially misleading. Most quantitative disciplines rightly train students not to assume the data is wrong until they’ve scrutinized their own work and their understanding of the data first. But if you’re working with certain kinds of government data, you should be quicker to suspect underlying data issues.

  3. If it’s a question on a form, you can find data on it. Government administrative data is commonly just collated responses to the same questionnaire. Reading the forms that feed into it can tell you what it might contain, and where to find it.3 Since information isn’t always collected where you might expect, learning an agency’s paperwork can save you time, too. While investigating how many H-1B visas go to former international students, and how much they earn, my colleague Jeremy happened to realize that US Citizenship and Immigration Services collects information on someone’s wages and current immigration status when they file an I-129 Petition for a Nonimmigrant Worker. He learned this by talking to someone who knows USCIS paperwork like the back of their hand: an experienced immigration lawyer. Had he not realized it, his analysis wouldn’t have been nearly as rich.

  4. We’re not actually counting. Lots of government data is based on representative samples, and uses statistical methods to reach conclusions about the population at large. But that data is not produced by literally counting the population at large. This introduces assumptions that can easily invalidate your findings if you forget to account for them. The “irreversible demographic fact” claimed by politicians last year, that two million more Americans were employed than in the year prior, was the result of using data in ways the statistical agencies explicitly tell users not to. Jed Kolko describes how this statistic was actually a zero-sum accounting artifact, resulting in part from the fact that population totals are pre-determined by the census, while nativity is not. Since the Current Population Survey measures variable immigrant and non-immigrant populations but is always scaled to match Census totals, any reduction in the reported foreign-born population will necessarily appear as an increase in the native-born population, even if it’s driven by changes in response rates rather than real departures. (A toy calculation after this list makes the mechanism concrete.)

  5. Nobody understands statistics. Trying to elucidate statistical subtleties in a policy context is usually a losing battle, and it’s best to avoid trying. If you absolutely must, assume you’re talking to an audience of fifth graders. Never, ever assume the numbers speak for themselves. Be extremely clear about what you intend to show with the graphs you share, how they could be misinterpreted, and why those misinterpretations are incorrect. Policymakers may take numerical claims and their accompanying interpretation at face value, so choose your words wisely: they could get repeated verbatim. If you want to make a point based on data, stick to publishing graphs with a single red line going up (or down) and to the right. If you want to be honest, include detailed footnotes.

  6. Nobody knows how the whole thing works. Most users of large, complicated government datasets become experts only in narrow parts of them, and there’s rarely a single person who can explain the whole thing. University officials know how to update individual student records in SEVIS, and government officials understand backend processes, like how certain fields are autogenerated, but neither sees the full picture. As a result, hardly anyone ends up drawing connections between the different parts of the system, which means that those who do can provide unique insights. My team at IFP was able to create the OPT Observatory only because of our unusual combination of expertise in immigration law and policy alongside software development, design, and data engineering, which allowed each of us to understand different parts of the dataset and draw novel connections. Even so, it took months of deep collaboration to figure out what was happening in the dataset.

  7. Government data systems were built for administration, not analysis. These systems are designed to help bureaucrats track the processes required to administer a program, which mainly involves answering specific, often rote, questions. They were not made for policymakers who want to synthesize information, or understand how a program works in general. For someone at Immigration and Customs Enforcement, the point of querying SEVIS is closer to “verify that a given student is in active status and is authorized to work” than to “count the number of students working.” These systems act more like audit trails than flexible databases, accreting answers to a fixed list of possible queries. Answering anything outside that list requires creativity, and restructuring the data in ways the system never anticipated. (The sketch after this list contrasts the two kinds of query.)

  8. The trustworthiness of survey data is under threat, making administrative data comparatively more useful than before. In the past, government surveys were often the cleanest, most reliable source of information about the population at large. However, declining response rates and AI-enabled spam now threaten the statistical power and quality of such data, both government and otherwise. The risk of non-response bias in the American Community Survey grows as response rates fall. Meanwhile, the proliferation of AI chatbots, which make it easier than ever to produce spam, requires agencies to watch for disingenuous responses, while also training the public to ignore unsolicited contact, including from legitimate government surveys. In a future where survey data is heavily polluted, administrative records that avoid these issues could become increasingly valuable, despite their gaps.

  9. Organizational incentives can make government data messy. Government data systems are infamously brittle; often, they were last updated decades ago. This forces their government users to develop workarounds that bend the systems in unintuitive ways to achieve their goals. In government especially, a system’s goals can be shaped by complex political incentives that don’t get written down. Understanding those underlying incentives, and the resulting decisions about how information gets stored, is invaluable for deciphering that information.

  10. Which means that being useful requires practitioner knowledge. For any given trend or perceived abnormality in a dataset, someone deep in the bureaucracy likely knows exactly what caused it. Typically, it’s the result of some conscious action within the bureaucracy, like a new regulation, a memo, or a digitization project, and someone dedicated a significant part of their career to enabling it. If you want to discover anything new about such a dataset, you have to find out what others already understand by engaging with their expertise. This means learning from practitioners, who experience firsthand how changes to law, regulation, and habit generate lasting data quirks, and who inherit knowledge of previous quirks from colleagues. Their accumulated knowledge of both policy changes and their implementation makes it readily apparent which data “mysteries” are actually the legacy of changes in user behavior. In SEVIS, for example, changes in the reliability of employer data over time can be explained by a 2008 programmatic change that resulted in additional documentation of employer information for a subset of records. Exploring such a trend can be helpful for your own understanding of a dataset, but it won’t result in genuinely novel insights unless you can distinguish between what you yourself know about the data and what is collectively known among those who know it best.
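
A few of these lessons are easier to see in code. First, the gaps from lesson 1: below is a minimal sketch of a field-completeness check. The file name and column names are hypothetical stand-ins for a SEVIS-style extract, not the actual schema.

```python
import pandas as pd

# Hypothetical SEVIS-style extract; the file and column names are placeholders.
df = pd.read_csv("sevis_extract.csv")

# How often are key fields actually populated?
for col in ["employer_name", "employer_address", "departure_date"]:
    print(f"{col}: {df[col].notna().mean():.1%} populated")

# "Populated" can still mean junk: empty strings, "N/A", "UNKNOWN", etc.
junk = {"", "N/A", "UNKNOWN"}
clean = df["employer_address"].notna() & ~df["employer_address"].isin(junk)
print(f"employer_address, excluding placeholders: {clean.mean():.1%}")
```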
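
Second, the zero-sum artifact from lesson 4. The numbers below are invented purely for illustration; the point is that when the total is fixed in advance, any drop in the weighted foreign-born estimate must show up, one for one, as a rise in the native-born estimate.

```python
# Toy illustration of the zero-sum weighting artifact; all numbers are invented.
TOTAL = 260_000_000  # control total, fixed in advance by Census benchmarks

# Year 1: survey weights allocate the fixed total between the two groups.
foreign_born_y1 = 45_000_000
native_born_y1 = TOTAL - foreign_born_y1

# Year 2: suppose foreign-born respondents simply answer the survey less often,
# so their weighted estimate falls. Nobody has to actually leave the country.
foreign_born_y2 = 43_000_000
native_born_y2 = TOTAL - foreign_born_y2

print(f"{native_born_y2 - native_born_y1:+,}")  # +2,000,000 "new" native-born
```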
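
Finally, the administration-versus-analysis distinction from lesson 7. This sketch contrasts the record lookup the system was built for with an aggregate count it never anticipated; the identifiers and columns are again hypothetical, and note how the aggregate silently inherits the gaps from lesson 1.

```python
import pandas as pd

df = pd.read_csv("sevis_extract.csv")  # same hypothetical extract as above

# Administrative question: "is this particular student active and authorized?"
# A single-record lookup, the kind of query the system was designed to answer.
student = df.loc[df["sevis_id"] == "N0012345678"].iloc[0]
print(student["status"] == "ACTIVE" and bool(student["work_authorized"]))

# Analytical question: "how many students are currently working?"
# An aggregate the system never anticipated, and one that silently depends on
# employer fields being populated -- which, as shown above, they often are not.
working = df[(df["status"] == "ACTIVE") & df["employer_name"].notna()]
print(f"{len(working):,} students with an employer on record")
```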

Many of these have been previously written about by Jennifer Pahlka, or better articulated by IFP’s Distinguished Senior Immigration Counsel Amy Nice, and I’m sure the list could still be longer. “Government data” is an enormous catchall, and nascent efforts to make it accessible, like data.gov, are a promising start to a challenging problem. In the meantime, I’m excited to continue witnessing how today’s extraordinary access to data can help us understand, better than ever, unintuitive truths about the world’s oldest democratic society. Tweet at me to fill in what I missed.

Thanks to Peter Bowman-Davis, Connor Sandagata, and Jeremy Neufeld for their early comments, and Thomas Hochman for inspiring the format of this post.

1. My colleagues have written about this before.

3. Figuring out what paperwork the agency might have is beyond the scope of this piece!
