Research Statement
I am a political methodologist turned senior technology executive. My contributions have helped build the emerging area of computational social science, which intersects data science with politics, economics, and business to produce interdisciplinary research applications. I do not believe big data will save us; in fact, my industry experience has shown me that without the right analytical methodologies to draw clear, actionable insights from it, big data will bury us. Nirvana is the marriage of data science with politics, economics, and business to create responsible solutions that improve life just as much as they extend the boundaries of our knowledge. These goals have driven me to build profitable businesses that were later acquired, serve the President at The White House, lead Fortune 500 organizations at Cisco and Amazon—and to earn my Ph.D. in political methodology from Columbia.
My research program innovates methodological techniques in natural language processing and causal inference. During the time I’ve split between academia and industry, I’ve applied these innovations to scholarly problems in the study of political polarization and democratic norms, as well as to practical business matters in advertising and quantitative finance.
Leading research efforts in topics as diverse as American political institutions, political behavior, monetary policy, development economics, distributed computing, digital business models, and quantitative finance has brought me joy and intellectual purpose. I have been privileged to publish their fruits in journals ranked at the top of their fields, including the American Political Science Review, the Journal of Politics, and Energy. According to Google Scholar and Altmetrics/PlumX, these publications have gathered more than 300 journal citations, 24,000 direct views, 2,300 social engagements (likes/comments/shares), and 30 press mentions in the past 5 years. Graduate-level computational social science courses have adopted the papers as course materials at Auburn (POLI 8970), U. Chicago (MACS 30122), Indiana University (POLS Y661), Rochester (PSCI 232W), and Nazarbayev University (PLS 516). Dissertations from institutions including U.C.L.A., U.C. Berkeley, U. of Minnesota, U. of Michigan, U. Penn., Georgetown, Florida State University, Syracuse, U. of Toronto, U. of Essex, Ghent University, Catholic University, Loyola U. Chicago, Arizona State U., and The Ohio State U. cite the papers.
Public venues have invited me to speak or organize conversations as a keynote presenter, panelist, or committee member on more than 40 occasions, to a total audience of 13,000. My speaking focuses on the application of responsible artificial intelligence to research in business, politics, and economics. Private interests have also invested capital and equity in my research. The Wall Street Journal, Financial Times, Forbes, Bloomberg, Quartz, J.P. Morgan’s sell-side research publications, McSweeney’s, Futurist, Persuasion, and TechCrunch have featured applications of my research in business and society, and I am a U.S. patent-holder for my research applications in generative programming and systematic trading. Companies that use machine learning for AI applications in advertising, computer networking, quantitative finance, education, and call center automation have placed me on their boards of advisors, or in R&D leadership positions. My computer vision software doc2text is in the top 0.1% of open-source projects on GitHub, and has spawned more than 50 downstream application forks.
Intellectual Contributions
My research program designs methodological techniques and applies them to topics in political polarization. The program has included methods in natural language processing, causal inference, explainability, and other specialties when the work has called for it. My projects always include (i) the development, enrichment, storage, and processing of up to petabyte-scale databases; (ii) the application of a data science method without which the hypotheses would be untestable; and (iii) generalization of the data science method for application in other fields, especially business. These characteristics distinguish my work as interdisciplinary work in computational social science.
Natural Language Processing
One facet of my research program pioneers applications of Sutherland’s delta, a statistic to test for and explain measurement construct validity in analyses that rely on natural language processing (NLP). We used this method to validate our study of nationalization published in the Journal of Politics (JOP), a top-three political science journal. To test hypotheses using text data, researchers generate a codebook of phrases they use to estimate (score) each actor’s text on a selected dependent variable, generating a distribution of scores with one score per actor in the text. Data scientists automate this process, using supervised machine learning on a separate source of text that already has labels for the dependent variable. The machine learning approach generates the codebook from phrases that are likely to occur when the other texts are high or low on the dependent variable.
Prior to my research there was no way to empirically address the validity of a researcher’s algorithmically generated codebook ex ante, as there was no notion of the ground-truth codebook for the unlabeled text. Sutherland’s delta abstracts codebooks as artifacts of latent trait models. It reconstructs the ground truth of the unobserved codebook using a modification of Latent Dirichlet Allocation (LDA), and compares it to the researcher-generated codebook on the basis of topical Kullback–Leibler divergences. This approach is superior to those using the Akaike Information Criterion or penalized log-likelihood on pooled databases, because Sutherland’s delta does not require predicted labels or other information sharing across text sources. For example, Sutherland’s delta rejects the validity of the codebook used by Gentzkow and Shapiro (2010) to conclude that voters bias the news (and not vice versa).
Sutherland’s delta is extensible as an a priori test for measurement construct validity in any research application that uses two or more naturally distinct text corpora. Testing for validity is increasingly important for research applying NLP to study social sentiment (for politics, systematic trading applications, chatroom hate speech), information ontologies (including those used in knowledge search and for verification of fake news), and business applications (like call center automation and advertising personalization). Future innovation will abstract the a priori data generating process from text to other unstructured sources like images, and use alternative generative specifications to replace LDA.
Application to Polarization, Nationalization, and Democratic Norms
There can be little doubt that Americans today are deeply divided on their ideological and partisan attachments, their many issue preferences—even their values. What does this polarization imply for government institutions, representation, and democratic norms? How might we mitigate its effects? Should we blame the people, or blame the elites? Scholarly interest in these questions has grown since I published in Political Behavior with Jon Rogowski to provide structure to the measurement and drivers of affective polarization—the gap between individuals’ positive feelings toward their own political party and negative feelings toward the opposing party. We used experimental and observational methods to show that a voter’s affective candidate evaluations are directly responsive to elite polarization.
My research program dives deeper into elite polarization and its effects on voters with three projects. These projects are “evergreen,” generating annual data for regular publication.
State Politics and Policy in the United States
The first project uses annually delivered state policy agendas to study ideological (and non-ideological) interactions between political elites. In the last eight years, I’ve collaborated with Dan Butler to collect gubernatorial State of the State (SOTS) addresses, which are like State of the Union (SOTU) addresses, but for the states. With a team of six graduate and undergraduate research assistants, we have assembled coverage of the SOTS dating back to the 1880s, with full coverage from 1960–2022. The research’s first publication in the Journal of Politics demonstrates how state policy agendas have become nationalized by comparing the SOTS to the SOTU (requiring the application of Sutherland’s delta). The analysis reveals that state agendas have become more similar to each other over time, and that state agendas are more similar to the national agenda (as laid out in the SOTU address). The nationalization of U.S. politics is also showing up in the nationalization of the U.S. policy sphere. In 2017, the National Science Foundation’s GRFP recognized this work with an Honorable Mention (a national honor).
We would have been unable to achieve this analysis without three methodological innovations. The first was a practical solution to extract machine-readable text from the old and poorly scanned speech transcripts. My computer vision software doc2text uses Hough line transforms and Canny edge detection to extract messy text blocks from images when other software fails. The software, cited in the JOP, ranks in the top 0.1% of open-source software projects on GitHub. The second was a generalized approach to the explanation of coherent generative topic models. It uses a hyperparameter grid search to find the topic model that optimizes out-of-sample fit against interpretability, using a blended objective that combines topic coherence metrics (like Phi, Lift, Relevance, and FREX) with hand-coded reliability scores. The third was a non-parametric approach to uncertainty estimation for effects estimated using upstream NLP artifacts. The method uses a multi-modal battery of NLP techniques to simulate the set of pre-analysis decisions the researcher could have made, yielding a distribution of potential effects for any final estimate. These extensible methodological innovations gave reviewers confidence that we were not cherry-picking an NLP approach.
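The blended-objective search in the second innovation can be sketched generically. The scoring callables below are placeholders — in the actual work the fit and coherence scores are estimated from the topic models themselves, and the function and parameter names here are my own illustration — but the structure of the search is the same: scan a hyperparameter grid and keep the configuration that maximizes a weighted blend of out-of-sample fit and interpretability.

```python
from itertools import product

def blended_model_search(param_grid, fit_score, coherence_score, alpha=0.5):
    """Hypothetical sketch of a blended hyperparameter grid search: score
    each configuration on out-of-sample fit and on interpretability (topic
    coherence / hand-coded reliability), then keep the configuration that
    maximizes alpha * coherence + (1 - alpha) * fit."""
    keys = sorted(param_grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = alpha * coherence_score(cfg) + (1 - alpha) * fit_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The design choice worth noting is the blend itself: optimizing held-out fit alone tends to select models with many narrow, hard-to-interpret topics, so the coherence term pulls the search back toward models a human coder can validate.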
Further research in progress suggests that the similarity between the state and national agendas predicts the degree to which national factors influence gubernatorial elections. This would suggest that voters still engage in issue voting and that the nationalization of gubernatorial elections represents a rational response to the choices that voters face. Another stream in progress considers institutional structures and whether state executives are effective in enacting their policy agendas (and, whether they are punished when they are ineffective). My initial findings suggest that nationalization of state legislatures has generally decreased the production of democratically responsive, salient laws.
Democratic Norms and Elite Polarization in the American Congress
The second project examines behavioral interactions between elites using text and network data, to draw conclusions on polarization, democratic norms, and effective lawmaking in Congress. The research processes terabytes of Congressional Record data to detect Member speech characteristics and behaviors, such as when Members of Congress interrupt each other. The research’s first publication in the American Political Science Review applied these interruption data to demonstrate how women are more likely to be interrupted than men in Congressional committee hearings. When discussing women’s issues, women are twice as likely to be interrupted. Because committee hearings are where most work in Congress is done, this decreases the ability of women to have their policy expertise heard.
This work would not have been possible without my program’s methodological innovation in exponential random graph models (ERGMs). ERGMs are generative models for network data that enable the estimation of dyad and edge effects. The ERGMs enabled me to estimate endogeneity-robust standard errors for our findings, which are clouded in generalized linear models due to the endogeneity of these dyadic interactions. The ERGMs also enabled me to include Member and edge covariates to find that mixed-gender interactions were driving the effect, rather than women interrupting one another. These innovations assuaged reviewers’ concerns about the veracity of the findings. The computation required for these models was infeasible using existing open-source software and high-spec computer builds, because the networks contained 1,264 Member nodes and 8,207,000 weighted edges. The software also did not implement marginal effects estimation. I wrote new software to make these computations feasible. This software will extend our ability to estimate more complex networks in the future.
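To give a flavor of the dyadic setup, here is a minimal sketch of the maximum pseudolikelihood (MPLE) approach to a toy ERGM with an edge term and a mixed-gender dyad covariate. This is emphatically not the production software described above: real ERGM estimation adds structural dependence terms and MCMC-MLE, and every name below is my own illustration. The sketch only shows how dyads become rows of a logistic-regression problem.

```python
import numpy as np

def ergm_mple_design(adj, gender):
    """Hypothetical sketch: one row per directed dyad (i, j), i != j, with an
    intercept (the edge term) and a mixed-gender covariate; the response is
    whether the interruption tie i -> j exists."""
    n = adj.shape[0]
    rows, y = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rows.append([1.0, float(gender[i] != gender[j])])
            y.append(float(adj[i, j]))
    return np.array(rows), np.array(y)

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-ascent logistic fit, standing in for an MPLE solver."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += lr * X.T @ (y - p) / len(y)
    return beta
```

A positive coefficient on the mixed-gender covariate would indicate that cross-gender dyads are more likely to carry interruption ties — the pattern the full models, with proper dependence terms, were built to test at the scale of millions of edges.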
Freedom and Self-Censorship in the American Public
The third project turns to the role of polarization in interpersonal dynamics, and how polarization affects democratic norms such as freedom of speech. A person “self-censors” when they feel the cost of expressing their true opinion is so punitively high that they would rather keep their mouth shut, even in the face of potentially harmful dominant opinions. My research documents the grave finding that levels of self-censorship today are triple their level during the Red Scare of the 1950s—a time when simply appearing to sympathize with communism could ruin you or land you in jail. General interest in the paper that reports these findings, forthcoming in Political Science Quarterly with Jim Gibson, has been staggering: its Social Science Research Network pre-print is in the top 5% of all papers on the site. In nationally representative survey data collected since 1954, affective polarization covaries with the loss of considerable quantities of freedom to speak, at both the time-series and cross-sectional levels. We propose Noelle-Neumann’s spiral of silence (whereby people increasingly self-censor because others’ self-censorship makes them feel they are in the minority) as the mechanism, setting up further, more rigorous research. Examining the role of the spiral of silence in self-censorship will require the development of network data and measurement instruments that do not exist today, at a scale far beyond what today’s ERGMs can feasibly handle. I anticipate developing new, extensible data science methods to advance this research.
Causal Inference and Broader Interdisciplinary Impact
Interdisciplinary, theoretically driven data science methodologies are at the heart of my research program. In addition to the broad work discussed earlier, my program has included workstreams in causal inference, explainable machine learning, development economics, and quantitative finance. These workstreams have contributed to scholarly knowledge and social welfare, and have been economically productive in business. The first stream applies causal inference methodologies to run more than 30 RCTs treating more than 170 million individuals. These RCTs are complex, involving voter network data, DMA block-randomization, petabyte-scale clickstream data, addressable or connected TV, GIS-based catchment area analysis, or ex-post design accommodations. The second stream develops explainable machine learning models, including a multi-modal approach to explain high-weight covariates in K-dimensional clustering. The application of this method to terabyte-scale IoT data from developing-world solar devices, published in Energy, helped reveal how to structure customer incentives to promote rural electrification. The third stream predicts monetary policy changes as a function of central bank communications. It applies these predictions to earn returns through multi-factor models in quantitative trading applications.
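The DMA block-randomization mentioned above — assigning whole media markets, rather than individuals, to treatment or control — can be sketched in a few lines. The names and the simple half-split below are hypothetical illustrations; the actual RCT infrastructure adds stratification, power calculations, and ex-post design accommodations.

```python
import random

def block_randomize(units, block_of, seed=7):
    """Hypothetical sketch of block (cluster) randomization: units (e.g.
    households) are grouped by block (e.g. DMA), and whole blocks are
    assigned to treatment or control, so no block is split across arms."""
    rng = random.Random(seed)
    blocks = sorted({block_of[u] for u in units})
    rng.shuffle(blocks)
    treated = set(blocks[: len(blocks) // 2])  # half of blocks treated
    return {u: ("treatment" if block_of[u] in treated else "control")
            for u in units}
```

Randomizing at the block level keeps treatment from spilling over between neighbors in the same media market, at the cost of fewer effective units — which is why the scale of these experiments (170 million treated individuals across many DMAs) matters for statistical power.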