In this document, we will discuss the methods presented in the paper “Do forecasts of bankruptcy cause bankruptcy?” (Papakostas et al. 2023). The paper uses Bayesian additive regression trees (BART) (Chipman, George, and McCulloch 2012). We will reproduce the paper utilizing the stochtree software suite (Herren et al. 2025). An abbreviated version of the abstract is reproduced below1:
1 The actual paper also compares the presented methods to alternatives such as “E-values”, investigates so-called “moderating effects” more deeply, and explores the limits of the methodology’s usefulness in a lengthy simulation study. These are great things to read about, but for the sake of brevity, we omit them here.
It is widely speculated that auditors’ public forecasts of bankruptcy are, at least in part, self-fulfilling prophecies in the sense that they might actually cause bankruptcies that would not have otherwise occurred. This conjecture is hard to prove, however, because the strong association between bankruptcies and bankruptcy forecasts could simply indicate that auditors are skillful forecasters with unique access to highly predictive covariates.
Although this question could be addressed with an instrumental variables analysis, the availability of valid instruments is hotly debated. In this paper, we investigate the causal effect of bankruptcy forecasts on bankruptcy using nonparametric sensitivity analysis. We contrast our analysis with two alternative approaches: a linear bivariate probit model with an endogenous regressor, and a recently developed bound on risk ratios called E-values. Additionally, our machine learning approach incorporates a monotonicity constraint corresponding to the assumption that bankruptcy forecasts do not make bankruptcies less likely. Finally, a tree-based posterior summary of the treatment effect estimates allows us to explore which observable firm characteristics moderate the inducement effect.
The methods were motivated by studying the effect of going concern opinions on bankruptcy, though they should be generalizable to a wide variety of problems.
Background
A “going concern opinion” is an assessment by an auditor that a firm is at risk of going out of business in the coming year. Here a “concern” refers to a firm, and “going” refers to staying in, as opposed to going out of, business. By U.S. accounting law, a public company receiving a going concern opinion must disclose it in their annual filings. See (Maurer 2020) for a recent discussion of going concern opinions in the news. Once issued, a going concern opinion may directly contribute to a firm’s bankruptcy risk, for example, by inducing lenders to pull lines of credit. As reported in (Maurer 2020):
Companies that receive a going-concern audit opinion may be subjected to more rigorous covenant terms or downgrades in their credit ratings, said Anna Pinedo, a partner at law firm Mayer Brown. Fractured relationships with customers could also strengthen a business’s competitors, she said.
Estimating the magnitude of such an “inducement effect” is complicated by the unavailability of the auditors’ private information to the analyst. That is, in addition to publicly available firm information, auditors have access to “private information” gleaned from confidential documents and via firsthand knowledge of undocumented attributes such as the firm’s corporate culture. This document presents methods whose development was motivated by the question: do going-concern opinions help to predict bankruptcy because they incorporate the auditor’s private information or because of an inducement effect? This is a textbook example of causal inference where the potential unobserved confounders are particularly picturesque: what do auditors know that we (the analysts) do not?2
2 Cryptic…
Terminology
Denote the treatment variable by \(G_i\) for “going concern” so that \(G_i = 1\) for the \(i\)th firm in our sample if that firm disclosed a going concern opinion in the prior year.
Denote the outcome variable \(B_i\) for “bankrupt” so that \(B_i = 1\) if the \(i\)th firm filed for bankruptcy. In terms of potential outcomes (Rubin 1974), we are interested in two quantities, \(B^1_i\) and \(B^0_i\): the outcome of firm \(i\) had it received the treatment and had it not received the treatment, respectively. Only one of these potential outcomes is observed.
Throughout the document, we will call the \(B^0\) potential outcome the “prognostic” case (what would have happened sans treatment). We also use the terms “treatment” and “inducement” interchangeably to describe the causal estimand that relates \(B^1\) and \(B^0\)3.
3 This can be either the difference or the ratio of the two. The ratio is useful in this problem because some companies are at such low risk of bankruptcy that differences will be minuscule. At the same time, ratios can be very large when the denominator is small, even when the causal effect is practically negligible (for example, an increase in bankruptcy probability from 0.1% to 0.5% is a five-fold risk ratio but only a 0.4 percentage point difference).
What’s the issue?
The fundamental problem of causal inference (Holland 1986) is that \((B^1, B^0)\) are never observed simultaneously; rather, only one or the other is observed. Consequently, the conditions under which causal effects can be estimated must be carefully assessed and their plausibility debated. There are three widely used approaches for estimating average treatment effects: randomization, regression adjustment (broadly construed to include matching and propensity score based methods), and instrumental variables analysis. To briefly review:
In a randomized controlled trial the treatment variable — \(G\) in the present context — is independent of the potential outcomes \(B^1\) and \(B^0\); in this case \(E(B^1) = E(B \mid G = 1)\), the right hand side of which is readily estimable from observed data (and likewise for the \(G = 0\) case).
When a randomized experiment is not possible (such as in the present example) one instead may hope to find a set of control variables \(\mathbf{x}\) for which \(E(B^1 \mid \mathbf{x}) = E(B \mid G = 1, \mathbf{x})\) and \(E(B^0 \mid \mathbf{x}) = E(B \mid G = 0, \mathbf{x})\), in which case treatment effects can be estimated by estimating these conditional expectations via regression modeling. This condition is called conditional ignorability or, alternatively, \(\mathbf{x}\) is said to satisfy the back-door criterion.
A third possibility is that a sufficient set of controls is unavailable, but an instrument for the treatment assignment is available. An instrument is a variable that is causally related to the treatment but not otherwise associated with the response variable. In the current context, an instrumental variable (IV) would be one that affects the probability that an auditor issues a going concern opinion without directly affecting bankruptcy probabilities or sharing common causes with bankruptcies.
In the present context, none of these three approaches is available. A sufficient set of controls is certainly not readily available, and the existence of a valid instrument is doubtful because firms choose their own auditing agency, rendering auditor attributes endogenous. Although there are other approaches — such as regression discontinuity design, see (Imbens and Lemieux 2008) and (Thistlethwaite and Campbell 1960), difference-in-differences, see (Card and Krueger 1994), and the synthetic control method, see (Abadie, Diamond, and Hainmueller 2010) and (Abadie and Gardeazabal 2003) — they apply in idiosyncratic settings that do not fit the bankruptcy inducement problem4.
4 So it goes.
Figure 1: A visual to keep in mind. We have both observed and unobserved confounding.
Traditional approaches
A well-known model that has been used for problems similar to the one described here is the bivariate probit with endogenous predictor; see Section 15.7.3 of (Wooldridge 2010). This model can be expressed in terms of bivariate Gaussian latent utilities \(Z_g\) and \(Z_b\) that relate to going concern opinions and bankruptcy, respectively: \[
\begin{pmatrix}
Z_{g,i}\\
Z_{b,i}
\end{pmatrix}\stackrel{\text{iid}}{\sim}\mathcal{N}(\mathbf{\mu}, \Sigma)
\qquad \mathbf{\mu}=\begin{pmatrix}
\beta_0+\beta_1\mathbf{x}_i\\
\alpha_0+\alpha_1\mathbf{x}_i
\end{pmatrix}
\qquad
\mathbf{\Sigma}=\begin{pmatrix}
1&\rho\\
\rho&1
\end{pmatrix}
\tag{1}\]
The premise of this model is that \(\rho\) reflects the influence of private information available to the auditor but not the researcher, and \(\mathbf{x}_i\) represents covariates of a company that are available to both the auditor and the researcher. The observed binary indicators, \(G\) and \(B\), relate to these latent utilities via \[\begin{align}
G_i&=\textbf{1}\{Z_{g,i}\geq 0\}\\
B_i&=\textbf{1}\{Z_{b,i}\geq -\gamma G_i\}
\label{align1}
\end{align}\] The coefficient \(\gamma\) governs the strength of the inducement effect.
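To make Equation 1 concrete, here is a minimal R sketch (not code from the paper) that simulates \((G, B)\) from the bivariate probit with an endogenous going concern indicator; the covariate, the coefficients, \(\rho\), and \(\gamma\) below are illustrative values.
set.seed(1)
n <- 10000
x <- runif(n, -1, 1)
rho <- 0.4; gamma <- 0.2                 # latent correlation and inducement effect (illustrative)
beta0 <- -0.5; beta1 <- 1.0              # going concern equation (illustrative)
alpha0 <- -1.0; alpha1 <- 1.0            # bankruptcy equation (illustrative)

# Correlated latent utilities (Z_g, Z_b) with unit variances and correlation rho
z  <- rnorm(n)
Zg <- beta0 + beta1 * x + z
Zb <- alpha0 + alpha1 * x + rho * z + sqrt(1 - rho^2) * rnorm(n)

G <- as.numeric(Zg >= 0)                 # going concern issued
B <- as.numeric(Zb >= -gamma * G)        # issuing a going concern lowers the bankruptcy threshold
mean(G); mean(B)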
The basic identification strategy can be motivated geometrically. Let \[\mathbf{\Pi}=\begin{pmatrix}
\pi_{01}&\pi_{11}\\
\pi_{00}&\pi_{10}
\end{pmatrix}\] where \(\pi_{jk}=\Pr(B=j, G=k)\), which describes the four scenarios resulting from our equations for \(G\) and \(B\). The figures below give visual representations of the \(\mathbf{\Pi}\) matrix.
Figure 2
Figure 3
Note in Figure 2 and Figure 3 that \(\mathbf{\mu}\) determines the location (the center of the ellipse) and the correlation \(\rho\) determines the tilt and concentration of the probability contours. Inducement introduces an extra parameter, \(\gamma\), which lowers the threshold for bankruptcy.
The bivariate probit entails ellipse-shaped probability contours, where (when \(\gamma=0\)) the probability mass in each quadrant corresponds to one of the four combinations of the observed binary variables \((B, G)\). The shaded region in the right panel, labeled “A”, is subtracted from the upper left quadrant and added to the upper right quadrant when a going concern is issued, reflecting the endogeneity of the going concern variable. The parameters \(\rho\) and \(\gamma\) are estimable because changes in the shape of the ellipses, governed by \(\rho\), apportion probability differently than do changes in the width of the A region, governed by \(\gamma\).
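As a quick numerical companion to the geometry, the following R sketch computes the implied \(\mathbf{\Pi}\) matrix at a single \(\mathbf{x}\) from illustrative values of \(\mathbf{\mu}\), \(\rho\), and \(\gamma\); none of these numbers come from the paper.
library(mvtnorm)   # for pmvnorm

mu    <- c(0.2, -0.5)                     # (E[Z_g], E[Z_b]) at this x (illustrative)
rho   <- 0.4
gamma <- 0.2
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# pi_jk = Pr(B = j, G = k); first coordinate below is Z_g, second is Z_b
pi11 <- pmvnorm(lower = c(0, -gamma),  upper = c(Inf, Inf),    mean = mu, sigma = Sigma)
pi01 <- pmvnorm(lower = c(0, -Inf),    upper = c(Inf, -gamma), mean = mu, sigma = Sigma)
pi10 <- pmvnorm(lower = c(-Inf, 0),    upper = c(0, Inf),      mean = mu, sigma = Sigma)
pi00 <- pmvnorm(lower = c(-Inf, -Inf), upper = c(0, 0),        mean = mu, sigma = Sigma)
round(c(pi00 = pi00, pi01 = pi01, pi10 = pi10, pi11 = pi11), 4)  # sums to 1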
Python code to reproduce the bivariate normal ellipse plots
import scipyimport numpy as npimport matplotlib.pyplot as plimport scipy.stats as st#number of observations could matter here#rho is 0.420 from Franks codedata = np.random.multivariate_normal((1., 1), [[1, .420], [.420, 1]], 1000000)x = data[:, 0]y = data[:, 1]xmin, xmax =-4, 3ymin, ymax =-3, 3.2# Peform the kernel density estimatexx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]positions = np.vstack([xx.ravel(), yy.ravel()])values = np.vstack([x, y])kernel = st.gaussian_kde(values)f = np.reshape(kernel(positions).T, xx.shape)fig = pl.figure()ax = fig.gca()ax.set_xlim(xmin, xmax)ax.set_ylim(ymin, ymax)# Contourf plotcfset = ax.contourf(xx, yy, f,cmap='Blues')#colors='white')## Or kernel density estimate plot instead of the contourf plotax.imshow(np.rot90(f), cmap='Blues', extent=[xmin, xmax, ymin, ymax])# Contour plotcset = ax.contour(xx, yy, f, colors='k', linewidths=0.5)# Label plot#how did we get gamma estimate?#it is the coefficient on gamma!#sorry going concern!gamma=0.19ax.clabel(cset, inline=1, fontsize=10)ax.set_xlabel('Bankruptcy Score')ax.set_ylabel('Going Concern Score')ax.hlines(0, 0, 6, lw=1, color='black')ax.hlines(0, 0, -6, lw=1, color='black')ax.vlines(0, 0, 6, lw=1, color='black')ax.vlines(0, 0, -6, lw=1, color='black')#ax.vlines(-1*gamma, 0, 4, lw=1, color='black')ax.text(-2, -2, '$\pi_\mathrm{00}$', fontsize=12)ax.text(2, -2, '$\pi_\mathrm{10}$', fontsize=12)ax.text(-2, 2, '$\pi_\mathrm{01}$', fontsize=12)ax.text(2.132, 2.018, '$\pi_\mathrm{11}$', fontsize=12)#ax.text(-1, 2.5, 'A', fontsize=12)##ax.text(-1*gamma-0.24, -0.335, '$-\gamma$', fontsize=12)import matplotlib.transforms as mtransformstrans = mtransforms.blended_transform_factory(ax.transData, ax.transAxes)y1positive=y>0d = scipy.zeros(len(y))#ax.fill_betweenx(y, -1*gamma, 0, where= y>=d, # facecolor='grey', alpha=0.5)#pl.ylabel('Going Concern score: $\Phi^{-1}(Pr(G\mid \mathbf{x}))$')pl.xlabel('Bankruptcy score: $\Phi^{-1}(Pr(B\mid \mathbf{x}))$')pl.show()fig = pl.figure()ax = fig.gca()ax.set_xlim(xmin, xmax)ax.set_ylim(ymin, ymax)# Contourf plotcfset = ax.contourf(xx, yy, f,cmap='Blues')#colors='white')## Or kernel density estimate plot instead of the contourf plotax.imshow(np.rot90(f), cmap='Blues', extent=[xmin, xmax, ymin, ymax])# Contour plotcset = ax.contour(xx, yy, f, colors='k', linewidths=0.5)# Label plot#how did we get gamma estimate?#it is the coefficient on gamma!gamma=0.19ax.clabel(cset, inline=1, fontsize=10)ax.set_xlabel('Bankruptcy Score')ax.set_ylabel('Going Concern Score')ax.hlines(0, 0, 6, lw=1, color='black')ax.hlines(0, 0, -6, lw=1, color='black')ax.vlines(0, 0, 6, lw=1, color='black')ax.vlines(0, 0, -6, lw=1, color='black')#ax.vlines(-1*gamma, 0, 4, lw=1, color='black')ax.text(-2, -2, '$\pi_\mathrm{00}$', fontsize=12)ax.text(2, -2, '$\pi_\mathrm{10}$', fontsize=12)ax.text(-2, 2, '$\pi_\mathrm{01}$', fontsize=12)ax.text(2.132, 2.018, '$\pi_\mathrm{11}$', fontsize=12)ax.text(-.7, 2.5, 'A', fontsize=12)#ax.text(-1*gamma-0.24, -0.335, '$-\gamma$', fontsize=12)import matplotlib.transforms as mtransformstrans = mtransforms.blended_transform_factory(ax.transData, ax.transAxes)y1positive=y>0d = scipy.zeros(len(y))ax.fill_betweenx(y, -1*gamma, 0, where= y>=d, facecolor='grey', alpha=0.5)pl.ylabel('Going Concern score: $\Phi^{-1}(Pr(D\mid \mathbf{x}))$')pl.xlabel('Bankruptcy score: $\Phi^{-1}(Pr(Y\mid \mathbf{x}))$')pl.show()
Re-representation of the bivariate probit model
In this section we propose our new approach for machine learning-based sensitivity analysis by generalizing the bivariate probit model given in Equation 1. We begin by defining the joint probability of treatment and outcome as
\[\begin{equation}
\Pr\left(B, G\mid \mathbf{x}\right)=\int_{\mathbb{R}} \Pr\left(B\mid \mathbf{x},U=u,G\right)\Pr\left(G\mid \mathbf{x},U=u\right)f(u)\,\text{du}
\label{eq2}
\end{equation}\]
for latent variable \(U\). In this formulation, \(U\) has two special properties. First, it is assumed to be the orthogonal component of the private information in the sense that \(U \perp X\), hence \(\mathbf{x}\) does not appear in \(f(u)\). Second, \(U\) is assumed to be complete, in the sense that \(\Pr(B \mid \mathbf{x}, u, G)\) can be interpreted causally in \(G\), because \(U\) is a sufficient control variable. That is, \(\Pr(B^1 \mid \mathbf{x}, u) = \Pr(B \mid \mathbf{x}, \text{do}(G = 1), u) = \Pr(B \mid \mathbf{x}, G = 1, u)\) and similarly for \(G = 0\); accordingly, the inducement effect for firm \(i\) is
\[
\tau(\mathbf{x}_i)=\Pr(B^1_i=1\mid \mathbf{x}_i)-\Pr(B^0_i=1\mid \mathbf{x}_i)=\int_{\mathbb{R}}\left[\Pr\left(B=1\mid \mathbf{x}_i,U=u,G=1\right)-\Pr\left(B=1\mid \mathbf{x}_i,U=u,G=0\right)\right]f(u)\,\text{du}.
\]
Because the outcome and treatment are both binary, we can expand this probability into its four constituent parts. For convenience, we specify a probit link, yielding
\[
\begin{align}
\Pr(B=1\mid \mathbf{x},U=u,G=1)&=\Phi\left(b_1(\mathbf{x})+u\right),\\
\Pr(B=1\mid \mathbf{x},U=u,G=0)&=\Phi\left(b_0(\mathbf{x})+u\right),\\
\Pr(G=1\mid \mathbf{x},U=u)&=\Phi\left(g(\mathbf{x})+u\right).
\end{align}
\]
Therefore, in terms of \(f\), \(b_1\), \(b_0\) and \(g\), the individual inducement effect for firm \(i\) is \[\begin{align}
\tau(\mathbf{x}_i)=\int_{\mathbb{R}} \Phi\left(b_1(\mathbf{x})+u\right)f(u)\text{du} -\int_{\mathbb{R}} \Phi\left(b_0(\mathbf{x})+u\right) f(u)\text{du}
\label{treateq}
\end{align}\]
and we denote the sample average inducement effect as \(\bar{\tau} = \frac{1}{n}\sum_{i=1}^{n}\tau(\mathbf{x}_i)\). Importantly, the orthogonality and completeness of \(U\), as well as the choice of the probit link, are not substantive assumptions, as \(U\) is unobserved and \(b_1\), \(b_0\) and \(g\) are nonparametric functions of \(\mathbf{x}\). Rather, these assumptions define \(U\) and give the specification of \(f(\cdot)\) meaning; the choice of \(f\), therefore, is a substantive assumption (as it is in the bivariate probit model as well).
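As a small numerical illustration of the integral above, the sketch below computes \(\tau(\mathbf{x}_i)\) at a single \(\mathbf{x}_i\), assuming \(f(u)\) is \(N(0, 0.5^2)\) and taking hypothetical values for \(b_1(\mathbf{x}_i)\) and \(b_0(\mathbf{x}_i)\).
f  <- function(u) dnorm(u, mean = 0, sd = 0.5)   # assumed distribution of the hidden confounder
b1 <- -0.5                                       # hypothetical value of b1(x_i)
b0 <- -1.1                                       # hypothetical value of b0(x_i)

tau_i <- integrate(function(u) (pnorm(b1 + u) - pnorm(b0 + u)) * f(u),
                   lower = -Inf, upper = Inf)$value
tau_i   # individual inducement effect at this x_i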
This formulation entails that as \(u\rightarrow -\infty\), the probability of bankruptcy approaches 0, regardless of whether the treatment is administered or not. As \(u\rightarrow \infty\), the probability of bankruptcy approaches 1. The special case in which \(U\) is degenerate at \(0\) corresponds to no unobserved confounding, and the inducement effect can then be computed directly from the observed joint probabilities.
Finally, because \(G\) and \(B\) must have a valid joint distribution at each \(\mathbf{x}\) value, we have the following system of equations defining our data generating process; this is the main model:
\[
\begin{align}
\Pr(B=1, G=1\mid \mathbf{x})&=\int_{\mathbb{R}} \Phi\left(b_1(\mathbf{x})+u\right)\Phi\left(g(\mathbf{x})+u\right)f(u)\,\text{du},\\
\Pr(B=1, G=0\mid \mathbf{x})&=\int_{\mathbb{R}} \Phi\left(b_0(\mathbf{x})+u\right)\left[1-\Phi\left(g(\mathbf{x})+u\right)\right]f(u)\,\text{du},\\
\Pr(B=0, G=1\mid \mathbf{x})&=\int_{\mathbb{R}} \left[1-\Phi\left(b_1(\mathbf{x})+u\right)\right]\Phi\left(g(\mathbf{x})+u\right)f(u)\,\text{du}.
\end{align}
\]
We choose the link function \(\Phi\) to be the standard normal cumulative distribution function for reasons to be discussed soon.
The left hand side of the system above — the reduced form5 parameters — can be estimated from the observed data. Any of a host of machine learning classification methods can be used to obtain estimates of these probabilities. Here we focus our attention on BART for two reasons: one, we can impose monotonicity so that going concerns can only increase the probability of bankruptcy, and two, we obtain a Bayesian measure of uncertainty based on Markov chain Monte Carlo sampling methods.
5 By reduced form, we mean parameters that are functions of the observable data. Estimation then follows from your choice of regression model.
What remains is to solve for \(b_1(\cdot),b_0(\cdot), g(\cdot)\), the structural, or causal, parameters. To do so, we take a numerical approach, minimizing the sum of the squared distances between the three left-hand and right-hand pairs in the main model above (Equation 2):
\[
\begin{align}
\underset{b_1,\,b_0,\,g}{\arg\min}\;
&\left[\Phi^{-1}\!\left(\widehat{\Pr}(B=1,G=1\mid \mathbf{x})\right)-\Phi^{-1}\!\left(\int_{\mathbb{R}} \Phi\left(b_1(\mathbf{x})+u\right)\Phi\left(g(\mathbf{x})+u\right)f(u)\,\text{du}\right)\right]^2\\
+&\left[\Phi^{-1}\!\left(\widehat{\Pr}(B=1,G=0\mid \mathbf{x})\right)-\Phi^{-1}\!\left(\int_{\mathbb{R}} \Phi\left(b_0(\mathbf{x})+u\right)\left[1-\Phi\left(g(\mathbf{x})+u\right)\right]f(u)\,\text{du}\right)\right]^2\\
+&\left[\Phi^{-1}\!\left(\widehat{\Pr}(B=0,G=1\mid \mathbf{x})\right)-\Phi^{-1}\!\left(\int_{\mathbb{R}} \left[1-\Phi\left(b_1(\mathbf{x})+u\right)\right]\Phi\left(g(\mathbf{x})+u\right)f(u)\,\text{du}\right)\right]^2.
\end{align}
\]
Although it is unclear that the main model has a unique solution in \(b_1\), \(b_0\), \(g\), numerical solvers converge readily in our experience. Heuristically, as a convex combination of monotone functions, each of the individual integrals is likely to be nearly linear over much of its domain. Note that the use of the normal inverse CDF simply ensures that the range of our objective function is unbounded; we observe that this improves numerical stability of our solver.
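The sketch below illustrates this back-out step at a single \(\mathbf{x}\): given three reduced-form joint probabilities (the numbers are made up), it minimizes the squared distance on the \(\Phi^{-1}\) scale over \((b_0, b_1, g)\), with monotonicity enforced by writing \(b_1 = b_0 + \exp(\delta)\) as described later in the document; the choice of \(f(u)\) here is an assumption.
f <- function(u) dnorm(u, sd = 0.5)        # assumed distribution of the hidden confounder
p11 <- 0.10; p10 <- 0.05; p01 <- 0.15      # hypothetical reduced-form joint probabilities

# Right-hand sides of the main model for candidate (b1, b0, g)
joint <- function(b1, b0, g) {
  c(integrate(function(u) pnorm(b1 + u) * pnorm(g + u) * f(u), -Inf, Inf)$value,
    integrate(function(u) pnorm(b0 + u) * (1 - pnorm(g + u)) * f(u), -Inf, Inf)$value,
    integrate(function(u) (1 - pnorm(b1 + u)) * pnorm(g + u) * f(u), -Inf, Inf)$value)
}

# Squared distance on the qnorm (Phi-inverse) scale; par = (b0, delta, g), with b1 = b0 + exp(delta)
obj <- function(par) {
  b0 <- par[1]; b1 <- b0 + exp(par[2]); g <- par[3]
  sum((qnorm(c(p11, p10, p01)) - qnorm(joint(b1, b0, g)))^2)
}

fit <- optim(c(-1, 0, -1), obj, method = "Nelder-Mead")
c(b0 = fit$par[1], b1 = fit$par[1] + exp(fit$par[2]), g = fit$par[3])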
We refer to this process as modular because it requires fitting the reduced form model just one time. Sensitivity of the causal estimates to different choices of \(f\) can be assessed independently using the same estimates (or posterior samples) from a single reduced form model fit.
Connecting the models
Conveniently, the bivariate probit model and the more flexible model we suggest in Equation 2 can actually be related to one another. This is because of the chosen “link function”, \(\Phi(\cdot)\), we mentioned earlier:
it certainly looks like we could average over the hidden variable \(U\) and obtain a bivariate normal distribution that looks just like the bivariate probit model above.
Defining \(b_0(\mathbf{x}) = \alpha_0 + \alpha_1 \mathbf{x}\), \(b_1(\mathbf{x}) = \alpha_0 + \alpha_1 \mathbf{x} + \gamma\), and \(g(\mathbf{x}) = \beta_0 + \beta_1 \mathbf{x}\) shows how we can match the mean terms between the models. It is more difficult to relate the hidden variable \(U\) to the \(\rho\) parameter (which governs the hidden confounding in the bivariate probit model), but it is doable. To begin, we invoke a common statistical trick and introduce two new variables, \(\tilde{G}\) and \(\tilde{B}\), that are defined conditional on \(U\). The goal here is to find the distribution of \(U\) that relates \(\tilde{G}\) and \(\tilde{B}\) to \(Z_{g,i}\) and \(Z_{b,i}\) in the bivariate probit formulation above. If we look at the \(\Pr(B=1, G=1\mid \mathbf{x},U)\) equation, we can write:
\[
\begin{align}
\tilde{G} &= g(\mathbf{x}) + U + \varepsilon_{1},\\
\tilde{B} &= b_1(\mathbf{x}) + U + \varepsilon_{2},\\
\Pr(B=1, G=1\mid \mathbf{x}, U) &= \Phi\left(b_1(\mathbf{x})+U\right)\Phi\left(g(\mathbf{x})+U\right) = \Pr\left(\tilde{B}\geq 0,\ \tilde{G}\geq 0\mid \mathbf{x}, U\right),
\end{align}
\]
where \(U\sim N(0, \sigma^2)\) and \(\varepsilon_{1}\) and \(\varepsilon_{2}\) are independent standard normal random variables (i.e. \(\varepsilon_{1}, \varepsilon_{2}\sim N(0,1)\)).
We could calculate that integral using properties of the normal distribution, but the transformation of variables we introduced makes the calculation easier. If we choose a certain variance for \(U\)6, then we can match the mean vectors and covariance matrices of the original bivariate probit and this new formulation defined conditional on \(U\). To do so, we begin by noticing that the \(\varepsilon\) terms contribute nothing to the means, as we want, and neither does \(U\), which is defined to have mean 0. Standard calculations give \(\text{var}(\tilde{G})=\text{var}(\tilde{B})=1+\sigma^2\). For the off-diagonal term, i.e. the covariance between \(\tilde{G}\) and \(\tilde{B}\), we find that:
\[
\text{cov}(\tilde{G},\tilde{B})=\text{cov}\left(U+\varepsilon_1,\, U+\varepsilon_2\right)=\text{var}(U)+\text{cov}(U,\varepsilon_2)+\text{cov}(\varepsilon_1,U)+\text{cov}(\varepsilon_1,\varepsilon_2)=\sigma^2,
\]
6 Under the important assumption that it is distributed \(N(0,\sigma^2)\).
where the zero covariance terms are due to the independence of \(U\) and the \(\varepsilon\) terms, and of the \(\varepsilon\) terms with each other. The mean terms are deterministic and therefore have zero covariance, so we can ignore them rather than writing out the tedious calculations showing that they all contribute nothing.
To check this, run the following code and ensure that the diagonals equal \(2^2+1=5\), and the off diagonals equal \(2^2=4\)7.
7 We will have some “Monte Carlo” simulation error, but given the large sample size, it should be pretty close.
Click here for full code
N = 100000
sigma = 2
eps1 = rnorm(N, 0, 1)
eps2 = rnorm(N, 0, 1)
U = rnorm(N, 0, sigma)
G = U + eps1
B = U + eps2
cov_mat = matrix(c(round(var(G),3), round(cov(G,B),3), round(cov(G,B),3), round(var(B),3)), nrow=2)
cov_mat = data.frame(cov_mat)
colnames(cov_mat) = c(paste0(expression(tilde(G))), 'B~')
rownames(cov_mat) = c('G~', 'B~')
cov_mat
tilde(G) B~
G~ 5.006 3.995
B~ 3.995 4.994
Looks good. So now we’d like to compare the joint distribution of \(\tilde{G}\) and \(\tilde{B}\) to the bivariate probit model above, but we still need \(\tilde{G}\) and \(\tilde{B}\) to have variance \(1\). This means dividing each variable by \(\sqrt{1+\sigma^2}\), which divides the covariance matrix by \(1+\sigma^2\) and completes the transformation:
\[
\frac{1}{1+\sigma^2}\begin{pmatrix}
1+\sigma^2 & \sigma^2\\
\sigma^2 & 1+\sigma^2
\end{pmatrix}=\begin{pmatrix}
1 & \frac{\sigma^2}{1+\sigma^2}\\
\frac{\sigma^2}{1+\sigma^2} & 1
\end{pmatrix},
\]
which implies that \(\rho\) is equal to \(\frac{\sigma^2}{1+\sigma^2}\). So if we rewrite \(U\sim N(0, \sigma^2)\) in terms of \(\rho\) to match the bivariate probit model, we find that we recover the bivariate probit exactly if
\[
U\sim N\!\left(0, \tfrac{\rho}{1-\rho}\right).
\]
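A quick check of this relationship: standardizing the covariance matrix from the simulation above (where \(\sigma = 2\)) should give an off-diagonal entry of \(\sigma^2/(1+\sigma^2) = 4/5\).
sigma <- 2
cov2cor(matrix(c(1 + sigma^2, sigma^2, sigma^2, 1 + sigma^2), nrow = 2))
sigma^2 / (1 + sigma^2)   # the implied rho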
Our formulation is quite a lot more flexible than the traditional bivariate probit: we relax the Gaussian assumption on the marginal distribution of \(U\), drop the parallel relationship between \(b_0(\cdot)\) and \(b_1(\cdot)\), and allow \(b_1\), \(b_0\) and \(g\) to be nonlinear8. The price of this extra flexibility is that \(f(u)\) is now unidentified, since we cannot learn the distribution of \(U\) even with infinite data. In contrast, identification of the correlation parameter \(\rho\) in the bivariate probit is possible, albeit at the cost of significantly less modeling flexibility.
8 Observe that when the form of \(b_1(\cdot)\), \(b_0(\cdot)\), and \(g(\cdot)\) are constrained, as in the linear probit model, the choice of the probit link becomes a substantive modeling assumption, while in our more flexible formulation it is merely a convenience.
BART: The engine behind it all
Shifting gears, we now turn to how we actually do what we described above. We have established that we want to flexibly estimate the left hand side of our main model equations9. Any machine learner worth its salt must find the delicate balance between learning complicated relationships between inputs and outputs and not chasing down noise. Many modern classification algorithms can do this, but we prefer the Bayesian additive regression trees model for the reasons detailed below.
9 There is an important modeling choice to make when fitting the left hand side. One can either fit two separate models to the \(G=1\) and \(G=0\) subsets of the data, then predict on the full set of training inputs with both models. Or, we can include \(G\) as a covariate and then predict with \(G\) set to 1 or to 0 for every observation to get our estimates of \(\Pr(B=1\mid G=1,\mathbf{x})\) and \(\Pr(B=1\mid G=0, \mathbf{x})\). We found the latter to work better in simulations for this document, as the former tended to “overfit” the groups and assign probabilities that were too high for \(\Pr(B=1\mid G=1, \mathbf{x})\) and too low for \(\Pr(B=1\mid G=0, \mathbf{x})\).
We will introduce a monotone BART variant that allows us to sidestep this choice, but if we do not want to assume a priori that a going concern makes a bankruptcy more likely, then we would have to make the modeling choice above.
The tl;dr is that BART is a carefully constructed prior that is well equipped for prediction tasks. At its core, it represents an outcome \(y\mid \mathbf{x}\) through a sum of regression trees (a “forest”). More formally:
\[
y = \underbrace{E(y\mid \mathbf{x})}_{\text{signal}}+\underbrace{\varepsilon}_{\text{noise}}=\underbrace{f(\mathbf{x})}_{\text{sum of trees}}+\varepsilon
\]
BART comes with the following advantages:
BART is a Bayesian method and provides natural uncertainty quantification, which will come in handy later.
BART is easier to modify than competitors like xgboost and random forest. Given the very reasonable monotonicity constraint we would like to incorporate, this is a big plus.
BART performed best in bankruptcy classification tasks on an actual bankruptcy dataset. Details can be found in the appendix of (Papakostas et al. 2023).
Finally, this document is lowkey an advertisement for the excellent stochtree software suite, which makes implementation fast and straightforward.
Details on the machinery of BART are provided in the slides embedded below (on desktop, hover over the slides and press f, or click the menu in the bottom left corner, to enter full-screen mode). For a slide deck with many interesting applications of BART, see here.
(Albert and Chib 1993) provide the blueprint for probit BART, which is useful when we want to “classify” a yes/no binary outcome. For a binary outcome \(y_i\), we introduce a latent variable \(z_i\) such that \(y_i = \textbf{1}\{z_i > 0\}\).
The latent \(z_i\) variables are drawn \(N(f(\mathbf{x}_i),1)\), so the variance parameter \(\sigma^2\) is now fixed at 1. Therefore, \(\Pr(Y=1\mid \mathbf{x})=\Phi(f(\mathbf{x}))\), where \(\Phi\) is the standard normal CDF. So we observe each \(y_i\) as a 1 or a 0, but we do not observe \(z_i\), the latent variable that determines the classification of \(y_i\).
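As a minimal illustration (the simulated data below are purely illustrative), probit BART can be fit in stochtree with the same probit_outcome_model flag used by the helper functions later in this document.
library(stochtree)

set.seed(1)
n <- 500
x <- matrix(runif(n * 2, -1, 1), ncol = 2)
y <- rbinom(n, 1, pnorm(x[, 1] - 0.5 * x[, 2]))   # toy binary outcome

fit <- stochtree::bart(
  X_train = x, y_train = y, X_test = x,
  num_gfr = 20, num_burnin = 0, num_mcmc = 200,
  general_params = list(sample_sigma2_global = FALSE, probit_outcome_model = TRUE)
)
p_hat <- rowMeans(pnorm(fit$y_hat_test))   # posterior mean of Pr(Y = 1 | x)
head(round(p_hat, 3))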
The Gibbs scheme looks like this:
The BART forest is sampled given the latent outcome \(Z_i\). The means from the forests are referred to as \(\eta_{1}\) and \(\eta_{0}\).
The latent outcome \(Z_i\) terms are sampled from normal distributions given the observed outcome and the \(\eta_{1}\) (if \(y=1\) and \(\eta_{0}\) (if \(y=0\)) terms.
Here the BART forest is the standard sum-of-trees representation
\[
f(\mathbf{x})=\sum_{j=1}^{m} g\left(\mathbf{x}; T_j, M_j\right),
\]
where \(m\) signifies the number of trees, \(T_j\) are the trees in a forest, and \(M_j\) are the parameters associated with tree \(j\), such as the means in its bottom nodes.
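The latent update in step 2 is a truncated normal draw; here is a tiny sketch of that step in isolation, using the same inverse-CDF trick that the monotone sampler below uses (the eta and y values are illustrative).
eta <- c(-0.3, 0.8, 0.1)   # illustrative forest fits f(x_i)
y   <- c(0, 1, 1)          # observed binary outcomes
z   <- numeric(length(y))

# y = 1: draw z from N(eta, 1) truncated to (0, Inf)
u1 <- runif(sum(y == 1), pnorm(0, eta[y == 1], 1), 1)
z[y == 1] <- qnorm(u1, eta[y == 1], 1)

# y = 0: draw z from N(eta, 1) truncated to (-Inf, 0]
u0 <- runif(sum(y == 0), 0, pnorm(0, eta[y == 0], 1))
z[y == 0] <- qnorm(u0, eta[y == 0], 1)
z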
Monotonicity constraint
We want \(\Pr(B|G=1,\mathbf{x})\geq \Pr(B|G=0,\mathbf{x})\), so we parameterize \(\Pr(B=1\mid G, \mathbf{x})\) as follows:
\[\begin{equation}
\begin{split}
\Pr(B=1\mid G=1, \mathbf{x}) &= \Phi[h_1(\mathbf{x})],\\
\Pr(B=1\mid G=0, \mathbf{x}) &= \Phi[h_0(\mathbf{x})]\Pr(B=1\mid G=1, \mathbf{x}),
\\
&= \Phi[h_0(\mathbf{x})]\Phi[h_1(\mathbf{x})],\\
\Pr(G=1\mid \mathbf{x}) &= \Phi[w(\mathbf{x})].
\end{split}
\end{equation}\] Where \(h_0(\mathbf{x})\) and \(h_1(\mathbf{x})\) are assigned BART priors.
Unfortunately, the \(\Phi[h_0(\mathbf{x})]\Phi[h_1(\mathbf{x})]\) term will be a problem. When we write the likelihood function across the firms (indexed by \(i\)), the \(\Pr(B=1, G=0\mid \mathbf{x})\) term will look like \(\prod_{i:\, G_i=0}\left(\Phi[h_0(\mathbf{x}_i)]\Phi[h_1(\mathbf{x}_i)]\right)^{B_i}\left(1-\Phi[h_0(\mathbf{x}_i)]\Phi[h_1(\mathbf{x}_i)]\right)^{1-B_i}\). If we take the log of this term, the first factor splits nicely into additive terms involving \(h_0\) and \(h_1\) separately. However, the \(\left(1-\Phi[h_0(\mathbf{x}_i)]\Phi[h_1(\mathbf{x}_i)]\right)^{1-B_i}\) term continues to be a headache. The solution: we use a data-augmented representation that permits updating \(h_0\) and \(h_1\) independently using standard MCMC for probit BART.
To begin, note that the first term above (corresponding to \(G=1\)) involves only \(h_1\) so we only need to augment data in the \(G=0\) “arm”. When \(G=0\), we relate \(B\) to two independent binary latent variables \(R_0\) and \(R_1\) via \(B=R_0\cdot R_1\) , which leads to the following expressions: \[
\begin{align}
\Pr(R_0=1\mid\mathbf{x}, G=0) &= \Phi(h_0(\mathbf{x})),\\
\Pr(R_1=1\mid\mathbf{x}, G=0) &= \Phi(h_1(\mathbf{x}))\\
B &= 1 \text{ if } R_{0} = R_{1} = 1\text{ and } 0 \text{ otherwise.} \\
B&=R_0R_1
\end{align}
\]
Marginalizing over \(R_0\) and \(R_1\), we would see that
\(\Pr(B=1\mid\mathbf{x}, G=0) = \Phi(h_0(\mathbf{x}))\Phi(h_1(\mathbf{x}))\) and \(\Pr(B=0\mid\mathbf{x}, G=0) = 1-\Phi(h_0(\mathbf{x}))\Phi(h_1(\mathbf{x}))\), which matches the expressions we had above. The latent variables \(R_0\) and \(R_1\) serve to make the likelihood function factorable, which then enables standard probit BART MCMC calculations. We can see this when we include the latent variables, as the augmented likelihood contribution for the bankruptcy outcome of a firm with \(G_i=0\) becomes:
\[
\left[\Phi\left(h_0(\mathbf{x}_i)\right)^{R_{0i}}\left(1-\Phi\left(h_0(\mathbf{x}_i)\right)\right)^{1-R_{0i}}\right]\left[\Phi\left(h_1(\mathbf{x}_i)\right)^{R_{1i}}\left(1-\Phi\left(h_1(\mathbf{x}_i)\right)\right)^{1-R_{1i}}\right]\textbf{1}\{B_i=R_{0i}R_{1i}\}.
\]
After rearranging terms, we have two separate probit likelihoods in \(h_0\) and \(h_1\) (and the domain restriction in the last term). Conditional on \(R_0, R_1\) we can update \(h_0, h_1\) using standard probit BART MCMC steps. To update the latent variables \(R_{0i}\) and \(R_{1i}\), first note that they are fixed at 1 when \(B_i=1, G_i=0\). When \(B_i=0, G_i=0\), \(R_i\equiv (R_{0i}, R_{1i})\) is sampled from:
\[
\Pr\left(R_{0i}=r_0, R_{1i}=r_1\mid B_i=0, G_i=0, \mathbf{x}_i\right)\propto \Phi\left(h_0(\mathbf{x}_i)\right)^{r_0}\left(1-\Phi\left(h_0(\mathbf{x}_i)\right)\right)^{1-r_0}\,\Phi\left(h_1(\mathbf{x}_i)\right)^{r_1}\left(1-\Phi\left(h_1(\mathbf{x}_i)\right)\right)^{1-r_1},\qquad (r_0,r_1)\neq(1,1).
\]
Notice, this is the joint probability distribution of the latent variables that was specified by:
\[
\begin{align}
\Pr(R_0=1\mid\mathbf{x}, G=0) &= \Phi(h_0(\mathbf{x})),\\
\Pr(R_1=1\mid\mathbf{x}, G=0) &= \Phi(h_1(\mathbf{x}))
\end{align}
\] albeit truncated away from the \(R_{0i} = R_{1i}=1\) region. stochtree simplifies writing up this bespoke sampler. The sampling procedure described above is a data-augmented Gibbs sampler. Without going into too much detail, this means we sample the elements of the model iteratively. First, we sample the BART forests (the \(h_0\) and \(h_1\) terms), which yields the fitted means \(\eta_{h_0}\) and \(\eta_{h_1}\); then we sample \(R_0\) and \(R_1\) given those means; and finally we sample the latent Gaussian variables (\(z_i\) in the notation from the probit model) given \(R_0\), \(R_1\), and the \(\eta_{h_0}\) and \(\eta_{h_1}\) terms. This procedure constitutes one “iteration” or step of the sampler, which is repeated many times to explore the posterior of \(\Pr(B\mid \mathbf{x}, G)\).
Having the BART forests on hand at each step of the sampler in an R/Python interface greatly simplifies this process. Otherwise, you would have to edit C++ code to ensure the forests are updated at each iteration with the new values of the other variables in the Gibbs sampler, which is a painstaking process if you are not well versed in a lower-level language like C++.
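Before walking through the full implementation below, here is the \((R_0, R_1)\) update in isolation for a single firm with \(B_i = 0\) and \(G_i = 0\); the values of \(h_0(\mathbf{x}_i)\) and \(h_1(\mathbf{x}_i)\) are illustrative stand-ins for the forest fits.
h0 <- -0.4; h1 <- 0.3   # illustrative forest fits at x_i

# Weights for the three admissible pairs, proportional to the truncated joint above
w <- c((1 - pnorm(h0)) * (1 - pnorm(h1)),   # (R0, R1) = (0, 0)
       (1 - pnorm(h0)) * pnorm(h1),         # (R0, R1) = (0, 1)
       pnorm(h0) * (1 - pnorm(h1)))         # (R0, R1) = (1, 0)

pick <- sample(1:3, size = 1, prob = w / sum(w))
c(R0 = as.numeric(pick == 3), R1 = as.numeric(pick == 2))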
options(warn=-1)suppressMessages(library(stochtree))mon_probit =function(X, g, b){# Set number of iterations num_warmstart <-20 num_mcmc <-500 num_burnin =0 num_samples <- num_burnin + num_warmstart + num_mcmc num_trees_h1 <-40 num_trees_h0 <-40 alpha =0.95 beta =2 feature_types <-as.integer(rep(0, ncol(X))) # 0 = numeric var_weights <-rep(1/ncol(X),ncol(X)) cutpoint_grid_size =100 global_variance_init =1. tau_init =1/num_trees_h0 leaf_prior_scale =matrix(c(tau_init), ncol =1) nu <-4 lambda <-0.5 a_leaf <-2. b_leaf <-0.5 leaf_regression <- F # fit a constant leaf mean BART model min_samples_leaf <-2 max_depth <-15# Random number generator (std::mt19937) random_seed =123if (is.null(random_seed)) { rng <-createCppRNG(-1) } else { rng <-createCppRNG(random_seed) } X_h0 <-as.matrix(X[g==0,]) n0 <-sum(g==0) n1 <-sum(g==1)# Instantiate the C++ dataset objects forest_dataset_h0 <-createForestDataset(X_h0) forest_dataset_h1 <-createForestDataset(X)# No leaf regression outcome_model_type <-0# Set up model configuration objects forest_model_config_h1 <-createForestModelConfig(feature_types = feature_types, num_trees = num_trees_h1, num_features =ncol(X),num_observations =nrow(X), variable_weights = var_weights, leaf_dimension =1,alpha = alpha, beta = beta, min_samples_leaf = min_samples_leaf,max_depth = max_depth, leaf_model_type = outcome_model_type,leaf_model_scale = leaf_prior_scale, cutpoint_grid_size = cutpoint_grid_size ) forest_model_config_h0 <-createForestModelConfig(feature_types = feature_types, num_trees = num_trees_h0, num_features =ncol(X_h0),num_observations =nrow(X_h0), variable_weights = var_weights, leaf_dimension =1,alpha = alpha, beta = beta, min_samples_leaf = min_samples_leaf,max_depth = max_depth, leaf_model_type = outcome_model_type,leaf_model_scale = leaf_prior_scale, cutpoint_grid_size = cutpoint_grid_size ) global_model_config <-createGlobalModelConfig(global_error_variance =1)# Instantiate the sampling data structures forest_model_h1 <-createForestModel(forest_dataset_h1, forest_model_config_h1, global_model_config) forest_model_h0 <-createForestModel(forest_dataset_h0, forest_model_config_h0, global_model_config)# Instantiate containers of forest samples forest_samples_h1 <-createForestSamples(num_trees_h1, 1, T) forest_samples_h0 <-createForestSamples(num_trees_h0, 1, T)# Instantiate "active" forests active_forest_h1 <-createForest(num_trees_h1, 1, T) active_forest_h0 <-createForest(num_trees_h0, 1, T)# Initialize the Markov chain# Initialize (R0, R1), the latent binary variables that enforce the monotonicty b1 <- b[g==1] b0 <- b[g==0] R1 =rep(NA,n0) R0 =rep(NA,n0) R1[b0==1] <-1 R0[b0==1] <-1 R1[b0 ==0] <-sample(c(0,1),sum(b0==0),replace=TRUE) R0[b0 ==0] <-0# The first n1 observations of b_aug are actually observed# The next n0 of them are the latent variable R1 b_aug <-c(b1, R1)# Initialize the Albert and Chib latent Gaussian variables g_h1 <- (2*as.numeric(b_aug) -1) g_h0 <- (2*as.numeric(R0)-1) g_h1 <- g_h1/sd(g_h1) g_h0 <- g_h0/sd(g_h0)# Pass these variables to the BART models as outcome variables outcome_h1 <-createOutcome(g_h1) outcome_h0 <-createOutcome(g_h0)# Initialize active forests to constant (0) predictions active_forest_h1$prepare_for_sampler(forest_dataset_h1, outcome_h1, forest_model_h1, outcome_model_type, 0.0) active_forest_h0$prepare_for_sampler(forest_dataset_h0, outcome_h0, forest_model_h0, outcome_model_type, 0.0) active_forest_h1$adjust_residual(forest_dataset_h1, outcome_h1, forest_model_h1, FALSE, FALSE) 
active_forest_h0$adjust_residual(forest_dataset_h0, outcome_h0, forest_model_h0, FALSE, FALSE)# PART IV: run the Markov chain# Initialige the Markov chain with num_warmstart grow-from-root iterations gfr_flag <- Tfor (i in1:num_samples) {# Switch over to random walk Metropolis-Hastings tree updates after num_warmstartif (i > num_warmstart) { gfr_flag <- F }# Step 1: Sample the BART forests# Sample forest for the function h1 based on (b_1, R1) forest_model_h1$sample_one_iteration( forest_dataset_h1, outcome_h1, forest_samples_h1, active_forest_h1, rng, forest_model_config_h1, global_model_config,keep_forest = T, gfr = gfr_flag )# Sample forest for the function h0 based on outcome R0 forest_model_h0$sample_one_iteration( forest_dataset_h0, outcome_h0, forest_samples_h0, active_forest_h0, rng, forest_model_config_h0, global_model_config,keep_forest = T, gfr = gfr_flag )# Extract the means for use in sampling the latent variables eta_h1 <- forest_samples_h1$predict_raw_single_forest(forest_dataset_h1, i-1) eta_h0 <- forest_samples_h0$predict_raw_single_forest(forest_dataset_h0, i-1)# Step 2: sample the latent binary pair (R0, R1) given eta_h0, eta_h1, and b_0# Three cases: (0,0), (0,1), (1,0) w1 <- (1-pnorm(eta_h0[b0==0]))*(1-pnorm(eta_h1[n1 +which(b0==0)])) w2 <- (1-pnorm(eta_h0[b0==0]))*pnorm(eta_h1[n1 +which(b0==0)]) w3 <-pnorm(eta_h0[b0==0])*(1-pnorm(eta_h1[n1 +which(b0==0)])) s <- w1 + w2 + w3 w1 <- w1/s w2 <- w2/s w3 <- w3/s u <-runif(sum(b0==0)) temp <-1*(u < w1) +2*(u > w1 & u < (w1 + w2)) +3*(u > (w1 + w2)) R1[b0==0] <-1*(temp==2) R0[b0==0] <-1*(temp==3)# Redefine b with the updated R1 component b_aug <-c(b1, R1)# Step 3: sample the latent normals, given (R0, R1) and b_aug# First g0 U1 <-runif(sum(R0),pnorm(0, eta_h0[R0==1],1),1) g_h0[R0==1] <-qnorm(U1, eta_h0[R0==1],1) U0 <-runif(n0 -sum(R0),0, pnorm(0, eta_h0[R0==0],1)) g_h0[R0==0] <-qnorm(U0, eta_h0[R0==0],1)# Then g1 U1 <-runif(sum(b_aug),pnorm(0, eta_h1[b_aug==1],1),1) g_h1[b_aug==1] <-qnorm(U1, eta_h1[b_aug==1],1) U0 <-runif(n -sum(b_aug),0, pnorm(0, eta_h1[b_aug==0],1)) g_h1[b_aug==0] <-qnorm(U0, eta_h1[b_aug==0],1)# Propagate the updated outcomes through the BART models outcome_h0$update_data(g_h0) forest_model_h0$propagate_residual_update(outcome_h0) outcome_h1$update_data(g_h1) forest_model_h1$propagate_residual_update(outcome_h1)# No more steps, just repeat a bunch of times }# Forest predictions# The h1 dataset includes all the X, which is where we care to test here. preds_h0 <-pnorm(forest_samples_h0$predict(forest_dataset_h1)) preds_h1 <-pnorm(forest_samples_h1$predict(forest_dataset_h1))#bad practice in python, fine in R# Since we estimated h1 and h0, multiply them together to ensure h0<h1# since they are probabilities on the 0-1 scale preds_smaller <- preds_h1*preds_h0 preds_larger <- preds_h1return(list(B0=data.frame(B0 = preds_smaller[, num_warmstart:num_samples]),B1 =data.frame(B1 = preds_larger[,num_warmstart:num_samples])))}
With the monotone BART helper function done, we now embed it inside another helper function which uses BART to learn the three probabilities we need. The main trick here is to convert from the conditional and marginal probabilities (\(\Pr(B\mid G,\mathbf{x})\) and \(\Pr(G\mid \mathbf{x})\)) to the joint probabilities in our system of equations. Conveniently, the probit model is built into stochtree (see here).
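The conversion itself is just the product rule; for one firm with illustrative fitted values it looks like this.
pB_G1 <- 0.30; pB_G0 <- 0.10; pG <- 0.25   # illustrative fitted probabilities
c(ProbB1G1 = pG * pB_G1,                   # Pr(B=1, G=1 | x)
  ProbB1G0 = (1 - pG) * pB_G0,             # Pr(B=1, G=0 | x)
  ProbB0G1 = pG * (1 - pB_G1))             # Pr(B=0, G=1 | x)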
Helper function for the BART predictions
options(warn=-1)BARTpred=function(df, treat='G', Outcome='B',vars, mono=T,index=4, data_full=T){ covtreat=df[df[treat]==1,] covcontrol=df[df[treat]==0,]#case 1, the treatmentif (mono==F){ num_warmstart <-20 num_burnin <-0 num_mcmc <-500 num_samples <- num_warmstart + num_burnin + num_mcmc general_params <-list(sample_sigma2_global = F, probit_outcome_model = T) mean_forest_params <-list(sample_sigma2_leaf = F, num_trees =50, max_depth =15, min_samples_leaf=2)# The first two are for the "T-learner" approach# Train BART on G=1 and G=0 separate# B=1|G=1, test on all training inputs with each model bart1 = stochtree::bart(X_train =as.matrix(covtreat[vars]),y_train = covtreat[Outcome][,1],X_test =as.matrix(df[vars]),num_gfr = num_warmstart, num_burnin = num_burnin, num_mcmc = num_mcmc, general_params = general_params, mean_forest_params = mean_forest_params)#case 2 control # B=1|G=0 bart2 = stochtree::bart(X_train =as.matrix(covcontrol[vars]),y_train = covcontrol[Outcome][,1],X_test =as.matrix(df[vars]),num_gfr = num_warmstart, num_burnin = num_burnin, num_mcmc = num_mcmc, general_params = general_params, mean_forest_params = mean_forest_params)#case 3 propensity: Learn G|x bart3 = stochtree::bart(X_train =as.matrix(df[vars]),y_train = df[treat][,1],X_test =as.matrix(df[vars]),num_gfr = num_warmstart, num_burnin = num_burnin, num_mcmc = num_mcmc, general_params = general_params, mean_forest_params = mean_forest_params)# So called "S-learner"# Include G as a covariate, test on all G=1 and all G=0 bart4 = stochtree::bart(X_train =as.matrix(df[c(vars,treat)]),y_train = df[Outcome][,1],num_gfr = num_warmstart, num_burnin = num_burnin, num_mcmc = num_mcmc, general_params = general_params, mean_forest_params = mean_forest_params)#PB1G1# T-learner pred1 =rowMeans(pnorm(bart1$y_hat_test))# S-learner pred1 =predict(bart4, as.matrix(cbind(df[c(vars)], rep(1, nrow(df))))) pred1 =rowMeans(pnorm(pred1$y_hat))#PB1G0# T-learner pred2 =rowMeans(pnorm(bart2$y_hat_test))#S-learner pred2 =predict(bart4, as.matrix(cbind(df[c(vars)], rep(0, nrow(df))))) pred2 =rowMeans(pnorm(pred2$y_hat))#PG1 pred3 =rowMeans(pnorm(bart3$y_hat_test))# T-learner pred1_specs =pnorm(bart1$y_hat_test)[index, ]# S-learner pred1_full =predict(bart4, as.matrix(cbind(df[c(vars)], rep(1, nrow(df))))) pred1_specs =pnorm(pred1_full$y_hat)[index,]# T-learner pred2_specs =pnorm(bart2$y_hat_test)[index, ]# S-learner pred2_full =predict(bart4, as.matrix(cbind(df[c(vars)], rep(0, nrow(df))))) pred2_specs =pnorm(pred2_full$y_hat)[index,]# hefty files, gotta gorm(pred1_full)rm(pred2_full) pred3_specs =pnorm(bart3$y_hat_test)[index, ] }elseif(mono==T){ num_warmstart <-20 num_burnin <-0 num_mcmc <-500 num_samples <- num_warmstart + num_burnin + num_mcmc general_params <-list(sample_sigma2_global = F, probit_outcome_model = T, num_chains=1) mean_forest_params <-list(sample_sigma2_leaf = F, num_trees =50, max_depth=15, min_samples_leaf=2)#case 3 propensity: Learn G|x bart3 = stochtree::bart(X_train =as.matrix(df[vars]),y_train = df[treat][,1],X_test =as.matrix(df[vars]),num_gfr = num_warmstart, num_burnin = num_burnin, num_mcmc = num_mcmc, general_params = general_params, mean_forest_params = mean_forest_params)#PG1 pred3 =rowMeans(pnorm(bart3$y_hat_test))# mono fits bart_mono =mon_probit(X=as.matrix(df[vars]),g = df[treat][,1], b = df[Outcome][,1])#use this method for prediction on binary#PB1G1 pred1 =rowMeans(bart_mono[[2]])#PB1G0 pred2 =rowMeans(bart_mono[[1]]) pred1_specs = bart_mono[[2]][index, ] pred2_specs = bart_mono[[1]][index, ] pred3_specs 
=pnorm(bart3$y_hat_test)[index, ] }else{print('mono argument is T or F Boolean') }# print('made it here')if(data_full==T){ expoutcomesfun =cbind( df[,], data.frame( treatpred = pred1, notreatpred=pred2), propensity=pred3 ) expoutcomesfun2 =rbind( NULL, expoutcomesfun ) outcomenew=expoutcomesfun2[,c('propensity','notreatpred','treatpred')]####need the joints#### eps =1e-6 outcomenew[,c("prop", "B_G0", "B_G1")] =0.5*eps + (1-eps)*outcomenew[,c('propensity','notreatpred','treatpred')] outcomenew[,"ProbB1G1"] = outcomenew[,"prop"]*outcomenew[,"treatpred"] outcomenew[,"ProbB1G0"] = (1-outcomenew[,"prop"])*outcomenew[,"notreatpred"] outcomenew[,"ProbB0G1"] = outcomenew[,"prop"]*(1-outcomenew[,"treatpred"])## print('here too')#for error analysis, not super necessary indexes=seq(from=1, to=length(outcomenew$treatpred), by=1) outcomenew=as.data.frame(cbind(indexes, outcomenew)) outcomenew$indexes=as.numeric(outcomenew$indexes)row.names(outcomenew)<-NULL newoutcome4 =as.matrix(outcomenew)return(outcomenew) }else-if (data_full==F){ expoutcomesfun =cbind( data.frame( treatpred = pred1_specs[,1], notreatpred=pred2_specs[,2]), propensity = pred3_specs ) expoutcomesfun2 =rbind( NULL, expoutcomesfun ) outcomenew = expoutcomesfun2[,c('propensity','notreatpred','treatpred')]####need the joints#### eps =1e-6 outcomenew[,c("prop", "B_G0", "B_G1")] =0.5*eps + (1-eps)*outcomenew[,c('propensity','notreatpred','treatpred')] outcomenew[,"ProbB1G1"] = outcomenew[,"prop"]*outcomenew[,"treatpred"] outcomenew[,"ProbB1G0"] = (1-outcomenew[,"prop"])*outcomenew[,"notreatpred"] outcomenew[,"ProbB0G1"] = outcomenew[,"prop"]*(1-outcomenew[,"treatpred"])# print('here too')#for error analysis, not super necessary indexes=seq(from=1, to=length(outcomenew$treatpred), by=1) outcomenew=as.data.frame(cbind(indexes, outcomenew)) outcomenew$indexes=as.numeric(outcomenew$indexes)row.names(outcomenew)<-NULL newoutcome4 =as.matrix(outcomenew)return(outcomenew)}}
We now pivot to our final task, which involves writing yet another helper function. The following function demonstrates how to solve the system of equations in the main model. Of note, we also have to include a constraint in this step to ensure that \(\Pr(B=1\mid G=1, \mathbf{x})>\Pr(B=1\mid G=0, \mathbf{x})\). To do so, we borrow a useful trick from the optimization literature, which is to re-write \(b_1(\mathbf{x})\) as \(\exp(b_1(\mathbf{x}))+b_0(\mathbf{x})\). This ensures \(b_0(\mathbf{x})<b_1(\mathbf{x})\). We use the Nelder-Mead algorithm built into R's optim function to minimize the distance between the left- and right-hand sides of our main model equations (Equation 2), with squared error as the loss function, as in Equation 3.
The lambda parameter allows us to regularize the difference between \(B^1\) and \(B^0\). This could be helpful if we think we have small effects, but we recommend not doing this regularization, and would instead suggest putting a prior directly on the difference10.
10 To be fair, we do this to some extent already with the monotonic BART prior. Viewing priors as a form of regularization really seems to be far more socially acceptable to a lot of people. This guy doesn’t see anything wrong with informing an analysis with additional prior information though… no need to glaze a prior with the regularization moniker to look cool to the non-Bayesians.
Helper function to project reduced form parameters
While a simulation study is typically meant to explore the properties of a method, here we simply treat the simulated data as the ground truth we uncovered when searching for data. The reason for this is that simulated data are easier to replicate for anyone with the code, rather than forcing readers to download a large data file.
Let’s look at our data:
Click here for full code
options(warn=-1)
suppressMessages(library(gt))
set.seed(12024)
n = 1600

# Structural functions: prognostic score b0, treated score b1, and propensity score g
b0 = function(x1,x2,x3,x4,x5){
  val = -1.16 + x5 + x1*sin(2*x4)
  return(val)
}
b1 = function(x1,x2,x3,x4,x5){
  val = 0.60 + b0(x1,x2,x3,x4,x5) + 0.4*abs(x3)
  return(val)
}
g = function(x1,x2,x3,x4,x5){
  val = 0.25 + 0.75*b0(x1,x2,x3,x4,x5) + x2
  return(val)
}

# Covariates
a = 1
x1 = runif(n,-a,a)
x2 = runif(n,-a,a)
x3 = runif(n,-a,a)
x4 = runif(n,-a,a)
x5 = runif(n,-a,a)

# True density of the hidden confounder
fcorrect = function(u){
  val = dnorm(u, mean=0, sd=0.50)
  return(val)
}
u = rnorm(n, mean=0, sd=0.50)

# Simulate going concerns and bankruptcies
G_true = pnorm(g(x1,x2,x3,x4,x5) + u)
G = rbinom(n, 1, pnorm(g(x1,x2,x3,x4,x5) + u))
B = rep(NA,n)
B[G == 1] = rbinom(sum(G==1), 1, pnorm(b1(x1,x2,x3,x4,x5)[G == 1] + u[G == 1]))
B[G == 0] = rbinom(sum(G==0), 1, pnorm(b0(x1,x2,x3,x4,x5)[G == 0] + u[G == 0]))
#mean(B1.true-B0.true)/mean(B0.true)
#mean(B[G == 1]) - mean(B[G == 0])
X = as.matrix(cbind(x1,x2,x3,x4,x5))

# For the monotone probit model it is necessary to sort the observations so that the G=1 cases
# are all together at the start of the outcome vector.
index <- sort(G, decreasing = TRUE, index.return = TRUE)
X <- X[index$ix,]
G <- G[index$ix]
B <- B[index$ix]
x1 <- x1[index$ix]
x2 <- x2[index$ix]
x3 <- x3[index$ix]
x4 <- x4[index$ix]
x5 <- x5[index$ix]
u  <- u[index$ix]   # keep the hidden confounder aligned with the re-sorted data

# Our actual probabilities, even though we only see the binary form
B1 = pnorm(b1(x1,x2,x3,x4,x5)[G == 1] + u[G == 1])
B0 = pnorm(b0(x1,x2,x3,x4,x5)[G == 0] + u[G == 0])

# Calculate the true potential outcomes
tau.true = rep(NA,n)
for (k in 1:n){
  tau.true[k] = integrate(function(u) (pnorm(b1(x1[k],x2[k],x3[k],x4[k],x5[k]) + u) -
                                        pnorm(b0(x1[k],x2[k],x3[k],x4[k],x5[k]) + u))*fcorrect(u),
                          lower = -Inf, upper = +Inf)$value
}
B1.true = rep(NA,n)
for (k in 1:n){
  B1.true[k] = integrate(function(u) pnorm(b1(x1[k],x2[k],x3[k],x4[k],x5[k]) + u)*fcorrect(u),
                         lower = -Inf, upper = +Inf)$value
}
B0.true = rep(NA,n)
for (k in 1:n){
  B0.true[k] = integrate(function(u) pnorm(b0(x1[k],x2[k],x3[k],x4[k],x5[k]) + u)*fcorrect(u),
                         lower = -Inf, upper = +Inf)$value
}
#mean(B1)-mean(B0)
#print(table(B,G))
hist(G_true, 40, col='#073d6d', main='Probability of going concern being issued')
gt(data.frame(Label = c('True average inducement effect',
                        'Naive estimate of average inducement effect',
                        'Ratio of treatment to prognostic'),
              outcome = c(round(mean(B1.true - B0.true), 3),
                          round(mean(B[G == 1]) - mean(B[G == 0]), 3),
                          round(mean(B1.true - B0.true)/mean(B0.true), 3))))
Label                                         outcome
True average inducement effect                0.203
Naive estimate of average inducement effect   0.441
Ratio of treatment to prognostic              1.023
Since we are using simulated data in place of real data, rather than running a rigorous simulation study, it is important to note that we designed these data to mimic the real world in critical ways (which the table at the end of the code checks). These are:
The treatment to prognostic ratio is small. That is, the treatment effect from the hypothetical going concern is not the driving force behind bankruptcy. The fundamentals of the hypothetical company, which lead to a going concern issuance, are.
The hypothetical bankrupt and non-bankrupt worlds for a firm are generated from complicated functions. This is meant to make the problem challenging in a way that would make the old-school bivariate probit model falter.
Heterogeneous treatment effects. The treatment effect of the going concern depends on the \(\mathbf{x}_3\) variable. The heterogeneity is not super complicated, as the effect is dominated by the constant added on top of the \(\mathbf{x}_3\) term.
Realistic bankruptcy probabilities. We create data such that most firms have a low probability of bankruptcy, which generally makes sense.
Realistic going concern opinion probabilities. The going concern probability should be similar to the “prognostic” probability of bankruptcy. This assumption is not crazy at all, because it implies a selection effect where companies more at risk of bankruptcy are more likely to be issued a going concern, which was the exact motivation for this undertaking.
Some additional notes. The data generating process insists that the probability of bankruptcy is greater if a going concern is issued than if one is not. Of course, this need not be the case, and if we do not believe it, we simply turn off the constraint flags in the integration code and the monotone BART code. Finally, we assumed that the “hidden information”, i.e. the confounding we do not observe, follows a normal distribution with mean 0 and standard deviation 0.5. Obviously, we would not know this in practice, and we will proceed as if we do not, to mimic the sensitivity analysis one should perform when using this methodology.
Sensitivity analysis
We now proceed with a sensitivity analysis to illustrate how this type of analysis goes. We do some interrogation of the methodology, but mostly present results for these hypothetical data.
The top plot shows the inducement effects assuming the hidden confounding follows a normal distribution with mean \(0\) and standard deviation \(0.50\)11. The bottom panel describes how well the estimates line up with the known truth. In reality, we do not know the true causal inducement effects, but since these data are simulated, we can use this plot to test the methodology.
11 Conveniently, the exact same distribution as the known ground truth.
In the next plot, \(f(u)\) is distributed \(N(0, \sigma=0.25)\). The smaller variance on \(u\) means we anticipate less hidden information. This corresponds to an analysis where we believe we have collected nearly enough data to satisfy ignorability. The lower variance of the “hidden confounding” means our estimates of the inducement effects should be close to the causal effect calculated after adjusting only for the observed covariates. We therefore expect to overestimate the inducement effects relative to the ground truth in the bottom panel.
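A sketch of that run follows the same pattern as the sharkfin blocks below; integrate_function and intframe are the helper objects defined in the collapsed helper code above, so this snippet assumes those have already been run.
f_narrow <- function(u) dnorm(u, mean = 0, sd = 0.25)   # less hidden information
treat_frame_narrow <- integrate_function(intframe, constraint = T, f = f_narrow,
                                         n_cores = parallel::detectCores() - 1, lambda = 0)
summary(treat_frame_narrow[, 'B1'] - treat_frame_narrow[, 'B0'])   # inducement effects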
A more flexible choice for \(f(u)\) is an asymmetric “sharkfin” density, built by gluing together two half-normal pieces with different scales:
\[
f_{\text{shark}}(u) = \begin{cases} 2q\, \phi(u;\, s) & u < 0, \\ 2(1-q)\, \phi\!\left(u;\, s\tfrac{1-q}{q}\right) & u \geq 0, \end{cases}
\]
where \(\phi(\cdot\,; s)\) is the pdf of a mean-zero normal distribution with standard deviation \(s\), and \(q = \mbox{Pr}(U < 0)\) controls the skewness.
The distribution looks like a shark’s fin.
The sharkfin below puts more mass on the positive side of the confounding, which in our model corresponds to hidden information that is detrimental to a company’s health. Figure 4 of (Papakostas et al. 2023) compares how the sharkfin distribution diminishes the observed inducement effect relative to a normal distribution with the same variance12. We calculate \(q\) and \(s\) such that the standard deviation of the sharkfin is \(0.5\).
12 The implied variance of the shark fin distribution is given by (using properties of the truncated normal distribution) \[
\begin{align}
\mu_{\text{comb}}&=q\cdot \mu_{+}+(1-q)\cdot \mu_{-}\\
\sigma_{\text{comb}}^2&=q\cdot\left(\sigma_+^2+\mu_{+}^2\right)+(1-q)\cdot\left(\sigma_-^2+\mu_{-}^2\right)-\mu_{\text{comb}}^2\end{align}
\]
options(warn=-1)# Calculate implied variance q =0.1# Variance we want var =0.5^2 a=dnorm(0) b=(dnorm(0)/(1-pnorm(0)))# calculated variance given q of the sharkfin var_shark=(1-q)*a*((1-q)/q)^2+ q*a + (b*(1+ (1-q)/q))^2*q*(1-q)# multiply that variance by s to get back the expected on# s*var_shark = var_wanted s =sqrt(var/var_shark) s =sqrt(var/var_shark)sharkfin =function(z){ val =rep(NA, length(z)) val[z <0] =2*q*dnorm(z[z<0], sd = s) val[z>0] =2*(1-q)*dnorm(z[z>0],0,sd = s*(1-q)/q)return(val) }z =seq(from=-4,to=4, length.out=1000)ncores <-detectCores() -1treat_frame=integrate_function(intframe, constraint=T, f=sharkfin, n_cores=ncores, lambda=0)dataset=data.frame(covariates, diff=treat_frame[, 'B1']-treat_frame[, 'B0'])u_shark =data.frame(x=z,y=sharkfin(z))%>%ggplot(aes(x=x,y=y))+geom_line(color='#012296', lwd=2)+ggtitle('Sharkfin')+theme_minimal(base_size=26)+theme(plot.title =element_text(hjust =0.5,size=28))u_shark_tau = dataset%>%ggplot(aes(x=diff))+geom_histogram(aes(y=..count../sum(..count..)) ,color='white' ,fill='#1d2951', bins=25)+xlim(-0.05, 0.65)+ggtitle('All observations')+ylab('Density')+xlab('Inducement effects across all companies')+theme_minimal(base_size =26)+theme(plot.title =element_text(hjust =0.5,size=28))df_temp =data.frame(estimated = dataset$diff, true = B1.true-B0.true )u_shark_comp = df_temp%>%ggplot(aes(x=true, y=estimated))+geom_point(color='#073d6d', size=1.4, alpha=0.88)+geom_abline(a=0,b=1,lwd=2.75, color='#d47c17')+ggtitle('Comparison if we actually knew the truth')+ylab('Estimated inducement effects')+xlab('True inducement effects')+theme_minimal(base_size =22)+theme(plot.title =element_text(hjust =0.5,size=22))grid.arrange(u_shark, u_shark_tau, u_shark_comp, layout_matrix =matrix(c(1, 1, 2, 2, 4, 3, 3, 4), nrow =2, byrow =TRUE))
Evidently, despite providing the correct variance for \(u\), the shape of the sharkfin led to worse estimates of the inducement effect, which is often slightly under-estimated under this distribution. What if we flipped the asymmetry of the sharkfin? We now choose the parameters to put more mass in the \(u<0\) regime, indicating that the hidden confounding should pull the bankruptcy probabilities closer to 0. Again, we choose the \(q\) and \(s\) parameters so that the standard deviation of this sharkfin is \(0.5\).
Click here for full code
options(warn=-1)# calculate implied variance q =0.9# Variance we want var =0.5^2 a=dnorm(0) b=(dnorm(0)/(1-pnorm(0)))# calculated variance given q of the sharkfin var_shark=(1-q)*a*((1-q)/q)^2+ q*a + (b*(1+ (1-q)/q))^2*q*(1-q)# multiply that variance by s to get back the expected on# s*var_shark = var_wanted s <-sqrt(var/var_shark)f =function(z){ val =rep(NA, length(z)) val[z <0] =2*q*dnorm(z[z<0], sd = s) val[z>0] =2*(1-q)*dnorm(z[z>0],0,sd = s*(1-q)/q)return(val)}z =seq(from=-4,to=4, length.out=1000)ncores <-detectCores() -1treat_frame=integrate_function(intframe, constraint=T, f=f, n_cores=ncores, lambda=0)dataset=data.frame(covariates, diff=treat_frame[, 'B1']-treat_frame[, 'B0'])u_shark2 =data.frame(x=z,y=f(z))%>%ggplot(aes(x=x,y=y))+geom_line(color='#012296', lwd=2)+ggtitle('Sharkfin')+theme_minimal(base_size=26)+theme(plot.title =element_text(hjust =0.5,size=28))u_shark_tau2 = dataset%>%ggplot(aes(x=diff))+geom_histogram(aes(y=..count../sum(..count..)) ,color='white' ,fill='#1d2951', bins=25)+xlim(-0.05, 0.65)+ggtitle('All observations')+ylab('Density')+xlab('Inducement effects across all companies')+theme_minimal(base_size =26)+theme(plot.title =element_text(hjust =0.5,size=28))df_temp =data.frame(estimated = dataset$diff, true = B1.true-B0.true )u_shark_comp2 = df_temp%>%ggplot(aes(x=true, y=estimated))+geom_point(color='#073d6d', size=1.4, alpha=0.88)+geom_abline(a=0,b=1,lwd=2.75, color='#d47c17')+ggtitle('Comparison if we actually knew the truth')+ylab('Estimated inducement effects')+xlab('True inducement effects')+theme_minimal(base_size =22)+theme(plot.title =element_text(hjust =0.5,size=22))grid.arrange(u_shark2, u_shark_tau2, u_shark_comp2, layout_matrix =matrix(c(1, 1, 2, 2, 4, 3, 3, 4), nrow =2, byrow =TRUE))
Now we over-estimate the true inducement effects (only a little), as expected.
The next two plots look at the same sharkfin shapes, but now with implied standard deviations of \(1\), meaning we should underestimate the inducement effects as they are pulled towards zero. Let’s see how each shape impacts that.
Click here for full code
options(warn=-1)# calculate implied variance q =0.1# Variance we want var =1^2 a=dnorm(0) b=(dnorm(0)/(1-pnorm(0)))# calculated variance given q of the sharkfin var_shark=(1-q)*a*((1-q)/q)^2+ q*a + (b*(1+ (1-q)/q))^2*q*(1-q)# multiply that variance by s to get back the expected on# s*var_shark = var_wanted s <-sqrt(var/var_shark)f =function(z){ val =rep(NA, length(z)) val[z <0] =2*q*dnorm(z[z<0], sd = s) val[z>0] =2*(1-q)*dnorm(z[z>0],0,sd = s*(1-q)/q)return(val)}z =seq(from=-4,to=4, length.out=1000)ncores <-detectCores() -1treat_frame=integrate_function(intframe, constraint=T, f=f, n_cores=ncores, lambda=0)dataset=data.frame(covariates, diff=treat_frame[, 'B1']-treat_frame[, 'B0'])u_shark4 =data.frame(x=z,y=f(z))%>%ggplot(aes(x=x,y=y))+geom_line(color='#012296', lwd=2)+ggtitle('Sharkfin')+theme_minimal(base_size=26)+theme(plot.title =element_text(hjust =0.5,size=28))u_shark_tau4 = dataset%>%ggplot(aes(x=diff))+geom_histogram(aes(y=..count../sum(..count..)) ,color='white' ,fill='#1d2951', bins=25)+xlim(-0.05, 0.65)+ggtitle('All observations')+ylab('Density')+xlab('Inducement effects across all companies')+theme_minimal(base_size =26)+theme(plot.title =element_text(hjust =0.5,size=28))df_temp =data.frame(estimated = dataset$diff, true = B1.true-B0.true )u_shark_comp4 = df_temp%>%ggplot(aes(x=true, y=estimated))+geom_point(color='#073d6d', size=1.4, alpha=0.88)+geom_abline(a=0,b=1,lwd=2.75, color='#d47c17')+ggtitle('Comparison if we actually knew the truth')+ylab('Estimated inducement effects')+xlab('True inducement effects')+theme_minimal(base_size =22)+theme(plot.title =element_text(hjust =0.5,size=22))grid.arrange(u_shark4, u_shark_tau4, u_shark_comp4, layout_matrix =matrix(c(1, 1, 2, 2, 4, 3, 3, 4), nrow =2, byrow =TRUE))
options(warn = -1)

# calculate implied variance
q = 0.9
# Variance we want
var = 1^2
a = dnorm(0)
b = (dnorm(0) / (1 - pnorm(0)))
# variance of the sharkfin implied by this q
var_shark = (1 - q)*a*((1 - q)/q)^2 + q*a + (b*(1 + (1 - q)/q))^2*q*(1 - q)
# choose the scale s so that s^2 * var_shark = var (the variance we want)
s <- sqrt(var / var_shark)

f = function(z){
  val = rep(NA, length(z))
  val[z < 0] = 2*q*dnorm(z[z < 0], sd = s)
  val[z > 0] = 2*(1 - q)*dnorm(z[z > 0], 0, sd = s*(1 - q)/q)
  return(val)
}

z = seq(from = -4, to = 4, length.out = 1000)
ncores <- detectCores() - 1
treat_frame = integrate_function(intframe, constraint = T, f = f, n_cores = ncores, lambda = 0)
dataset = data.frame(covariates, diff = treat_frame[, 'B1'] - treat_frame[, 'B0'])

u_shark3 = data.frame(x = z, y = f(z)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_line(color = '#012296', lwd = 2) +
  ggtitle('Sharkfin') +
  theme_minimal(base_size = 26) +
  theme(plot.title = element_text(hjust = 0.5, size = 28))

u_shark_tau3 = dataset %>%
  ggplot(aes(x = diff)) +
  geom_histogram(aes(y = ..count.. / sum(..count..)), color = 'white', fill = '#1d2951', bins = 25) +
  xlim(-0.05, 0.65) +
  ggtitle('All observations') +
  ylab('Density') +
  xlab('Inducement effects across all companies') +
  theme_minimal(base_size = 26) +
  theme(plot.title = element_text(hjust = 0.5, size = 28))

df_temp = data.frame(estimated = dataset$diff, true = B1.true - B0.true)

u_shark_comp3 = df_temp %>%
  ggplot(aes(x = true, y = estimated)) +
  geom_point(color = '#073d6d', size = 1.4, alpha = 0.88) +
  geom_abline(intercept = 0, slope = 1, lwd = 2.75, color = '#d47c17') +
  ggtitle('Comparison if we actually knew the truth') +
  ylab('Estimated inducement effects') +
  xlab('True inducement effects') +
  theme_minimal(base_size = 22) +
  theme(plot.title = element_text(hjust = 0.5, size = 22))

grid.arrange(u_shark3, u_shark_tau3, u_shark_comp3,
             layout_matrix = matrix(c(1, 1, 2, 2, 4, 3, 3, 4), nrow = 2, byrow = TRUE))
Unsurprisingly, the sharkfin with more mass on positive values (confusingly labeled “left skewed”) dissipates the inducement effect more quickly.
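As a quick numerical sanity check on the scale calculation used in the code above, the sketch below repeats it for one case (q = 0.9 with a target variance of 1, chosen here only for illustration), verifies that the resulting sharkfin density integrates to one, and reports how close its numerically computed variance lands to the target.

# Sanity check of the sharkfin scaling for one illustrative case (q = 0.9, target variance 1).
q <- 0.9
var_target <- 1
a <- dnorm(0)
b <- dnorm(0) / (1 - pnorm(0))
var_shark <- (1 - q) * a * ((1 - q) / q)^2 + q * a + (b * (1 + (1 - q) / q))^2 * q * (1 - q)
s <- sqrt(var_target / var_shark)

f <- function(z) {
  val <- numeric(length(z))
  val[z < 0]  <- 2 * q * dnorm(z[z < 0], sd = s)
  val[z >= 0] <- 2 * (1 - q) * dnorm(z[z >= 0], 0, sd = s * (1 - q) / q)
  val
}

total_mass <- integrate(f, -Inf, Inf)$value                    # should be ~1
m <- integrate(function(z) z * f(z), -Inf, Inf)$value          # mean of the sharkfin
v <- integrate(function(z) (z - m)^2 * f(z), -Inf, Inf)$value  # implied variance, to compare against var_target
c(mass = total_mass, mean = m, variance = v)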
Finally, another fun distribution for the information available only to the auditors is a mixture distribution. The thinking is similar to the sharkfin: we mostly assume no additional confounding, but allow a small probability that the hidden information dooms a company to bankruptcy and an even smaller probability that it substantially reduces a company’s bankruptcy risk.13
13 The implied variance of the Gaussian mixture (with three components, as above) is given by \[\operatorname{Var}(U) \;=\; \sum_{k=1}^{3} w_k\left(\sigma_k^2 + \mu_k^2\right) \;-\; \left(\sum_{k=1}^{3} w_k \mu_k\right)^{2},\] where \(w_k\), \(\mu_k\), and \(\sigma_k^2\) are the component weights, means, and variances.
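To illustrate that formula, the sketch below defines a hypothetical three-component Gaussian mixture for \(f(u)\) (the weights, means, and standard deviations are placeholders, not the values used in the analysis) and checks the closed-form variance against numerical integration.

# A hypothetical 3-component Gaussian mixture for f(u); the weights, means,
# and standard deviations below are illustrative placeholders.
w  <- c(0.8, 0.15, 0.05)   # mixture weights (sum to 1)
mu <- c(0, 2, -2)          # component means
sd <- c(0.25, 0.5, 0.5)    # component standard deviations

f_mix <- function(z) {
  w[1] * dnorm(z, mu[1], sd[1]) +
  w[2] * dnorm(z, mu[2], sd[2]) +
  w[3] * dnorm(z, mu[3], sd[3])
}

# Closed-form implied variance of the mixture:
# Var(U) = sum_k w_k (sd_k^2 + mu_k^2) - (sum_k w_k mu_k)^2
var_closed <- sum(w * (sd^2 + mu^2)) - sum(w * mu)^2

# Numerical check by integration
m_num   <- integrate(function(z) z * f_mix(z), -Inf, Inf)$value
var_num <- integrate(function(z) (z - m_num)^2 * f_mix(z), -Inf, Inf)$value
c(closed_form = var_closed, numerical = var_num)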
We now look at posterior inducement effect estimates for a few hypothetical companies in our dataset. Of course, these are all made up, but this section illustrates how the analysis would proceed if the data were real. We specify a normal distribution for \(f(u)\) with mean \(0\) and standard deviation \(0.5\), which is conveniently the (unidentifiable) true distribution that we could never know in the wild. A sketch of how the per-company posteriors might be computed follows the list of companies below.
The five “companies” of interest, corresponding to the 4th, 8th, 10th, 16th, and 22nd rows of the dataset, are:
Hellenic mist: An architecture company.
Bellhop rouge: Creates vintage lighters. A niche Etsy shop.
Albuquerque Derecho: A sports franchise that recently debuted in the Q.
Muon breeze: A young company making hydrogen-powered, self-driving cars.
Blue be gone: A pioneering start-up in the field of blue-light glasses.
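The per-company objects plotted below (treat_company4, treat_company8, and so on) are built from the fitted model earlier in the analysis; since that step is not shown in this section, the sketch below is only one plausible way it could look, under the assumption that integrate_function can be applied to a single company’s row of intframe and returns posterior draws of B0 and B1 for that company.

# Hypothetical construction of the per-company posterior draws; the exact
# interface for single-company posteriors is an assumption, not shown here.
f_normal <- function(z) dnorm(z, mean = 0, sd = 0.5)   # the assumed f(u)
ncores   <- detectCores() - 1

company_ids <- c(4, 8, 10, 16, 22)
treat_by_company <- lapply(company_ids, function(i) {
  integrate_function(intframe[i, , drop = FALSE], constraint = TRUE,
                     f = f_normal, n_cores = ncores, lambda = 0)
})
names(treat_by_company) <- paste0("treat_company", company_ids)

# e.g. draws of the inducement effect for the 4th company:
# treat_by_company$treat_company4[, "B1"] - treat_by_company$treat_company4[, "B0"]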
options(warn = -1)
suppressMessages(library(tidyverse))
suppressMessages(library(ggtext))
suppressMessages(library(ggdist))
suppressMessages(library(glue))
suppressMessages(library(patchwork))

df_plot = data.frame(
  word = c(rep('Hellenic mist', num_mcmc), rep('Bellhop rouge', num_mcmc),
           rep('Blue be gone', num_mcmc), rep('Muon breeze', num_mcmc),
           rep('Albuquerque Derechos', num_mcmc)),
  price = c(data.frame(treat_company10)$B1 - data.frame(treat_company10)$B0,
            data.frame(treat_company4)$B1 - data.frame(treat_company4)$B0,
            data.frame(treat_company8)$B1 - data.frame(treat_company8)$B0,
            data.frame(treat_company22)$B1 - data.frame(treat_company22)$B0,
            data.frame(treat_company16)$B1 - data.frame(treat_company16)$B0)
)

font_family = 'roboto'
bg_color = '#f8f9fa'

p <- df_plot %>%
  ggplot(aes(forcats::fct_relevel(word, sort), price)) +
  stat_slab(alpha = 0.63, normalize = 'groups', outline_bars = T,
            fill = '#073d6d', breaks = 50, scale = 0.75, expand = F) +
  stat_interval() +
  stat_summary(geom = "point", fun = mean, color = '#3CF0C2', size = 1.25) +
  scale_x_discrete(labels = toupper) +
  scale_color_manual(values = MetBrewer::met.brewer("Cassatt2")) +
  coord_flip(clip = "off") +
  guides(col = "none") +
  labs(
    title = toupper("Inducement effects"),
    subtitle = 'Simulated data and fake company names. Meant to mimic how an actual analysis would proceed.',
    caption = "Designed from r-graph-gallery.com/web-ridgeline-plot-with-inside-plot-and-annotations.html <br> Visualization: Ansgar Wolsing",
    x = NULL,
    y = "Causal inducement impact (difference)"
  ) +
  theme_minimal(base_family = font_family) +
  theme(
    plot.background = element_rect(color = NA, fill = bg_color),
    panel.grid = element_blank(),
    panel.grid.major.x = element_line(linewidth = 0.16, color = "grey75"),
    plot.title = element_text(family = "Fira Sans SemiBold", margin = margin(t = 10, b = 10, r = 4, l = 4), size = 14),
    plot.title.position = "plot",
    plot.subtitle = element_textbox_simple(margin = margin(t = 4, b = 16, r = 4, l = 4), size = 12),
    plot.caption = element_textbox_simple(margin = margin(t = 12), size = 9),
    plot.caption.position = "plot",
    axis.text.y = element_text(hjust = 0, margin = margin(r = 0), family = "Roboto", size = 10),
    axis.text.x = element_text(hjust = 0, margin = margin(r = 0), family = "Roboto", size = 10),
    plot.margin = margin(1, 1, 1, 1)
  )

p
Figure 4: Posterior distribution of inducement effects across various hypothetical firms
Figure 4 shows some variation across firms in the distribution of inducement effects. It also makes visible the bounds on what the causal inducement effect can be: one bound is self-imposed (the monotonicity constraint that the effects cannot be negative), while the other follows from the fact that probabilities cannot exceed \(1\), so each firm’s inducement effect is also bounded above by \(1\) minus its prognostic probability.
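To make that second bound concrete, here is a small check on the posterior summaries; it assumes treat_frame holds columns B0 and B1 on the probability scale, as in the code above.

# The inducement effect B1 - B0 can never exceed 1 - B0, since B1 <= 1.
tf <- as.data.frame(treat_frame)
effect      <- tf$B1 - tf$B0
upper_bound <- 1 - tf$B0             # 1 minus the prognostic probability
all(effect <= upper_bound + 1e-12)   # should be TRUE
summary(upper_bound - effect)        # slack between each effect and its ceiling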
Summary
The bankruptcy problem is particularly compelling for a number of reasons.
The self-fulfilling prophecy aspect of forecasting bankruptcy makes for particularly good kitchen table banter, regardless of anyone’s advanced degree credentials.
The outcome being binary makes sense! Turning a continuous outcome into a classification problem has long been a thorn in the side of many statisticians; it is usually both unnecessary and counterproductive. In this example, however, bankruptcy genuinely is a yes or a no, which opens up the analysis presented here.
While not discussed here, other methods that try to model how much unobserved confounding could plague a causal analysis are pretty limited in scope.
Finally, the application is a clear example where we strongly suspect the ignorability condition that is often invoked in causal studies to be violated.
Other examples
Binary outcome and treatment data where you strongly suspect some sort of unobserved confounding is admittedly a fairly rare scenario. However, one example (with multiple applications) immediately comes to mind.
The effect of being an underdog on winning events. The screengrab below is an example at the all-star break of the 2021 MLB season.
There is also an argument that a similar phenomenon plays out in elections: a favorite in a state may see their vote share undersold, as hesitant voters do not feel the need to turn out if they think their candidate will win anyway.
From the now-defunct FiveThirtyEight website.
Bonus
If you want to check that our model is really a generalized bivariate probit, use the following code to generate data from a bivariate probit, then run the methods outlined in this document with \(f(u)\sim N\!\left(0, \frac{\rho}{1-\rho}\right)\).
R code to simulate data from a bivariate probit model
set.seed(0)
N <- 1000   # Number of random samples

# Simulate covariates
a = 1
x1 = runif(N, -a, a)
x2 = runif(N, -a, a)
x3 = runif(N, -a, a)
x4 = runif(N, -a, a)
x5 = runif(N, -a, a)

# Simulate parameters of the mean term
beta1 = -0.2
alpha1 = 0.7
beta0 = -0
alpha0 = -0.5
mu1 <- beta0 + beta1*(x1 + x2 + x3 + x4 + x5)
mu2 <- alpha0 + alpha1*(x1 + x2 + x3 + x4 + x5)
mu <- matrix(c(mu1, mu2), nrow = N, ncol = 2)

# Simulate causal effect parameters and confounding level
rho = .5
gamma = 1

# Calculate true treatment effects
B1.true = pnorm(mu2 + gamma)
B0.true = pnorm(mu2)

# Generate covariance matrix and simulate from the bivariate probit
sigma <- matrix(c(1, rho, rho, 1), 2)   # Covariance matrix
sim_data = t(sapply(1:N, function(i) MASS::mvrnorm(1, mu = mu[i, ], Sigma = sigma)))

# Generate the binary treatments and outcomes
G = sapply(1:N, function(i) ifelse(sim_data[i, 1] >= 0, 1, 0))
B = sapply(1:N, function(i) ifelse(sim_data[i, 2] >= -1*gamma*G[i], 1, 0))

# Check how many of each we have
print(table(G, B))
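To then run the sensitivity analysis on these simulated data, the only piece that changes is the assumed distribution of the unobservable, taking \(\rho/(1-\rho)\) as its variance. A sketch of that step is below; rebuilding intframe from the simulated \((x, G, B)\) follows the same fitting steps used earlier in the document and is not repeated here.

# f(u) implied by a bivariate probit with latent correlation rho:
# u ~ N(0, rho / (1 - rho)), so the standard deviation is sqrt(rho / (1 - rho)).
rho   <- 0.5
f_bvp <- function(z) dnorm(z, mean = 0, sd = sqrt(rho / (1 - rho)))

# After refitting the outcome/treatment models on (x1..x5, G, B) and
# rebuilding intframe as earlier in the document:
# treat_frame_bvp <- integrate_function(intframe, constraint = TRUE,
#                                       f = f_bvp, n_cores = ncores, lambda = 0)
# Comparing treat_frame_bvp[, "B1"] - treat_frame_bvp[, "B0"] to B1.true - B0.true
# should approximately recover the bivariate probit's treatment effects.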
Finally, if you are interested in further reading, please check out (Papakostas et al. 2023). Some things they discuss that did not make it into this document:
A thorough simulation study of the accuracy of the proposed methodology under misspecification of the distribution of \(U\). As expected (and desired), if the distribution we choose for \(U\) is totally wrong, accuracy suffers; if we are in the right ballpark, the method does a pretty good job.
The paper also compares simulation results to the bivariate probit model. If the data are simulated according to the bivariate probit generating process, both methods are similarly accurate. Outside this narrow simulation scheme, however, the bivariate probit does not return trustworthy results, whereas the methods presented in this document remain fairly robust across different data generating processes.
A comparison with E-values (Peng and VanderWeele 2016), a method that also studies the strength of causal estimates in the presence of unobserved confounding.
The paper explores the impact of different distributions of \(U\) with equivalent variances on the dissipation of the inducement effect, tested against the base case of no hidden confounding on the actual bankruptcy data. Interestingly, different shapes of \(f(u)\), even with equivalent variances, make the estimated causal effect dissipate at different rates.
Finally, following in the footsteps of subgroup analyses presented by (Woody, Hahn, and Murray 2020) and (P. R. Hahn, Murray, and Carvalho 2020), a decision tree is fit to inducement effects to help summarize where we see larger and smaller inducement effects.
Interestingly, I asked ChatGPT to summarize this paper as a test of the chatbot’s capabilities… and it actually did a pretty good job! It connected different aspects of the model together. While a lot of the summary was lifted straight from the paper itself, the explanations it gave upon further questioning were also solid and coherent. This is certainly not always (often?) true, but summarizing applied statistical methodology papers (according to this sample size of one testimonial) seems to be a compelling chatbot use case.
References
Abadie, A., A. Diamond, and J. Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505. https://doi.org/10.1198/jasa.2009.ap08746.
Abadie, A., and J. Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93 (1): 113–32.
Albert, James H., and Siddhartha Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88 (422): 669–79.
Card, D., and A. B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” American Economic Review 84 (4): 772–93. https://www.jstor.org/stable/2118030.
Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2012. “BART: Bayesian Additive Regression Trees.” Annals of Applied Statistics 6 (1): 266–98.
Hahn, P. Richard, Jingyu He, and Hedibert F. Lopes. 2019. “Efficient Sampling for Gaussian Linear Regression with Arbitrary Priors.” Journal of Computational and Graphical Statistics 28 (1): 142–54.
Hahn, P. Richard, Jared S. Murray, and Carlos M. Carvalho. 2020. “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion).” Bayesian Analysis 15 (3): 965–1056.
Herren, Drew, Richard Hahn, Jared Murray, Carlos Carvalho, and Jingyu He. 2025. Stochtree: Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference. https://stochtree.ai/.
Holland, P. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60.
Papakostas, Demetrios, P. Richard Hahn, Jared Murray, Frank Zhou, and Joseph Gerakos. 2023. “Do Forecasts of Bankruptcy Cause Bankruptcy? A Machine Learning Sensitivity Analysis.” The Annals of Applied Statistics 17 (1): 711–39.
Rubin, D. B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701.
Thistlethwaite, D., and D. Campbell. 1960. “Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment.” Journal of Educational Psychology 51 (6): 309–17. https://doi.org/10.1037/h0044319.
Woody, S., C. M. Carvalho, P. R. Hahn, and J. S. Murray. 2020. “Estimating Heterogeneous Effects of Continuous Exposures Using Bayesian Tree Ensembles: Revisiting the Impact of Abortion Rates on Crime.” arXiv preprint.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.