We consider several alternative linear regression models for optimizing the NIJFM. All linear models will be of the form,

$$\begin{aligned} p_i=X^t_i\theta , \end{aligned}$$

(3)

where \(X^t_i\) is a covariate vector for individual *i* and \(\theta \) is the vector of coefficients that we are optimizing. The models therefore differ only in how \(\theta \) is estimated or in how the scores \(p_i\) are post-processed.

The first approach is ordinary linear regression, where the coefficient vector that minimizes the MSE solves the linear system:

$$\begin{aligned} \frac{2}{N}X^t X\theta -\frac{2}{N}X^ty=0. \end{aligned}$$

(4)
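
As an illustration, Eq. (4) can be solved directly with a linear solver. The following is a minimal sketch in Python with NumPy, not the authors' implementation; the function name `fit_linear_regression` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve X^t X theta = X^t y, which is Eq. (4) with the common factor 2/N cancelled."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic stand-in data (not the NIJ data): an intercept plus two covariates.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = (rng.uniform(size=N) < 0.3).astype(float)  # binary outcome

theta = fit_linear_regression(X, y)
p = X @ theta  # scores p_i = X_i^t theta, as in Eq. (3)

# The left-hand side of Eq. (4), which should vanish at the solution.
gradient = (2.0 / N) * (X.T @ X @ theta) - (2.0 / N) * (X.T @ y)
```

Solving the system with `np.linalg.solve` avoids forming an explicit matrix inverse; for ill-conditioned design matrices, `np.linalg.lstsq` may be the more robust choice.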

We also consider two variants: a balanced version of linear regression, in which observations are re-weighted so that each racial group contributes equally to the MSE, and linear regressions estimated separately on each racial group. We refer to these three methods as linear reg., linear reg. (balanced), and linear reg. (group), respectively.
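
The two variants can be sketched as follows. The exact weighting scheme is not specified in the text, so this example assumes that each observation is weighted by the inverse of its group's size, which gives every group the same total weight. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def group_weights(group):
    """ASSUMED weighting: 1 / (size of the observation's group)."""
    w = np.empty(len(group), dtype=float)
    for g in np.unique(group):
        mask = group == g
        w[mask] = 1.0 / mask.sum()
    return w

def fit_balanced(X, y, group):
    """Weighted least squares: solve X^t W X theta = X^t W y with W = diag(weights)."""
    Xw = X * group_weights(group)[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def fit_by_group(X, y, group):
    """Fit one ordinary least-squares model per group; returns {group label: theta}."""
    return {
        int(g): np.linalg.lstsq(X[group == g], y[group == g], rcond=None)[0]
        for g in np.unique(group)
    }

# Synthetic stand-in data with unequal group sizes (not the NIJ data).
rng = np.random.default_rng(1)
N = 300
group = (rng.uniform(size=N) < 0.25).astype(int)
X = np.column_stack([np.ones(N), rng.normal(size=N) + 0.5 * group])
y = (rng.uniform(size=N) < 0.3 + 0.2 * group).astype(float)

theta_balanced = fit_balanced(X, y, group)
theta_groups = fit_by_group(X, y, group)
```

With the group-specific fits, each individual would be scored using the coefficient vector of their own group.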

The next method, outlined in (Yahav and Katrina 2017; Richard et al. 2017), considers a convex surrogate loss where the step function representing the decision at the cutoff is replaced by a linear approximation (simply the score itself):

$$\begin{aligned} \mathrm{MSE}+\lambda \bigg (\sum _{X_i\in S_{00}}\frac{X_i^t\theta }{|S_{00}|}-\sum _{X_i\in S_{10}}\frac{X_i^t\theta }{|S_{10}|}\bigg )^2. \end{aligned}$$

(5)

Here \(S_{00}\) is the set of individuals of race 0 that did not recidivate (\(y_i=0\)) and \(S_{10}\) is the set of individuals of race 1 that did not recidivate. As \(\lambda \) increases, the penalty term encourages the average score over the negative class (\(y_i=0\)) to match across race. This is a form of group fairness in which we want false positive rates to match across groups; alternatively, individual fairness can be defined by bringing the summation outside of the squared term (Richard et al. 2017). Because the loss function in Eq. (5) is quadratic in \(\theta \), its minimizer has an analytical solution determined by the linear system:

$$\begin{aligned} \bigg [\frac{2}{N}X^t X+2\lambda (V_0^t V_0-V_0^t V_1-V_1^tV_0+V_1^tV_1) \bigg ]\theta -\frac{2}{N}X^ty=0, \end{aligned}$$

(6)

where \(V_j=\sum _{X_i\in S_{j0}}\frac{X_i^t}{|S_{j0}|}\). We select \(\lambda \) by choosing the value that yields the best NIJFM score on the training data. We refer to this method as the convex surrogate method.
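
A minimal sketch of solving Eq. (6) is given below. It uses the identity \(V_0^t V_0-V_0^t V_1-V_1^tV_0+V_1^tV_1=(V_0-V_1)^t(V_0-V_1)\), so the penalty contributes a single rank-one matrix. The function names and the synthetic data are assumptions made for illustration, and the selection of \(\lambda \) is left out.

```python
import numpy as np

def negative_class_means(X, y, race):
    """V_0 and V_1: mean covariate vectors over non-recidivists (y = 0) of each race."""
    V0 = X[(race == 0) & (y == 0)].mean(axis=0)
    V1 = X[(race == 1) & (y == 0)].mean(axis=0)
    return V0, V1

def fit_convex_surrogate(X, y, race, lam):
    """Solve the linear system of Eq. (6) for theta at a given penalty weight lam."""
    N = len(y)
    V0, V1 = negative_class_means(X, y, race)
    d = V0 - V1
    A = (2.0 / N) * (X.T @ X) + 2.0 * lam * np.outer(d, d)
    b = (2.0 / N) * (X.T @ y)
    return np.linalg.solve(A, b)

def mean_score_gap(X, y, race, theta):
    """Absolute difference in average score between the two negative classes."""
    V0, V1 = negative_class_means(X, y, race)
    return abs((V0 - V1) @ theta)

# Synthetic stand-in data in which one covariate is shifted by race (not the NIJ data).
rng = np.random.default_rng(2)
N = 400
race = rng.integers(0, 2, size=N)
x1 = rng.normal(size=N) + 1.0 * race
X = np.column_stack([np.ones(N), x1])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(x1 - 1.0)))).astype(float)

theta_unpenalized = fit_convex_surrogate(X, y, race, lam=0.0)
theta_penalized = fit_convex_surrogate(X, y, race, lam=1000.0)
```

Setting `lam=0.0` recovers ordinary linear regression, while a large value drives the gap in average scores between the two negative classes toward zero at some cost in MSE.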

Fairness can also be encouraged by post-processing the scores (Dennis et al. 2020). Here we use a simple shrinkage method: for Black individuals^{Footnote 2} with scores at or above the decision cutoff (0.5 for the NIJ competition), we subtract a constant \(\epsilon \) from the score and then take the maximum of that value and the cutoff minus 0.0001.^{Footnote 3} We choose the value of \(\epsilon \) that optimizes the NIJFM on the training data, and we refer to this method as linear regression with shrinkage. A second, even simpler, post-processing technique forces the false positive rates to zero by setting every score at or above the cutoff to the cutoff value minus 0.0001. We refer to this method as linear regression with truncation.
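
Both post-processing rules reduce to a few array operations. The sketch below is one reading of the description above, with hypothetical function names and a toy score vector; the search over \(\epsilon \) is omitted.

```python
import numpy as np

CUTOFF = 0.5   # decision cutoff used in the NIJ competition
MARGIN = 1e-4  # offset that places a score just below the cutoff

def shrink_scores(p, is_black, eps, cutoff=CUTOFF, margin=MARGIN):
    """Subtract eps from scores of Black individuals at or above the cutoff,
    then floor the result at cutoff - margin."""
    p = np.asarray(p, dtype=float)
    is_black = np.asarray(is_black, dtype=bool)
    out = p.copy()
    target = is_black & (p >= cutoff)
    out[target] = np.maximum(p[target] - eps, cutoff - margin)
    return out

def truncate_scores(p, cutoff=CUTOFF, margin=MARGIN):
    """Set every score at or above the cutoff to cutoff - margin."""
    p = np.asarray(p, dtype=float)
    return np.where(p >= cutoff, cutoff - margin, p)

# Toy scores for illustration only.
scores = np.array([0.30, 0.52, 0.70, 0.55])
black = np.array([True, True, True, False])

shrunk = shrink_scores(scores, black, eps=0.1)
truncated = truncate_scores(scores)
```

Under shrinkage, a score within \(\epsilon \) of the cutoff moves just below it, while a higher score is lowered but remains a positive prediction. Under truncation, no individual is predicted positive, so no false positives remain in either group.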

Although the step function representing the decision boundary makes the NIJFM discontinuous and non-differentiable, one can still attempt to optimize it with general-purpose optimization software. We find that the optim function in the R stats package works reasonably well at optimizing the NIJFM when given the linear regression coefficients as an initial guess and run with the “BFGS” method, a quasi-Newton method that uses finite-difference approximations for the derivatives. We refer to this method as BFGS.
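
The same idea can be sketched in Python, substituting SciPy's `minimize` with `method="BFGS"` for the R routine used in the paper. The metric below assumes the form (1 − Brier score) × (1 − |difference in false positive rates|), which should be checked against the definition of the NIJFM given earlier; the function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def nijfm(p, y, race, cutoff=0.5):
    """ASSUMED metric form: (1 - Brier score) * (1 - |FPR_0 - FPR_1|)."""
    brier = np.mean((p - y) ** 2)
    fpr = [np.mean(p[(race == g) & (y == 0)] >= cutoff) for g in (0, 1)]
    return (1.0 - brier) * (1.0 - abs(fpr[0] - fpr[1]))

def fit_bfgs(X, y, race, theta0):
    """Maximize the metric with BFGS, using finite-difference gradients."""
    def objective(theta):
        return -nijfm(X @ theta, y, race)
    result = minimize(objective, theta0, method="BFGS")
    # The objective is discontinuous, so fall back to the initial guess
    # if the search did not improve on it.
    return result.x if result.fun <= objective(theta0) else theta0

# Synthetic stand-in data (not the NIJ data).
rng = np.random.default_rng(3)
N = 400
race = rng.integers(0, 2, size=N)
x1 = rng.normal(size=N) + 1.0 * race
X = np.column_stack([np.ones(N), x1])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-x1))).astype(float)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # initial guess
theta_bfgs = fit_bfgs(X, y, race, theta_ols)
```

Because the false positive rate term is piecewise constant, its finite-difference gradient is zero almost everywhere, so the search is driven mostly by the smooth Brier score term. This is one reason the quality of the initial guess matters.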

Finally, we implemented a logistic regression, to investigate the effect of using a binomial rather than a Gaussian likelihood, and a gradient boosting model (xgboost (Tianqi and Carlos 2016)), to explore the performance of a nonlinear machine learning approach. Hyper-parameters for the convex surrogate, shrinkage, and xgboost models are tuned using a grid search on the training data, and model performance is evaluated on held-out test data. The code to reproduce the results is available on GitHub.^{Footnote 4}