Using matching methods to account for selection bias in Norway’s Primary Care Teams (PCT) pilot

Norway is piloting team-based primary care delivery models: Honorarmodellen (HM) and Driftstilskuddsmodellen (DM). In addition to organisational changes, the DM transforms provider payment, which seems to attract specific practices. This, coupled with the small number of DM practices, makes it difficult to produce credible evidence regarding the model and its effects on health system performance. I examine whether matching methods (specifically, coarsened exact matching, propensity score matching, and propensity score weighting) can improve evaluation in this demanding situation. As in previous studies on the small sample performance of matching methods, I find no clear best method. This suggests using propensity score weighting, which does not discard data. In the final section of the article, I offer additional advice to help improve the evaluation in similar situations.
JEL classification: I10, I18


Introduction
Efficient healthcare systems depend on good primary care. Good primary care requires the right organisation. The Organisation for Economic Co-operation and Development advocates multiprofessional or team-based care (OECD, 2020). Traditionally, the individual general practitioner (GP) assumes sole responsibility for a patient's care. Team-based care redistributes the responsibility among the members of a practice team. The aim is more comprehensive and coordinated care for patients with complex care needs.
Among the Nordic countries, team-based care can be observed in Sweden and Finland, and Norway and Denmark may follow suit. First, the Nordic countries all face the same overall healthcare challenges and may look to each other for solutions (Krasnik and Paulsen, 2009; Olsen et al., 2016). Second, patient and provider surveys suggest a positive outlook towards team-based care in both Norway and Denmark (EY and Vista Analyse, 2019; OECD, 2017).
To date, studies on the effects of team-based care on health system performance have largely been focused on Canada (Glazier et al., 2015; Strumpf et al., 2017a), France (Mousquès and Bourgueil, 2014), and the United States (Friedberg et al., 2015; Jones et al., 2016; Mahmud et al., 2018; Pines et al., 2015; Rosenthal et al., 2016; van Hasselt et al., 2015). While the French and North American experience demonstrates that team-based care can improve health system performance, it is less informative about how this type of primary care delivery model should be implemented elsewhere. To guide implementation, there may be local trials in which practices volunteer to test team-based care. Since volunteers might differ from those that opt out, trials are susceptible to selection bias. To account for selection bias, the French and North American evaluations use matching methods to construct a control group that resembles the treated group at baseline, and difference-in-differences (DD) methods to compare the groups.
The Nordic countries provide an attractive context for this strategy owing to their national registries, which meet the data requirements both for constructing credible control groups and for conducting DD analysis. The challenge in Nordic countries, however, is that funders lack the resources to include an adequate number of practices. This might render matching methods unfit, as they were developed under the assumption of an adequate sample size. Moreover, Cenzer et al. (2020) note that the small sample performance of matching methods has scarcely been studied, and that current evidence based on simulation studies suggests no clear best method. The present article complements the literature with a novel comparison of three common matching methods using a real-world example: Norway's Primary Care Teams (PCT) pilot. The methods considered herein are coarsened exact matching (CEM), propensity score matching (PSM), and propensity score weighting (PSW). However, the main contribution is new insight into the type of practices that team-based primary care delivery models might attract.

Norway's PCT pilot
The last comprehensive reform to Norwegian primary care was the regular GP scheme in 2001, which introduced a list patient system. While this system has enabled GPs to fulfil the primary care function for many years, increased GP workload has raised concerns about the quality of primary care (Svedahl et al., 2019). Therefore, Norway is piloting two team-based primary care delivery models: Honorarmodellen (HM) and Driftstilskuddsmodellen (DM). All practices shared by at least three GPs were eligible to apply, but only 13 applied, all of which were admitted. These employed nurses, and GPs, nurses and medical secretaries formed GP-led care teams.
The regular GP scheme also introduced a default remuneration system that blends capitation (30% of payment) and fee-for-service (ffs) (70% of payment). The HM maintains the traditional remuneration system, but to cover the cost of employing nurses, GPs receive government funding as well as ffs for the services that the nurses perform. In the DM, only 20% of the payment is ffs, while 80% is a practice allowance. The allowance is based on average GP remuneration, adjusted in particular for the cost of employing nurses and the age-sex distribution of list patients. Lastly, some GPs are employed by their local municipality for a fixed salary.

Data source and sample selection
To describe practices at baseline, I combine data from several Norwegian registries for three years prior to the announcement of the PCT pilot. Data on practices and GPs come from the Norwegian Health Economics Administration. I retain practices shared by at least three GPs that were active throughout the baseline period. These data include GP identifiers, which allow me to link practice and GP information to the GP database. The GP database provides the link between GPs and patients, and patients are uniquely identified by a personal identification number. I use patients' personal identification numbers to retrieve information on year of birth, sex, and socioeconomic background from Statistics Norway's database, and on healthcare utilisation and chronic conditions from the Primary Care Registry of Norway and the Norwegian Patient Registry. The sample consists of five DM practices, eight HM practices and 817 traditional practices. Together, they serve about 75% of the Norwegian population.

Descriptive statistics
A list of observed characteristics, along with features of their distribution stratified by model, is shown in Table 1.
Notes to Table 1: The table reports the sample mean of observed baseline characteristics for all Norwegian general practices shared by at least three GPs that were active throughout the baseline period. The samples in Columns 2 and 3 include practices that participate in the PCT pilot with the DM and HM, respectively. Column 4 reports the same characteristics for practices outside the PCT pilot. Standard deviations in parentheses. (a) Expected income gain is the relative difference between observed and counterfactual income conditional on joining the DM. (b) Centrality ranks municipalities from most to least central on a scale of 1-4, that is, practices with a centrality of 1 are located in urban regions. (c) Experience is the average number of years the GPs have been working in the local municipality. (d) Disability benefits is the annual percentage of list patients who receive disability benefits. (e) Sick leave is the annual percentage of list patients on sick leave. (f) Healthcare utilisation is the annual number of services per list patient. (g) A specific chronic condition is the percentage of list patients with at least one GP service related to that condition. Conditions reflect PCT pilot target patients.
According to these summary statistics, DM practices are distinct in terms of practice and GP characteristics, whereas HM practices resemble traditional practices. This is possibly because the DM attracts salaried GPs who avoid the financial risk associated with changes to provider payment. Salaried GPs work in public practices, which typically possess distinctive characteristics.
Expected income gain, the relative difference between observed and counterfactual income conditional on joining the DM, is of particular interest. According to this measure, the DM is more profitable for DM practices than traditional practices. This is despite the low share of elderly patients in DM practices, which negatively affects the practice allowance. One plausible explanation is that the remuneration of salaried GPs, who submit ffs claims on behalf of their employer, is well below average GP remuneration. If the practice allowance is closer to the average GP remuneration, salaried GPs gain from joining the DM. Table 1 shows that the annual number of GP services is 5.2 in DM practices and 5.9 in traditional practices, lending support to this explanation.
Given that matching methods are most useful in imbalanced data, and may be harmful in balanced data (King and Nielsen, 2019), I will focus on constructing a control group for the DM group. I will also refer to traditional practices as non-DM practices.

Matching methods
The term matching refers to equating groups in terms of observed baseline characteristics to reduce bias due to confounding. Ideally, for each DM practice, one would find one or more non-DM practices with identical values on all covariates in Table 1, against which treatment effects could be assessed. However, exact matching is not possible because the covariate vector $X$ is high dimensional, and most of the covariates are continuous. This problem motivates inexact matching methods, which find or upweight non-DM practices whose $X$ is similar to that of DM practices. A common balance check is the absolute standardised mean difference (ASMD), which refers to the absolute mean difference between the DM and non-DM groups divided by the standard deviation in the DM group, $|\bar{X}_{\mathrm{DM}} - \bar{X}_{\text{non-DM}}| / s_{\mathrm{DM}}$. An ASMD less than 0.2 is usually considered negligible (Austin, 2009). However, because matching is implemented iteratively, I need a single summary statistic, rather than the ASMD for each covariate, to determine easily whether new iterations improve balance. Therefore, I use the sum of the ASMDs over the covariates $k$ as a balance check, $\sum_{k} |\bar{X}_{k,\mathrm{DM}} - \bar{X}_{k,\text{non-DM}}| / s_{k,\mathrm{DM}}$.
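The balance check described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function names, the use of the sample standard deviation, and the toy values are all assumptions made for the example.

```python
from statistics import mean, stdev


def asmd(treated, control, weights=None):
    """Absolute standardised mean difference for one covariate.

    The difference in means is divided by the (sample) standard deviation
    in the treated group. `weights` optionally reweights the control group.
    """
    if weights is None:
        control_mean = mean(control)
    else:
        control_mean = sum(w * x for w, x in zip(weights, control)) / sum(weights)
    return abs(mean(treated) - control_mean) / stdev(treated)


def sum_asmd(treated_cols, control_cols, weights=None):
    """Sum of ASMDs over covariates (dicts mapping name -> list of values)."""
    return sum(asmd(treated_cols[k], control_cols[k], weights) for k in treated_cols)


# Hypothetical toy data, not taken from the article.
dm = {"experience": [4.0, 6.0, 8.0], "gp_services": [5.0, 5.5, 6.0]}
non_dm = {"experience": [10.0, 12.0, 8.0, 6.0], "gp_services": [6.0, 6.5, 5.5, 5.0]}

print(sum_asmd(dm, non_dm))                        # unadjusted comparison
print(sum_asmd(dm, non_dm, weights=[0, 0, 1, 1]))  # keep only two controls
```

A lower sum after matching or weighting indicates better overall mean balance, which is the criterion used to compare iterations.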

Coarsened exact matching
Exact matching on continuous covariates is not possible because there are few observations at each covariate value. CEM temporarily coarsens continuous covariates, making them discrete, and thus ensures more observations at each coarsened value. DM and non-DM practices with identical values for categorical or coarsened continuous covariates are matched (Iacus et al., 2012). I implement CEM based on a forward selection procedure. That is, I begin with no covariates in the CEM and test the addition of each covariate for different levels of coarsening using the balance check. Specifically, I coarsen continuous covariates using between 1 and 10 evenly spaced cut points, and categorical covariates are left as is. I add the covariate whose inclusion improves balance the most. This process is repeated until no further improvements can be made.
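The coarsening and exact-matching step might look roughly like the following Python sketch. It assumes evenly spaced cut points between a stated minimum and maximum; the helper names, the covariate ranges, and the toy practices are hypothetical, and the forward selection loop is omitted.

```python
from collections import defaultdict


def coarsen(value, lo, hi, n_cuts):
    """Map a continuous value to a bin index using n_cuts evenly spaced cut points."""
    width = (hi - lo) / (n_cuts + 1)
    index = int((value - lo) / width)
    return min(max(index, 0), n_cuts)  # clamp so that value == hi lands in the last bin


def cem(units, treated_flag, spec):
    """Return the strata that contain at least one treated and one untreated unit.

    units: list of dicts; treated_flag: key holding True/False;
    spec: covariate name -> (lo, hi, n_cuts) for continuous covariates,
          or None for categorical covariates, which are left as is.
    """
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for i, unit in enumerate(units):
        key = tuple(
            unit[name] if rule is None else coarsen(unit[name], *rule)
            for name, rule in spec.items()
        )
        strata[key]["treated" if unit[treated_flag] else "control"].append(i)
    return {k: s for k, s in strata.items() if s["treated"] and s["control"]}


# Hypothetical practices, not taken from the article.
units = [
    {"dm": True, "centrality": 1, "experience": 5.0},
    {"dm": False, "centrality": 1, "experience": 7.0},
    {"dm": False, "centrality": 1, "experience": 18.0},
    {"dm": False, "centrality": 4, "experience": 6.0},
]
spec = {"centrality": None, "experience": (0.0, 30.0, 2)}  # 2 cut points, 3 bins

matched = cem(units, "dm", spec)
print(matched)
```

In a forward selection procedure, one would call such a function repeatedly with candidate covariates and coarsening levels, keeping the specification that lowers the balance statistic the most.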

Propensity score methods
The propensity score, $e(X)$, is the predicted probability that a practice joins the DM given the covariates $X$. Rosenbaum and Rubin (1983) show that if selection bias can be accounted for by controlling for $X$, it can also be accounted for by controlling for $e(X)$, thus solving the problem of $X$ being high dimensional. The problem then becomes estimating $e(X)$.

Propensity score matching
Even when matching on $e(X)$ alone, exact matching is not possible, so DM and non-DM practices with similar propensity scores are matched. To predict $e(X)$, I use logistic regression. I develop the model by progressively adding the covariates that result in the largest improvement in balance in the propensity score-matched sample. I also consider adding derived functions of covariates already in the model, such as interactions and quadratic terms. When matching on propensity scores, two issues are relevant: (1) the number of matched non-DM practices, and (2) whether multiple DM practices can be matched to the same non-DM practice. I let the number of matched non-DM practices vary between 1 and 10 and allow multiple DM practices to be matched to the same non-DM practice.

Propensity score weighting
In this technique, non-DM practices are weighted by the odds of joining the DM given $X$, $w = e(X)/(1 - e(X))$, that is, the propensity score multiplied by the inverse probability of being in the non-DM group. As a result, non-DM practices that resemble the DM group and differ from their own group receive larger weights. The prediction of $e(X)$ for weighting is analogous to the prediction of $e(X)$ for matching.
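To make the two propensity score techniques concrete, the sketch below takes estimated propensity scores as given (the logistic regression step is assumed to have been run already) and shows nearest-neighbour matching with replacement alongside the weights $e/(1 - e)$. The function names and the score values are hypothetical.

```python
def nearest_controls(treated_scores, control_scores, k):
    """For each treated unit, return the indices of the k closest controls.

    Matching is with replacement: the same control may be matched to
    several treated units.
    """
    matches = []
    for e_t in treated_scores:
        ranked = sorted(
            range(len(control_scores)),
            key=lambda j: abs(control_scores[j] - e_t),
        )
        matches.append(ranked[:k])
    return matches


def odds_weights(control_scores):
    """Weight each control unit by e / (1 - e)."""
    return [e / (1.0 - e) for e in control_scores]


# Hypothetical estimated propensity scores, not taken from the article.
dm_scores = [0.30, 0.10]
non_dm_scores = [0.05, 0.12, 0.28, 0.50]

matches = nearest_controls(dm_scores, non_dm_scores, k=2)
weights = odds_weights(non_dm_scores)
print(matches)
print(weights)
```

Matching keeps only the selected controls, whereas weighting retains every control with a valid score and simply gives more influence to those that look like DM practices.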

Results
Using the procedures outlined above, I implement each method to maximise the balance between the DM and non-DM groups. Specifically, to implement CEM, I match DM and non-DM practices with identical values for coarsened centrality, experience, number of hospital admissions, and prevalence of alcohol abuse, COPD, and type 2 diabetes. To implement PSM, I use logistic regression to estimate $e(X)$ by modelling the main effects of centrality, experience, share of female GPs, and number of GP services. I match each DM practice to the three nearest non-DM practices in terms of estimated propensity scores. To implement PSW, I use logistic regression to estimate $e(X)$ by modelling the main effects of centrality, experience, share of female GPs, number of GP services and hospital admissions, patient age groups and income quartiles, and prevalence of substance abuse. I weight non-DM practices using estimated propensity scores. It should be stated that all choices regarding implementation add uncertainty to the treatment effect estimates. However, this is usually not seen as a major issue because the correct choices are those that maximise balance (Ho et al., 2007). Columns 4-6 in Table 2 show the ASMDs between the DM group and the CEM, PSM, and PSW control groups, respectively. For reference, the DM group is compared with the non-DM group in the original data in Column 3. The sum of the ASMDs is shown at the end of Table 2. According to this summary statistic, all methods substantially improve the overall mean balance. However, CEM and PSW slightly outperform PSM.
Notes to Table 2: Column 2 reports the sample mean of observed baseline characteristics for practices that participate in the PCT pilot with the DM. Standard deviations in parentheses. Columns 4-6 report the absolute standardised mean difference (ASMD) between the DM group and the coarsened exact matching (CEM), propensity score matching (PSM), and propensity score weighting (PSW) control groups, respectively. For reference, the DM group is compared to the non-DM group in the original data in Column 3. The PSW control group only has 367 observations. This is because no DM practice has a centrality of 2 or 3, causing non-DM practices with a centrality of 2 or 3 to be dropped when I include this covariate as a predictor in the logistic regression model.
The results are encouraging, but to adjust completely for confounding, the entire distribution of covariates must be balanced, not just the means. Although techniques for comparing covariate distributions are available, they are not applicable because of the size of the DM group. Instead, it is useful to investigate the crucial identifying assumption of DD; that is, that the DM and non-DM groups shared a common trend in outcome before the PCT pilot was announced. The degree to which this assumption holds can be assessed graphically. To this end, Figures 1-3 plot the average rates of three utilisation outcomes for the DM and non-DM groups over five years, three of which were pre-announcement. These figures show that pre-trends in outcomes were similar, suggesting that the common trend assumption is reasonable.
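As a numerical counterpart to the graphical check, the following sketch computes the gap between groups in each pre-announcement year and a simple two-group, two-period DD estimate from yearly group averages. The function names and the yearly values are hypothetical and serve only to illustrate the logic.

```python
from statistics import mean


def pre_trend_gaps(treated, control, n_pre):
    """Gap between the groups in each pre-period year.

    Roughly constant gaps are consistent with a common trend.
    """
    return [t - c for t, c in zip(treated[:n_pre], control[:n_pre])]


def did(treated, control, n_pre):
    """Two-group, two-period difference-in-differences from yearly group means."""
    t_pre, t_post = mean(treated[:n_pre]), mean(treated[n_pre:])
    c_pre, c_post = mean(control[:n_pre]), mean(control[n_pre:])
    return (t_post - t_pre) - (c_post - c_pre)


# Hypothetical yearly average outcomes over five years (three pre-announcement).
dm_outcome = [5.0, 5.2, 5.4, 5.9, 6.1]
non_dm_outcome = [5.5, 5.7, 5.9, 6.1, 6.3]

print(pre_trend_gaps(dm_outcome, non_dm_outcome, n_pre=3))
print(did(dm_outcome, non_dm_outcome, n_pre=3))
```

In this toy series the pre-period gap is constant, so the change in the gap after the announcement is attributed to the treatment under the common trend assumption.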

Discussion
Although team-based care has shown promise in delivering the comprehensive and coordinated care that is expected from modern primary care, the promotion of this type of primary care delivery model should be guided by carefully evaluated trials. A key challenge facing evaluation is selection bias. To account for selection bias in the case of the DM, I compare three common matching methods. I find that CEM, PSM, and PSW all produce a control group that resembles the treated group in terms of both observed baseline characteristics and pre-trends in utilisation outcomes. This makes a convincing qualitative argument that any of the control groups is suitable for DD analysis, which suggests using PSW because it uses all available data. However, given the size of the DM group, the power to detect real effects remains a serious concern. Without adequate statistical power, a lack of significant findings would not necessarily indicate that the DM does not work. Instead, the findings may be false negatives. Moreover, DD analysis relies on the interaction between time and treatment status, which is notoriously lacking in power (Strumpf et al., 2017b). That said, identification in DD analysis often arises from changes in treatment status by a small number of units, and remedies are available. Some remedies aim to reduce the minimum detectable effect (MDE). For example, Peikes et al. (2011) show how measuring healthcare utilisation among high utilisers, such as the chronically ill, greatly reduces MDEs. Redefining healthcare utilisation to be, for example, binary, also reduces MDEs, although possibly at the expense of lower apparent treatment effects. Notably, the concern that linear probability models might predict outcomes outside the [0,1] range does not carry much weight in DD analysis. In contrast, patient-reported measures have low MDEs and should be easily estimated. Other remedies involve the use of alternative inference approaches. Two prominent examples are Brewer et al. (2018) and Conley and Taber (2011). Both feature simulations that show how their proposed estimators improve power considerably compared to the standard ordinary least squares estimator in DD analysis. A drawback of these remedies is that they require patient-level data. If neither patient-level data nor adequate statistical power is attainable, confidence intervals can still be informative about the presence of clinically or economically important differences between the treated and control groups.
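To give a rough sense of why the size of the treated group dominates power, the sketch below uses the textbook normal-approximation formula for the MDE of a difference in means, MDE = (z_{1-alpha/2} + z_{power}) x SE. This is a simplification and not the article's calculation: it ignores clustering, the DD structure, and the poor quality of normal approximations with very few treated units. The function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde(sd, n_treated, n_control, alpha=0.05, power=0.80):
    """Minimum detectable difference in means for a two-sided test.

    Uses the normal approximation with a common outcome standard deviation.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sd * sqrt(1 / n_treated + 1 / n_control)


# Hypothetical inputs: outcome standard deviation of 1, five treated units.
small_control = mde(sd=1.0, n_treated=5, n_control=15)
large_control = mde(sd=1.0, n_treated=5, n_control=800)
lower_variance = mde(sd=0.5, n_treated=5, n_control=15)

print(small_control, large_control, lower_variance)
```

Under these assumptions, adding many more control units lowers the MDE only modestly when the treated group stays at five, whereas halving the outcome standard deviation halves it, which is in line with the emphasis on outcome choice among the remedies above.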
It should also be stated that neither CEM, PSM, nor PSW is able to balance all the observed characteristics. This can occur when the treated group is small because one cannot afford to discard treated units regardless of whether suitable untreated units are available. While some differences in observed characteristics that can be expected to persist over time may be acceptable, as DD adjusts for all fixed differences between groups, such differences may raise concerns about potential unobserved time-varying differences. In this case, one may prefer the synthetic control (SC) method (Abadie et al., 2010), which constructs a weighted combination of untreated units, with the weights chosen so that the pre-trend in the outcome of the synthetic control mirrors that of the treated unit, thus allowing for unobserved time-varying differences. A practical drawback of this method is that one must fit a separate synthetic control for each treated unit. Moreover, to ensure its credibility, one should restrict the untreated group to units with characteristics similar to the treated unit, have many pre-period data points, and avoid solely optimising the pre-trend in outcome while ignoring other characteristics that influence future outcome values (Kaul et al., 2017). However, if these requirements are met, then this is an attractive method to account for selection bias. Moreover, recent extensions accommodate not only multiple treated units, but also complex designs, such as staggered adoption and treatment reversal (Abadie, 2021).
Ultimately, my findings suggest that matching methods can be used in comparative case studies to account for selection bias when the treated group is very small and distinct. This result is especially useful in situations where the SC method is unfit. Additionally, I identify important predictors of joining the DM. This may be of particular interest to policymakers seeking to increase participation in new team-based primary care delivery models.