
Impact Evaluation

Posted by: Kultar Singh
Category: Research and M&E
Impact Evaluation Primer

Impact evaluation is the most widely used summative evaluation method for ascertaining a project’s contribution to its desired outcomes. It assesses the overall effect of the programme by determining whether the project’s activities and tasks accomplished its purpose and goal, and it belongs to the subset of evaluations that seek to establish the relationship between a project’s intervention and its impact. Its primary objective is to establish a causal relationship between the intervention and the observed effect. In contrast to other assessment methods, the fundamental challenge in impact evaluation is establishing this causal relationship so that change can be ascribed to the intervention. The most effective way to address the attribution dilemma is to construct a counterfactual that shows what would have happened without the intervention.

One of the key objectives of impact evaluation is to minimize threats to internal validity, which weaken the inferred relationship between intervention and outcome, and threats to external validity, which undermine our confidence that the evaluation results generalize to the wider population.

Impact Evaluation Design

Evaluation design can be broadly classified into three categories: Experimental, Quasi-Experimental, and Non-experimental design.

Experimental Design

Randomized Controlled Trial (RCT)

The most robust design for evaluation is the experimental design, i.e., the RCT, which provides a strong counterfactual for attributing change to the project intervention. In terms of assessment methods, the randomized controlled trial is one of the simplest yet most effective. RCTs are regarded as the gold standard because randomization ensures that, on average, all other potential causes are the same between the two groups (one receiving the intervention and one not). Hence, a substantial difference in outcomes can be traced to the intervention.

Further, it is important to differentiate between RCTs and CRTs (cluster randomized trials). In an RCT, the unit of randomization is the individual, whereas in a CRT randomization happens at the cluster level, e.g., at the village level (rather than the household level) or the school level (rather than the student level). After deciding on the intervention and finalizing the level of randomization, one of the critical steps is to select a method of randomization.

The first method is simple randomization (or complete randomization), wherein a random number generator or lottery system assigns units to the treatment and control groups. Stratification involves dividing the sample into groups with similar characteristics related to the outcomes and randomizing within each stratum. This approach has the advantage of ensuring that the treatment and control samples are balanced across those characteristics. Blocking can also be used as a randomization method.
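
Below is a minimal Python sketch of simple and stratified randomization. The sampling frame, stratum labels, and column names (id, stratum, arm_simple, arm_stratified) are illustrative assumptions, not taken from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the assignment is reproducible

# Illustrative sampling frame: 200 units spread across four strata.
units = pd.DataFrame({
    "id": range(200),
    "stratum": rng.choice(["A", "B", "C", "D"], size=200),
})

# Simple (complete) randomization: a fair coin flip for every unit.
units["arm_simple"] = rng.choice(["treatment", "control"], size=len(units))

# Stratified randomization: shuffle within each stratum and split it in half,
# so the arms stay balanced on the stratifying characteristic.
arms = []
for stratum, group in units.groupby("stratum"):
    shuffled = group.sample(frac=1, random_state=42)
    arm = pd.Series("control", index=shuffled.index)
    arm.iloc[: len(shuffled) // 2] = "treatment"
    arms.append(arm)
units["arm_stratified"] = pd.concat(arms)

print(units.groupby(["stratum", "arm_stratified"]).size())
```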

Further, it is crucial to ensure that the sample size for an RCT is large enough to statistically detect the minimum detectable effect size (MDES) for the outcomes of interest. One can use standard formulae to compute the sample size for a given power and MDES, or use software such as STATA.
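
As an illustration, the sketch below uses the statsmodels power module to compute the sample size per arm for a given MDES, and conversely the MDES for a fixed sample. The MDES of 0.2 standard deviations, 80% power, and 5% significance level are assumed values for the example; for a CRT, the resulting sample would additionally need to be inflated by the design effect.

```python
# Power calculations for a two-arm individually randomized trial.
# Effect sizes are in standard-deviation units (Cohen's d).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per arm needed to detect an MDES of 0.2 SD with 80% power.
n_per_arm = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1.0)
print(f"Required sample size per arm: {n_per_arm:.0f}")

# Conversely, the MDES that a fixed sample of 400 units per arm can detect.
mdes = analysis.solve_power(nobs1=400, alpha=0.05, power=0.8, ratio=1.0)
print(f"MDES with 400 units per arm: {mdes:.3f} SD")
```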

Since randomization should ensure that the treatment and control groups are similar in all respects other than receipt of the intervention, the impact could, in theory, be estimated simply by computing the difference in mean outcomes between the two groups. In practice, however, it is better to estimate the impact in a regression framework that controls for baseline characteristics, which improves precision and accounts for any chance imbalance between treatment and control.
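
The sketch below illustrates this on simulated endline data: the treatment effect is the coefficient on the treatment dummy, estimated with and without a baseline covariate. All variable names and the simulated effect of 5 are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"treatment": rng.integers(0, 2, n)})
df["y_baseline"] = rng.normal(50, 10, n)
df["y"] = df["y_baseline"] + 5 * df["treatment"] + rng.normal(0, 10, n)  # true effect = 5

# Simple difference in means: the coefficient on `treatment`.
simple = smf.ols("y ~ treatment", data=df).fit()

# Controlling for the baseline value of the outcome improves precision
# and guards against chance imbalance between the arms.
adjusted = smf.ols("y ~ treatment + y_baseline", data=df).fit()
print(simple.params["treatment"], adjusted.params["treatment"], adjusted.bse["treatment"])
```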

 


Stepped Wedge Approach/Pipeline Approach

The stepped wedge randomized trial, also known as a lottery or pipeline design, is an impact evaluation design in which the comparison group comprises those who have not yet received the intervention but are scheduled to receive it. Cook and Campbell were among the first to consider the situations in which this design can be employed. The first empirical example of the design is the Gambia Hepatitis Study, from which the name ‘stepped wedge’ was coined because the schematic of the roll-out resembles a stepped wedge.

The design involves a sequential roll-out of the intervention to participants over several periods, with the crucial feature that the order in which participants receive the intervention is randomly determined. In terms of analysis, the stepped wedge design is similar to an RCT: a simple difference in means across treatment and comparison groups can be used to estimate the programme effect, or a regression equation can be used to estimate the effect of the intervention.
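
The sketch below illustrates the key randomization step for a hypothetical stepped wedge roll-out: clusters are randomly assigned to the step at which they cross over from control to treatment, producing the cluster-by-period exposure matrix used in the analysis. The number of clusters and steps are assumed for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_clusters, n_steps = 12, 4

# Randomly order the clusters, then spread them evenly across the roll-out steps.
order = rng.permutation(n_clusters)
step_of_cluster = {cluster: (rank % n_steps) + 1 for rank, cluster in enumerate(order)}

# Build the cluster-by-period exposure matrix: a cluster is "treated" in every
# period at or after its randomly assigned step.
rows = []
for cluster in range(n_clusters):
    for period in range(1, n_steps + 1):
        rows.append({
            "cluster": cluster,
            "period": period,
            "treated": int(period >= step_of_cluster[cluster]),
        })
design = pd.DataFrame(rows)
print(design.pivot(index="cluster", columns="period", values="treated"))
```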

Quasi-experimental Design

When randomization is not feasible, for example when evaluating an existing programme, a quasi-experimental design can be used; it still relies on constructing a counterfactual. The key quasi-experimental designs are described below:

Difference in Difference

Double difference or difference-in-differences (DID) methods compare a treatment and a comparison group (the first difference) before and after the intervention (the second difference). The method can be applied in experimental and quasi-experimental designs and requires baseline and follow-up data from the same treatment and comparison groups. The DID estimator evaluates the effect of the intervention on the outcome variable. To estimate the difference-in-differences, one creates dummy variables for treatment and post and fits the regression equation described below:

y_i = β0 + β1·treatment_i + β2·post_i + β3·(treatment_i × post_i) + e_i

where post is a dummy variable equal to 1 at endline and 0 at baseline, and treatment is a dummy variable equal to 1 if an individual is in the treatment group and 0 otherwise.

The estimate of β3 is the DID estimator: the differential effect of the treatment. β2 represents the time trend in the control group, and β1 represents the difference between the two groups at baseline.

Table: Computing the DID estimator

              Treatment               Control     Difference
Before        β0 + β1                 β0          β1
After         β0 + β1 + β2 + β3       β0 + β2     β1 + β3
Difference    β2 + β3                 β2          β3
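
The sketch below fits the regression above on simulated baseline and endline data; the coefficient on the interaction term is the DID estimate. The simulated coefficients (including the true effect of 3.0) and variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),  # 1 = treatment group, 0 = comparison
    "post": rng.integers(0, 2, n),       # 1 = endline, 0 = baseline
})
df["y"] = (
    10                                    # beta0: baseline mean in the control group
    + 2 * df["treatment"]                 # beta1: baseline difference between groups
    + 1.5 * df["post"]                    # beta2: time trend in the control group
    + 3.0 * df["treatment"] * df["post"]  # beta3: the DID effect of interest
    + rng.normal(0, 2, n)
)

model = smf.ols("y ~ treatment + post + treatment:post", data=df).fit()
print(model.params["treatment:post"])     # recovers roughly 3.0
```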

Propensity score matching

Propensity score matching (PSM) is another popular and widely used quasi-experimental method for impact evaluation. Besides being used on its own, it can complement the difference-in-differences method by balancing the treatment and comparison groups and thereby refining the estimate.

The propensity score was defined by Rosenbaum and Rubin (1983a) as the probability of treatment assignment conditional on observed baseline covariates. It can be expressed as the probability of participating in the programme (being treated) as a function of the individual’s observed characteristics:

P(X) = Prob(D = 1|X)

where D indicates participation in the programme and X is the set of observable characteristics.

To measure the effect of the programme, we maintain the assumption of selection on observables, i.e., we assume that participation is independent of outcomes conditional on X, so that, in the absence of the programme,

E(Y | X, D = 1) = E(Y | X, D = 0).

Operationally, one can follow a sequence of steps to assess impact using PSM (a worked sketch follows the list):

  1. Create a dichotomous variable for the two groups, i.e., project and comparison areas.
  2. Generate a propensity score for each household using a logit/probit model.
  3. Check that the matched sets of units satisfy the balancing properties.
  4. Compute the average treatment effect using local linear regression matching, ensuring common support.
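
The sketch below walks through these steps on simulated data, using a logit model for the propensity score and, for simplicity, 1:1 nearest-neighbour matching with replacement in place of the local linear regression matching mentioned above. All variable names, the covariates, and the simulated effect of 2.0 are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(20, 5, n),
})
# Participation depends only on observables (selection on observables).
p_true = 1 / (1 + np.exp(-(-3 + 0.05 * df["age"] + 0.05 * df["income"])))
df["D"] = rng.binomial(1, p_true)
df["y"] = 1 + 0.1 * df["age"] + 0.2 * df["income"] + 2.0 * df["D"] + rng.normal(0, 1, n)

# Estimate the propensity score with a logit model.
ps_model = smf.logit("D ~ age + income", data=df).fit(disp=0)
df["pscore"] = ps_model.predict(df)

# Enforce common support: keep treated units whose score lies within the
# range of comparison-group scores.
treated = df[df["D"] == 1]
control = df[df["D"] == 0]
on_support = treated["pscore"].between(control["pscore"].min(), control["pscore"].max())
treated = treated[on_support]

# 1:1 nearest-neighbour matching on the propensity score, with replacement.
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_controls = control.iloc[idx.ravel()]

# Average treatment effect on the treated: matched difference in outcome means.
att = treated["y"].mean() - matched_controls["y"].mean()
print(f"ATT estimate: {att:.2f}")  # close to the simulated effect of 2.0
```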

While working with PSM, it is crucial to have a sufficiently large and comparable sample in both the treatment and comparison groups to obtain a substantial number of appropriate matches. It is also important to collect rich data on as many observable characteristics as possible and to use the same set of questions in the treatment and comparison groups.

Regression Discontinuity

Regression discontinuity (RD) is a quasi-experimental technique for estimating the causal effect of an intervention by generating a counterfactual based on a cutoff that separates participants from non-participants. Donald L. Thistlethwaite and Donald T. Campbell (1960) pioneered RD designs to estimate treatment effects in non-experimental situations where treatment is assigned when an observable assignment variable exceeds a predefined cutoff point. In this design, evaluators compare individuals with values just above and just below the cutoff to estimate the treatment effect. Although the RD design has the clear structure of an experimental design, it lacks the random assignment feature.

Depending on the nature of the cutoff, a regression discontinuity design can be sharp or fuzzy. In a sharp design, all observations on one side of the cutoff are treated, and all observations on the other side are not. In a fuzzy design, crossing the cutoff changes only the probability of receiving treatment, so some units on either side of the cutoff may or may not be treated.

In terms of advantages, RD relies on the assumption that variation in treatment is as good as random in a neighborhood around the cutoff, and the design can therefore be analysed like an experiment. However, because the approach estimates the effect for the subpopulation near the cutoff, the result may not generalize to the broader population.
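
The sketch below estimates a sharp RD effect on simulated data by fitting a local linear regression with separate slopes on each side of the cutoff within a chosen bandwidth; the cutoff of 50, the bandwidth of 10, and the true jump of 4.0 are assumed for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n, cutoff, bandwidth = 3000, 50.0, 10.0
df = pd.DataFrame({"score": rng.uniform(0, 100, n)})
df["treated"] = (df["score"] >= cutoff).astype(int)   # sharp assignment rule
df["y"] = 0.3 * df["score"] + 4.0 * df["treated"] + rng.normal(0, 2, n)

# Keep only observations close to the cutoff and centre the running variable.
local = df[(df["score"] - cutoff).abs() <= bandwidth].copy()
local["centered"] = local["score"] - cutoff

# One regression with separate slopes on each side of the cutoff; the
# coefficient on `treated` is the discontinuity (the treatment effect).
model = smf.ols("y ~ treated + centered + treated:centered", data=local).fit()
print(model.params["treated"])  # recovers roughly 4.0
```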

Instrumental Variable / Randomized Promotion

The instrumental variable (IV) method can help evaluate programmes with imperfect compliance, voluntary enrollment, or universal coverage. It uses one or more variables (instruments) that influence participation but are not otherwise related to the outcome. The method is relevant when exposure to an intervention is to some degree determined by an external factor that affects the outcome only indirectly, through its influence on exposure.

Randomized encouragement designs address the evaluation problem by randomly varying incentives to participate in a programme, without affecting the outcomes of interest, thereby making it possible to measure average treatment effects. A randomized offer is suitable when you can exclude certain individuals from the intervention but cannot compel them to participate.
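
The sketch below illustrates the logic of two-stage least squares for a randomized encouragement design on simulated data: the random encouragement is the instrument for actual participation. It is a manual two-stage construction for exposition only; the naive second-stage standard errors are not valid, so a dedicated IV routine should be used in practice. All variable names and the true effect of 5.0 are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 5000
df = pd.DataFrame({"z": rng.integers(0, 2, n)})        # randomly assigned encouragement
ability = rng.normal(0, 1, n)                          # unobserved confounder
# Participation depends on the encouragement and on the confounder (self-selection).
df["d"] = ((0.8 * df["z"] + 0.5 * ability + rng.normal(0, 1, n)) > 0.5).astype(int)
df["y"] = 5.0 * df["d"] + 2.0 * ability + rng.normal(0, 1, n)

# First stage: predict participation from the instrument.
first = smf.ols("d ~ z", data=df).fit()
df["d_hat"] = first.fittedvalues

# Second stage: regress the outcome on predicted participation.
second = smf.ols("y ~ d_hat", data=df).fit()
print(second.params["d_hat"])  # close to 5.0, unlike a naive OLS of y on d
```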

Multiple Baseline Design

In a multiple baseline design, as the name suggests, baseline data are collected multiple times. Before any intervention occurs, the outcome is measured in each group, and the intervention is then introduced in the first group(s). After enough time has passed for the intervention to affect the outcome in the first group(s), outcome measurements are conducted in all groups and the intervention is introduced in one or more additional groups; this process continues until all groups have received the intervention. There are two variations of the multiple baseline design: the multiple probe design and the delayed baseline design.

Further, in terms of analysis, the most basic technique for analyzing a multiple baseline design is visual analysis. The protocol is to look for a shift in the mean outcome, which can be assessed by simple visual inspection for a substantial change (Zhan & Ottenbacher, 2001); examining the change in slope visually also indicates the reliability or consistency of intervention effects (Long & Hollin, 1995). Other methods include the Box-Jenkins (ARIMA) approach and the split-middle technique.

With the split-middle technique, one reveals the nature of the trend in the data by plotting the linear trend line that best fits the baseline data and then applying a binomial test to see whether the number of intervention-phase data points falling above or below the projected baseline line differs from chance (Kazdin, 1982; Kinugasa et al., 2004). Randomization tests and multilevel modelling can also be used to analyze multiple baseline studies, with the levels of analysis determined by the structure of the design.
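
The sketch below applies the split-middle technique with a binomial test to simulated single-case data (it requires scipy 1.7 or later for binomtest); the baseline and intervention series are illustrative.

```python
import numpy as np
from scipy.stats import binomtest

baseline = np.array([4, 5, 4, 6, 5, 6, 5, 7])          # baseline phase
intervention = np.array([8, 9, 8, 10, 9, 11, 10, 12])  # intervention phase

# Split-middle trend: join the median point of each half of the baseline phase.
x = np.arange(len(baseline))
half = len(baseline) // 2
x1, y1 = np.median(x[:half]), np.median(baseline[:half])
x2, y2 = np.median(x[half:]), np.median(baseline[half:])
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# Project the baseline trend into the intervention phase.
x_int = np.arange(len(baseline), len(baseline) + len(intervention))
projected = intercept + slope * x_int

# Binomial test: under no effect, about half the intervention points
# should fall above the projected baseline line (p = 0.5).
above = int((intervention > projected).sum())
result = binomtest(above, n=len(intervention), p=0.5, alternative="greater")
print(above, result.pvalue)
```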

Interrupted Time-Series Designs 

Interrupted time-series designs can also be employed for impact evaluation. In an interrupted time-series design, observations are made both before and after the treatment, and evidence of intervention impact comes from a discontinuity in the time series at the point when the treatment was implemented. Participants are measured repeatedly before, during, and after exposure to the treatment condition. The pre-intervention measurement phase is called the baseline, and a treatment effect is observed if the pattern of post-treatment responses differs from the baseline.
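
The sketch below fits a simple segmented regression to simulated monthly data, with terms for the pre-existing trend, the immediate level change at the interruption, and the change in slope afterwards; the number of periods, the interruption point, and the effect sizes are assumed for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
periods, interruption = 36, 24
df = pd.DataFrame({"time": np.arange(1, periods + 1)})
df["post"] = (df["time"] > interruption).astype(int)
df["time_since"] = np.where(df["post"] == 1, df["time"] - interruption, 0)
df["y"] = (
    20 + 0.5 * df["time"]          # pre-existing level and trend
    + 6 * df["post"]               # immediate level change at the interruption
    + 1.0 * df["time_since"]       # change in slope after the interruption
    + rng.normal(0, 2, periods)
)

# `post` captures the level shift; `time_since` captures the trend change.
model = smf.ols("y ~ time + post + time_since", data=df).fit()
print(model.params[["post", "time_since"]])
```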

Non-experimental Design

Non-experimental designs primarily consist of:

  1. Single group post-test only design
  2. Single group pre-test/post-test design

Single group post-test only design:

In this design, beneficiaries/clients of an intervention are assessed after the treatment has taken place. Participants, for example, may be asked to rate the intervention’s effectiveness by answering a simple question.

Single group Pre-test/ post-test design:

In this design, a single group is studied using before-and-after measurements. When evaluating the effectiveness of a training programme, a knowledge test may be used both before and after the training session to gauge its results.
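
As a simple illustration, the sketch below compares pre- and post-training knowledge scores with a paired t-test on simulated data; the scores and sample size are illustrative, and without a comparison group the observed gain cannot be attributed to the training alone.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(13)
pre = rng.normal(60, 10, 50)        # pre-training knowledge scores
post = pre + rng.normal(8, 5, 50)   # post-training scores with an average gain of 8

# Paired t-test on the before-and-after measurements from the same group.
stat, pvalue = ttest_rel(post, pre)
print(f"mean gain = {np.mean(post - pre):.1f}, p-value = {pvalue:.4f}")
```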

In non-experimental designs, counterfactuals cannot be constructed at the field level, but statistical counterfactuals may be generated if the design and data allow.

References:

  1. Baker, J. (2000). Evaluating the Impact of Development Projects on Poverty: A Handbook for Practitioners. Washington, DC: World Bank.
  2. Boruch, R. F. (1996). Randomized Experiments for Planning and Evaluation: A Practical Guide. Thousand Oaks, CA: Sage Publications.
  3. Campbell, D. T., & Stanley, J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
  4. Duflo, E., & Kremer, M. (2003). “Use of Randomization in the Evaluation of Development Effectiveness.” Paper prepared for the Conference on Evaluation and Development Effectiveness, 15–16 July, World Bank, Washington, DC.
  5. Hall, A., Inskip, H., Loik, N., Day, O’Conor, G., Bosch, X., Muir, C., Parkin, M., Munoz, N., Tomatis, L., Greenwood, B., Whittle, H., Ryder, R., Oldfield, F., N’jie, A., Smith, P., & Coursaget, P. (1987). The Gambia Hepatitis Intervention Study. Cancer Research, 47, 5782–5787.
  6. Kazdin, A. E. (1982). Single-Case Research Designs: Methods for Clinical and Applied Settings. New York: Oxford University Press.
  7. Khandker, S. R., Koolwal, G. B., & Samad, H. A. (2010). Handbook on Impact Evaluation: Quantitative Methods and Practices. Washington, DC: World Bank. https://openknowledge.worldbank.org/handle/10986/2693. License: CC BY 3.0 IGO.
  8. Kinugasa, T., Cerin, E., & Hooper, S. (2004). Single-subject research designs and data analyses for assessing elite athletes’ conditioning. Sports Medicine, 34, 1035–1050.
  9. Long, C. G., & Hollin, C. R. (1995). Single case design: A critique of methodology and analysis of recent trends. Clinical Psychology & Psychotherapy, 2, 177–191.
  10. Rosenbaum, P. R., & Rubin, D. B. (1983a). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
  11. Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51(6), 309–317.
  12. Zhan, S., & Ottenbacher, K. J. (2001). Single subject research designs for disability research. Disability & Rehabilitation, 23, 1–8.

Kultar Singh – Chief Executive Officer, Sambodhi
