
Impact evaluation

An impact evaluation provides information about the observed changes or 'impacts' produced by an intervention.

These observed changes can be positive and negative, intended and unintended, direct and indirect. An impact evaluation must establish the cause of the observed changes. Identifying the cause is known as 'causal attribution' or 'causal inference'.

If an impact evaluation fails to systematically undertake causal attribution, there is a greater risk that the evaluation will produce incorrect findings and lead to incorrect decisions – for example, deciding to scale up a programme that is actually ineffective, or effective only in certain limited situations, or deciding to exit when a programme could be made to work if limiting factors were addressed.

1. What is impact evaluation?

An impact evaluation provides information about the impacts produced by an intervention.

The intervention might be a small project, a large programme, a collection of activities, or a policy.

Many development agencies use the definition of impacts provided by the Organisation for Economic Co-operation and Development – Development Assistance Committee:

"Positive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended." (OECD-DAC 2010)

This definition implies that impact evaluation:

  • goes beyond describing or measuring impacts that have occurred to seeking to understand the role of the intervention in producing these (causal attribution);
  • can encompass a broad range of methods for causal attribution; and,
  • includes examining unintended impacts.

2. Why do impact evaluation?

An impact evaluation can be undertaken to improve or reorient an intervention (i.e., for formative purposes) or to inform decisions about whether to continue, discontinue, replicate or scale up an intervention (i.e., for summative purposes).

While many formative evaluations focus on processes, impact evaluations can also be used formatively if an intervention is ongoing. For example, the findings of an impact evaluation can be used to improve implementation of a programme for the next intake of participants by identifying critical elements to monitor and tightly manage.

Most often, impact evaluation is used for summative purposes. Ideally, a summative impact evaluation does not only produce findings about ‘what works’ but also provides information about what is needed to make the intervention work for different groups in different settings.

3. When to do impact evaluation?

An impact evaluation should only be undertaken when its intended use can be clearly identified and when it is likely to be able to produce useful findings, taking into account the availability of resources and the timing of decisions about the intervention under investigation. An  evaluability assessment  might need to be done first to assess these aspects.

Prioritizing interventions for impact evaluation should consider: the relevance of the evaluation to the organisational or development strategy; its potential usefulness; the commitment from senior managers or policy makers to using its findings; and/or its potential use for advocacy or accountability requirements.

It is also important to consider the timing of an impact evaluation. If it is conducted too late, the findings will come too late to inform decisions. If it is done too early, it will give an inaccurate picture of the impacts (i.e., impacts will be understated if they have had insufficient time to develop, or overstated if they decline over time).

What are the intended uses and timings?

Impact evaluation might be appropriate when there is scope to use the findings to inform decisions about future interventions.

It might not be appropriate when there are no clear intended uses or intended users – for example, if decisions have already been made on the basis of existing credible evidence, or if decisions need to be made before it is possible to undertake a credible impact evaluation.

What is the current focus?

Impact evaluation might be appropriate when there is a need to understand the impacts that have been produced. 

It might not be appropriate when the priority at this stage is to understand and improve the quality of the implementation. 

Are there adequate resources to do the job? 

Impact evaluation might be appropriate when there are adequate resources to undertake a sufficiently comprehensive and rigorous impact evaluation, including the availability of existing, good quality data and additional time and money to collect more. 

It might not be appropriate when existing data are inadequate and there are insufficient resources to fill gaps with new, good quality data collection. 

Is it relevant to current strategies and priorities?

Impact evaluation might be appropriate when it is clearly linked to the strategies and priorities of an organisation, partnership and/or government. 

It might not be appropriate when it is peripheral to the strategies and priorities of an organisation, partnership and/or government. 

4. Who to engage in the evaluation process?

Regardless of the type of evaluation, it is important to think through who should be involved, why and how they will be involved in each step of the evaluation process to develop an appropriate and context-specific participatory approach. Participation can occur at any stage of the impact evaluation process: in deciding to do an evaluation, in its design, in data collection, in analysis, in reporting and, also, in managing it.

Being clear about the purpose of participatory approaches in an impact evaluation is an essential first step towards managing expectations and guiding implementation. Is the purpose to ensure that the voices of those whose lives should have been improved by the programme or policy are central to the findings? Is it to ensure a relevant evaluation focus? Is it to hear people’s own versions of change rather than obtain an external evaluator’s set of indicators? Is it to build ownership of a donor-funded programme? These, and other considerations, would lead to different forms of participation by different combinations of stakeholders in the impact evaluation.

The underlying rationale for choosing a participatory approach to impact evaluation can be either pragmatic or ethical, or a combination of the two. Pragmatic because better evaluations are achieved (i.e. better data, better understanding of the data, more appropriate recommendations, better uptake of findings); ethical because it is the right thing to do (i.e. people have a right to be involved in informing decisions that will directly or indirectly affect them, as stipulated by the UN human rights-based approach to programming).

Participatory approaches can be used in any impact evaluation design. In other words, they are not exclusive to specific evaluation methods or restricted to quantitative or qualitative data collection and analysis.

The starting point for any impact evaluation intending to use participatory approaches lies in clarifying what value this will add to the evaluation itself and to the people who would be closely involved, as well as the potential risks their participation may entail. Three questions need to be answered in each situation:

(1) What purpose will stakeholder participation serve in this impact evaluation?;

(2) Whose participation matters, when and why?; and,

(3) When is participation feasible?

Only after addressing these can the issue of how to make impact evaluation more participatory be addressed.

Read more on who to engage in the evaluation process:

  • The  BetterEvaluation Rainbow Framework  provides a good overview of the key stages in the evaluation process during which the question ‘Who is best involved?’ can be asked. These stages involve: managing the impact evaluation, defining and framing the evaluation focus, collecting data on impacts, explaining impacts, synthesising findings, and reporting on and supporting the use of the evaluation findings.
  • Understand and engage stakeholders
  • Participatory Evaluation
  • UNICEF Brief 5. Participatory Approaches

5. How to plan and manage an impact evaluation?

Like any other evaluation, an impact evaluation should be planned formally and managed as a discrete project, with decision-making processes and management arrangements clearly described from the beginning of the process.

Planning and managing include:

  • Describing what needs to be evaluated and developing the evaluation brief
  • Identifying and mobilizing resources
  • Deciding who will conduct the evaluation and engaging the evaluator(s)
  • Deciding and managing the process for developing the evaluation methodology
  • Managing development of the evaluation work plan
  • Managing implementation of the work plan including development of reports
  • Disseminating the report(s) and supporting use

Determining causal attribution is a requirement for calling an evaluation an impact evaluation. The design options (whether experimental, quasi-experimental, or non-experimental) all need significant investment in preparation and early data collection, and cannot be done if an impact evaluation is limited to a short exercise conducted towards the end of intervention implementation. Hence, it is particularly important that impact evaluation is addressed as part of an integrated monitoring, evaluation and research plan and system that generates and makes available a range of evidence to inform decisions. This will also ensure that data from other M&E activities such as performance monitoring and process evaluation can be used, as needed.

Read more on how to plan and manage an impact evaluation:

  • Plan and manage an evaluation
  • Establish decision making processes
  • Determine and secure resources
  • Document management processes and agreements
  • UNICEF Brief 1. Overview of impact evaluation

6. What methods can be used to do impact evaluation?

Framing the boundaries of the impact evaluation

The evaluation purpose refers to the rationale for conducting an impact evaluation. Evaluations that are being undertaken to support learning should be clear about who is intended to learn from it, how they will be engaged in the evaluation process to ensure it is seen as relevant and credible, and whether there are specific decision points around where this learning is expected to be applied. Evaluations that are being undertaken to support accountability should be clear about who is being held accountable, to whom and for what.

Evaluation relies on a combination of facts and values (i.e., principles, attributes or qualities held to be intrinsically good, desirable, important and of general worth such as ‘being fair to all’) to judge the merit of an intervention (Stufflebeam 2001). Evaluative criteria specify the values that will be used in an evaluation and, as such, help to set boundaries.

Many impact evaluations use the standard OECD-DAC criteria (OECD-DAC accessed 2015):

  • Relevance : The extent to which the objectives of an intervention are consistent with recipients’ requirements, country needs, global priorities and partners’ policies.
  • Effectiveness : The extent to which the intervention’s objectives were achieved, or are expected to be achieved, taking into account their relative importance. 
  • Efficiency : A measure of how economically resources/inputs (funds, expertise, time, equipment, etc.) are converted into results.
  • Impact : Positive and negative primary and secondary long-term effects produced by the intervention, whether directly or indirectly, intended or unintended.
  • Sustainability : The continuation of benefits from the intervention after major development assistance has ceased. Interventions must be both environmentally and financially sustainable. Where the emphasis is not on external assistance, sustainability can be defined as the ability of key stakeholders to sustain intervention benefits – after the cessation of donor funding – with efforts that use locally available resources.

The OECD-DAC criteria reflect the core principles for evaluating development assistance (OECD-DAC 1991) and have been adopted by most development agencies as standards of good practice in evaluation. Other commonly used evaluative criteria relate to equity, gender equality and human rights, and some are used for particular types of development interventions, such as humanitarian assistance (coverage, coordination, protection, coherence). Not all of these evaluative criteria are used in every evaluation; the choice depends on the type of intervention and/or the type of evaluation (e.g., the criterion of impact is irrelevant to a process evaluation).

Evaluative criteria should be thought of as ‘concepts’ that must be addressed in the evaluation. On their own, however, they are not defined precisely enough to be applied systematically and transparently when making evaluative judgements about the intervention. Under each of the ‘generic’ criteria, more specific criteria such as benchmarks and/or standards* – appropriate to the type and context of the intervention – should be defined and agreed with key stakeholders.

The evaluative criteria should be clearly reflected in the evaluation questions the evaluation is intended to address.

*A benchmark or index is a set of related indicators that provides for meaningful, accurate and systematic comparisons regarding performance; a standard or rubric is a set of related benchmarks/indices or indicators that provides socially meaningful information regarding performance.
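
To illustrate the footnote above (a minimal sketch only; the indicator names, values and weights are entirely hypothetical), an index of this kind can be computed as a weighted combination of related, already-normalised indicators:

```python
# Illustrative sketch: combining related indicators into a simple composite index.
# Indicator names, values and weights are hypothetical examples.

indicators = {
    "school_attendance_rate": {"value": 0.82, "weight": 0.5},   # proportion (0-1)
    "household_food_security": {"value": 0.64, "weight": 0.3},  # proportion (0-1)
    "access_to_clean_water": {"value": 0.71, "weight": 0.2},    # proportion (0-1)
}

def composite_index(indicators: dict) -> float:
    """Weighted average of already-normalised (0-1) indicator values."""
    total_weight = sum(item["weight"] for item in indicators.values())
    weighted_sum = sum(item["value"] * item["weight"] for item in indicators.values())
    return weighted_sum / total_weight

print(f"Composite index: {composite_index(indicators):.2f}")  # prints 0.74 for these values
```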

Defining the key evaluation questions (KEQs) the impact evaluation should address

Impact evaluations should be focused around answering a small number of high-level key evaluation questions (KEQs) that will be answered through a combination of evidence. These questions should be clearly linked to the evaluative criteria. For example:

  • KEQ1: What was the quality of the intervention design/content? [ assessing relevance, equity, gender equality, human rights ]
  • KEQ2: How well was the intervention implemented and adapted as needed? [ assessing effectiveness, efficiency ]
  • KEQ3: Did the intervention produce the intended results in the short, medium and long term? If so, for whom, to what extent and in what circumstances? [ assessing effectiveness, impact, equity, gender equality ]
  • KEQ4: What unintended results – positive and negative – did the intervention produce? How did these occur? [ assessing effectiveness, impact, equity, gender equality, human rights ]
  • KEQ5: What were the barriers and enablers that made the difference between successful and disappointing intervention implementation and results? [ assessing relevance, equity, gender equality, human rights ]
  • KEQ6: How valuable were the results to service providers, clients, the community and/or organizations involved? [ assessing relevance, equity, gender equality, human rights ]
  • KEQ7: To what extent did the intervention represent the best possible use of available resources to achieve results of the greatest possible value to participants and the community? [ assessing efficiency ]
  • KEQ8: Are any positive results likely to be sustained? In what circumstances? [ assessing sustainability, equity, gender equality, human rights ]

A range of more detailed (mid-level and lower-level) evaluation questions should then be articulated to address each evaluative criterion in detail. All evaluation questions should be linked explicitly to the evaluative criteria to ensure that the criteria are covered in full.

The KEQs also need to reflect the intended uses of the impact evaluation. For example, if an evaluation is intended to inform the scaling up of a pilot programme, then it is not enough to ask ‘Did it work?’ or ‘What were the impacts?’. A good understanding is needed of how these impacts were achieved in terms of activities and supportive contextual factors to replicate the achievements of a successful pilot. Equity concerns require that impact evaluations go beyond simple average impact to identify for whom and in what ways the programmes have been successful. 

Within the KEQs, it is also useful to identify the different types of questions involved – descriptive, causal and evaluative.

  • Descriptive questions  ask about how things are and what has happened, including describing the initial situation and how it has changed, the activities of the intervention and other related programmes or policies, the context in terms of participant characteristics, and the implementation environment.
  • Causal questions  ask whether or not, and to what extent, observed changes are due to the intervention being evaluated rather than to other factors, including other programmes and/or policies.
  • Evaluative questions  ask about the overall conclusion as to whether a programme or policy can be considered a success, an improvement or the best option.

Read more on defining the key evaluation questions (KEQs) the impact evaluation should address:

  • Specify key evaluation questions
  • UNICEF Brief 3. Evaluative Criteria

Defining impacts

Impacts are usually understood to occur later than, and as a result of, intermediate outcomes. For example, achieving the intermediate outcomes of improved access to land and increased levels of participation in community decision-making might occur before, and contribute to, the intended final impact of improved health and well-being for women. The distinction between outcomes and impacts can be relative, and depends on the stated objectives of an intervention. It should also be noted that some impacts may be emergent, and thus, cannot be predicted.

Read more on defining impacts:

  • Use measures, indicators or metrics
  • UNICEF Brief 11. Developing and Selecting Measures of Child Well-Being

Defining success to make evaluative judgements

Evaluation, by definition, answers  evaluative  questions, that is, questions about quality and value. This is what makes evaluation so much more useful and relevant than the mere measurement of indicators or summaries of observations and stories.

In any impact evaluation, it is important to define first what is meant by ‘success’ (quality, value). One way of doing so is to use a specific rubric that defines different levels of performance (or standards) for each evaluative criterion, deciding what evidence will be gathered and how it will be synthesized to reach defensible conclusions about the worth of the intervention.
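
As a minimal sketch of what such a rubric might look like in practice (the criterion, levels and descriptors below are hypothetical, not drawn from any agreed standard):

```python
# Illustrative sketch of an evaluative rubric: hypothetical criterion and performance levels.

rubric = {
    "effectiveness": [
        # (level, descriptor) ordered from highest to lowest performance
        ("excellent", "Clear evidence that most intended outcomes were achieved for all key groups."),
        ("good", "Most intended outcomes achieved, with minor gaps for some groups."),
        ("adequate", "Some intended outcomes achieved; significant gaps remain."),
        ("poor", "Little or no evidence that intended outcomes were achieved."),
    ]
}

def judge(criterion: str, evidence_summary: str, chosen_level: str) -> str:
    """Record an explicitly evaluative judgement against the agreed rubric."""
    levels = dict(rubric[criterion])
    if chosen_level not in levels:
        raise ValueError(f"Level '{chosen_level}' is not defined in the rubric for {criterion}")
    return (f"{criterion}: rated '{chosen_level}' ({levels[chosen_level]}) "
            f"based on: {evidence_summary}")

print(judge("effectiveness",
            "survey and interview data converged on improved outcomes for women",
            "good"))
```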

At the very least, it should be clear what trade-offs would be appropriate in balancing multiple impacts or distributional effects. Since development interventions often have multiple impacts, which are distributed unevenly, this is an essential element of an impact evaluation. For example, should an economic development programme be considered a success if it produces increases in household income but also produces hazardous environmental impacts? Should it be considered a success if the average household income increases but the income of the poorest households is reduced?

To answer evaluative questions, what is meant by ‘quality’ and ‘value’ must first be defined and then relevant evidence gathered. Quality refers to how good something is; value refers to how good it is in terms of the specific situation, in particular taking into account the resources used to produce it and the needs it was supposed to address. Evaluative reasoning is required to synthesize these elements to formulate defensible (i.e., well-reasoned and well-evidenced) answers to the evaluative questions.

Evaluative reasoning is a requirement of all evaluations, irrespective of the methods or evaluation approach used.

An evaluation should have a limited set of high-level questions which are about performance overall. Each of these KEQs should be further unpacked by asking more detailed questions about performance on specific dimensions of merit and sometimes even lower-level questions. Evaluative reasoning is the process of synthesizing the answers to lower- and mid-level questions into defensible judgements that directly answer the high-level questions.

Read more on defining success to make evaluative judgements:

  • Determine what success looks like
  • Evaluation rubrics: How to ensure transparent and clear assessment that respects diverse lines of evidence
  • UNICEF Brief 4. Evaluative Reasoning

Using a theory of change

Evaluations produce stronger and more useful findings if they not only investigate the links between activities and impacts but also investigate links along the causal chain between activities, outputs, intermediate outcomes and impacts. A ‘theory of change’ that explains how activities are understood to produce a series of results that contribute to achieving the ultimate intended impacts, is helpful in guiding causal attribution in an impact evaluation.

A theory of change should be used in some form in every impact evaluation. It can be used with any research design that aims to infer causality, can draw on a range of qualitative and quantitative data, and can support triangulation of the data arising from a mixed methods impact evaluation.

When planning an impact evaluation and developing the terms of reference, any existing theory of change for the programme or policy should be reviewed for appropriateness, comprehensiveness and accuracy, and revised as necessary. It should continue to be revised over the course of the evaluation should either the intervention itself or the understanding of how it works – or is intended to work – change.

Some interventions cannot be fully planned in advance, however – for example, programmes in settings where implementation has to respond to emerging barriers and opportunities, such as supporting the development of legislation in a volatile political environment. In such cases, different strategies will be needed to develop and use a theory of change for impact evaluation (Funnell and Rogers 2012). For some interventions, it may be possible to document the emerging theory of change as different strategies are trialled and adapted or replaced. In other cases, there may be a high-level theory of how change will come about (e.g., through the provision of incentives) alongside an emerging theory about what has to be done in a particular setting to bring this about. In still other cases, the intervention may be fundamentally about adaptive learning, in which case the theory of change should focus on articulating how the various actors gather and use information together to make ongoing improvements and adaptations.

A theory of change can support an impact evaluation in several ways. It can identify:

  • specific evaluation questions, especially in relation to those elements of the theory of change for which there is no substantive evidence yet
  • relevant variables that should be included in data collection 
  • intermediate outcomes that can be used as markers of success in situations where the impacts of interest will not occur during the time frame of the evaluation
  • aspects of implementation that should be examined
  • potentially relevant contextual factors that should be addressed in data collection and in analysis, to look for patterns.

The evaluation may confirm the theory of change or it may suggest refinements based on the analysis of evidence. An impact evaluation can check for success along the causal chain and, if necessary, examine alternative causal paths. For example, failure to achieve intermediate results might indicate implementation failure; failure to achieve the final intended impacts might be due to theory failure rather than implementation failure. This has important implications for the recommendations that come out of an evaluation. In cases of implementation failure, it is reasonable to recommend actions to improve the quality of implementation; in cases of theory failure, it is necessary to rethink the whole strategy to achieve impact. 
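
The following sketch illustrates this reasoning only (the causal chain steps and observed results are hypothetical, and this is not a prescribed method): checking where the chain first breaks suggests whether implementation failure or theory failure is the more plausible explanation.

```python
# Illustrative sketch: checking success along a hypothetical causal chain
# to distinguish implementation failure from theory failure.

causal_chain = ["activities delivered", "outputs produced",
                "intermediate outcomes achieved", "final impacts achieved"]

# Hypothetical observed results (True = evidence that the step was achieved).
observed = {
    "activities delivered": True,
    "outputs produced": True,
    "intermediate outcomes achieved": False,
    "final impacts achieved": False,
}

def diagnose(chain, observed):
    """Return a rough diagnosis based on where the causal chain first breaks."""
    for i, step in enumerate(chain):
        if not observed[step]:
            if i < len(chain) - 1:
                return f"Chain breaks at '{step}': points towards implementation failure."
            return f"Earlier results achieved but '{step}' not: points towards theory failure."
    return "Evidence is consistent with the theory of change."

print(diagnose(causal_chain, observed))
```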

Read more on using a theory of change:

  • Develop programme theory/ theory of change
  • UNICEF Brief 2. Theory of Change

Deciding the evaluation methodology

The evaluation methodology sets out how the key evaluation questions (KEQs) will be answered. It specifies designs for causal attribution, including whether and how comparison groups will be constructed, and methods for data collection and analysis.

Strategies and designs for determining causal attribution

Causal attribution is defined by OECD-DAC as:

“Ascription of a causal link between observed (or expected to be observed) changes and a specific intervention.” (OECD-DAC 2010)

This definition does not require that changes are produced solely or wholly by the programme or policy under investigation (UNEG 2013). In other words, it takes into consideration that other causes may also have been involved, for example, other programmes/policies in the area of interest or certain contextual factors (often referred to as ‘external factors’).

There are three broad strategies for causal attribution in impact evaluations:

  • estimating the counterfactual (i.e., what would have happened in the absence of the intervention, compared to the observed situation)
  • checking the consistency of evidence for the causal relationships made explicit in the theory of change
  • ruling out alternative explanations, through a logical, evidence-based process.

Using a combination of these strategies can usually help to increase the strength of the conclusions that are drawn.

There are three design options that address causal attribution:

  • Experimental designs  – which construct a control group through random assignment.
  • Quasi-experimental designs  – which construct a comparison group through matching, regression discontinuity, propensity scores or another means.
  • Non-experimental designs  – which look systematically at whether the evidence is consistent with what would be expected if the intervention was producing the impacts, and also whether other factors could provide an alternative explanation.

Some individuals and organisations use a narrower definition of impact evaluation, and only include evaluations containing a counterfactual of some kind.  These different definitions are important when deciding what methods or research designs will be considered credible by the intended user of the evaluation or by partners or funders.

Read more on strategies and designs for determining causal attribution:

  • Understand causes
  • Compare results to the counterfactual
  • Randomised controlled trial
  • Better use of case studies in evaluation
  • UNICEF Brief 6. Overview: Strategies for Causal Attribution
  • UNICEF Brief 7. Randomized Controlled Trials
  • UNICEF Brief 8. Quasi-experimental Designs and Methods
  • UNICEF Brief 13. Modelling
  • UNICEF Brief 9. Comparative Case Studies

Data collection, management and analysis approach

Well-chosen and well-implemented methods for data collection and analysis are essential for all types of evaluations. Impact evaluations need to go beyond assessing the size of the effects (i.e., the average impact) to identify for whom and in what ways a programme or policy has been successful. What constitutes ‘success’ and how the data will be analysed and synthesized to answer the specific key evaluation questions (KEQs) must be considered upfront as data collection should be geared towards the mix of evidence needed to make appropriate judgements about the programme or policy. In other words, the analytical framework – the methodology for analysing the ‘meaning’ of the data by looking for patterns in a systematic and transparent manner – should be specified during the evaluation planning stage. The framework includes how data analysis will address assumptions made in the programme theory of change about how the programme was thought to produce the intended results. In a true mixed methods evaluation, this includes using appropriate numerical and textual analysis methods and triangulating multiple data sources and perspectives in order to maximize the credibility of the evaluation findings.

Start the data collection planning by reviewing to what extent existing data can be used. After reviewing currently available information, it is helpful to create an evaluation matrix (see below) showing which data collection and analysis methods will be used to answer each KEQ and then identify and prioritize data gaps that need to be addressed by collecting new data. This will help to confirm that the planned data collection (and collation of existing data) will cover all of the KEQs, determine if there is sufficient triangulation between different data sources and help with the design of data collection tools (such as questionnaires, interview questions, data extraction tools for document review and observation tools) to ensure that they gather the necessary information. 
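
A minimal sketch of such an evaluation matrix is shown below (the KEQs, data sources and the simple triangulation check are hypothetical placeholders, not a prescribed template):

```python
# Illustrative sketch of an evaluation matrix: hypothetical KEQs and data sources.

evaluation_matrix = {
    "KEQ1: Was the intervention design relevant?": {
        "existing data": ["programme documents", "national statistics"],
        "new data": ["key informant interviews"],
    },
    "KEQ3: Did the intervention produce the intended results?": {
        "existing data": ["routine monitoring data"],
        "new data": ["household survey", "focus group discussions"],
    },
}

# Flag questions with too few planned sources to allow triangulation.
for keq, sources in evaluation_matrix.items():
    n_sources = sum(len(methods) for methods in sources.values())
    note = "OK" if n_sources >= 2 else "GAP: plan additional data collection"
    print(f"{keq} -> {n_sources} planned sources ({note})")
```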

There are many different methods for collecting data. Although many impact evaluations use a variety of methods, what distinguishes a ‘mixed methods evaluation’ is the systematic integration of quantitative and qualitative methodologies and methods at all stages of an evaluation (Bamberger 2012). A key reason for mixing methods is that it helps to overcome the weaknesses inherent in each method when used alone. It also increases the credibility of evaluation findings when information from different data sources converges (i.e., they are consistent about the direction of the findings) and can deepen the understanding of the programme/policy, its effects and context (Bamberger 2012).

Good data management includes developing effective processes for: consistently collecting and recording data, storing data securely, cleaning data, transferring data (e.g., between different types of software used for analysis), effectively presenting data and making data accessible for verification and use by others.

The particular analytic framework and the choice of specific data analysis methods will depend on the purpose of the impact evaluation and the type of KEQs that are intrinsically linked to this.

For answering  descriptive  KEQs, a range of analysis options is available, which can largely be grouped into two key categories: options for quantitative data (numbers) and options for qualitative data (e.g., text).

For answering  causal  KEQs, there are essentially three broad approaches to causal attribution analysis: (1) counterfactual approaches; (2) consistency of evidence with causal relationship; and (3) ruling out alternatives (see above). Ideally, a combination of these approaches is used to establish causality.

For answering  evaluative  KEQs, specific evaluative rubrics linked to the evaluative criteria employed (such as the OECD-DAC criteria) should be applied in order to synthesize the evidence and make judgements about the worth of the intervention (see above).

Read more on data collection, management and analysis approach:

  • Collect and/or Retrieve Data
  • Manage Data
  • Analyse Data
  • Combine Qualitative and Quantitative Data
  • UNICEF Brief 10. Overview: Data Collection and Analysis Methods in Impact Evaluation
  • UNICEF Brief 12. Interviewing

7. How can the findings be reported and their use supported?

The evaluation report should be structured in a manner that reflects the purpose and KEQs of the evaluation.

In the first instance, evidence to answer the detailed questions linked to the OECD-DAC criteria of relevance, effectiveness, efficiency, impact and sustainability, and considerations of equity, gender equality and human rights should be presented succinctly but with sufficient detail to substantiate the conclusions and recommendations.

The specific evaluative rubrics should be used to ‘interpret’ the evidence and determine which considerations are critically important or urgent. Evidence on multiple dimensions should subsequently be synthesized to generate answers to the high-level evaluative questions.

The structure of an evaluation report can do a great deal to encourage the succinct reporting of direct answers to evaluative questions, backed up by enough detail about the evaluative reasoning and methodology to allow the reader to follow the logic and clearly see the evidence base.

The following recommendations will help to set clear expectations for evaluation reports that are strong on evaluative reasoning:

  • The executive summary must contain direct and explicitly evaluative answers to the KEQs used to guide the whole evaluation.
  • Explicitly evaluative language must be used when presenting findings (rather than value-neutral language that merely describes findings), with examples provided.
  • Clear and simple data visualization should be used to present easy-to-understand ‘snapshots’ of how the intervention has performed on the various dimensions of merit (see the sketch after this list).
  • The findings section should be structured using KEQs as subheadings (rather than types and sources of evidence, as is frequently done).
  • There must be clarity and transparency about the evaluative reasoning used, with explanations clearly understandable to both non-evaluators and readers without deep content expertise in the subject matter. These explanations should be broad and brief in the main body of the report, with more detail available in annexes.
  • If evaluative rubrics are relatively small, they should be included in the main body of the report; if they are large, a brief summary of at least one or two should be included in the main body, with all rubrics included in full in an annex.
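
As an illustration of the kind of performance ‘snapshot’ referred to above (the criteria and ratings plotted are hypothetical, and the sketch assumes the matplotlib library is available):

```python
# Illustrative sketch: a simple 'performance snapshot' chart.
# Criteria and ratings are hypothetical; any charting tool could be used instead.
import matplotlib.pyplot as plt

criteria = ["Relevance", "Effectiveness", "Efficiency", "Impact", "Sustainability"]
ratings = [4, 3, 2, 3, 2]  # e.g. rubric levels on a 1 (poor) to 4 (excellent) scale

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(criteria, ratings, color="steelblue")
ax.set_xlim(0, 4)
ax.set_xlabel("Rubric rating (1 = poor, 4 = excellent)")
ax.set_title("Performance snapshot by evaluative criterion (hypothetical)")
ax.invert_yaxis()  # list criteria top-to-bottom in the given order
plt.tight_layout()
plt.savefig("performance_snapshot.png")
```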

Read more on how the findings can be reported and their use supported:

  • Develop reporting media
  • Visualise data

Page contributors

The content for this page was compiled by: Greet Peersman 

The content is based on ‘UNICEF Methodological Briefs for Impact Evaluation’, a collaborative project between the UNICEF Office of Research – Innocenti, BetterEvaluation, RMIT University and the International Initiative for Impact Evaluation (3ie). The briefs were written by (in alphabetical order): E. Jane Davidson, Thomas de Hoop, Delwyn Goodrick, Irene Guijt, Bronwen McDonald, Greet Peersman, Patricia Rogers, Shagun Sabarwal, Howard White.

Overviews/introductions to impact evaluation

This paper, written by Patricia Rogers for UNICEF, outlines the basic ideas and principles of impact evaluation. It includes a discussion of the different elements and options for the different stages of conducting an impact evaluation.

Discussion Papers

This paper provides a summary of debates about measuring and attributing impacts.

This special edition of the IDS Bulletin presents contributions from the event 'Impact Innovation and Learning: Towards a Research and Practice Agenda for the Future', organised by IDS in March 2013.

This paper, written by Sandra Nutley, Alison Powell and Huw Davies for the Alliance for Useful Evidence, discusses the risks of using a hierarchy of evidence and suggests an alternative in which more complex matrix approaches are used for identifying evidence quality.

  • InterAction Impact Evaluation Guidance Notes and Webinar Series

Rogers P (2012).  Introduction to Impact Evaluation. Impact Evaluation Notes No. 1 . Washington DC: InterAction. – This guidance note outlines the basic principles and ideas of Impact Evaluation including when, why, how and by whom it should be done. 

Perrin B (2012).  Linking Monitoring and Evaluation to Impact Evaluation. Impact Evaluation Notes No.2.   Washington DC: InterAction. – This guidance note outlines how monitoring and evaluation (M&E) activities can support meaningful and valid impact evaluation.

Bamberger M (2012).  Introduction to Mixed Methods in Impact Evaluation. Guidance Note No. 3.  Washington DC: InterAction. – This guidance note provides an outline of a mixed methods impact evaluation with particular reference to the difference between this approach and qualitative and quantitative impact evaluation designs.

Bonbright D (2012).  Use of Impact Evaluation Results. Guidance Note No. 4.  Washington DC: InterAction. – This guidance note highlights three themes that are crucial for effective utilization of evaluation results.

Realist impact evaluation is an approach to impact evaluation that emphasises the importance of context for programme outcomes.

“Gender affects everyone, all of the time. Gender affects the way we see each other, the way we interact, the institutions we create, the ways in which those institutions operate, and who benefits or suffers as a result of this.” (Fletcher 2015: 19)

This document provides an overview of the utility of and specific guidance and a tool for implementing an evaluability assessment before an impact evaluation is undertaken.

Many development programme staff have had the experience of commissioning an impact evaluation towards the end of a project or programme only to find that the monitoring system did not provide adequate data about implementation, context or baselines.

  • Additional guidance documents can be found here

Bamberger M (2012). Introduction to Mixed Methods in Impact Evaluation. Guidance Note No. 3. Washington DC: InterAction. See: https://www.interaction.org/blog/impact-evaluation-guidance-note-and-webinar-series/

Funnell S and Rogers P (2012). Purposeful Program Theory: Effective Use of Logic Models and Theories of Change. San Francisco: Jossey-Bass/Wiley.

OECD-DAC (1991). Principles for Evaluation of Development Assistance. Paris: Organisation for Economic Co-operation and Development – Development Assistance Committee (OECD-DAC). See: http://www.oecd.org/dac/evaluation/50584880.pdf

OECD-DAC (2010). Glossary of Key Terms in Evaluation and Results Based Management. Paris: Organisation for Economic Co-operation and Development – Development Assistance Committee (OECD-DAC). See: http://www.oecd.org/development/peer-reviews/2754804.pdf

OECD-DAC (accessed 2015). Evaluation of development programmes. DAC Criteria for Evaluating Development Assistance. Organisation for Economic Co-operation and Development – Development Assistance Committee (OECD-DAC). See: http://www.oecd.org/dac/evaluation/daccriteriaforevaluatingdevelopmentassistance.htm

Stufflebeam D (2001). Evaluation values and criteria checklist. Kalamazoo: Western Michigan University Checklist Project. See: https://www.dmeforpeace.org/resource/evaluation-values-and-criteria-checklist/

UNEG (2013). Impact Evaluation in UN Agency Evaluation Systems: Guidance on Selection, Planning and Management. Guidance Document. New York: United Nations Evaluation Group (UNEG). See: http://www.uneval.org/papersandpubs/documentdetail.jsp?doc_id=1434

Resources related to 'Impact evaluation'

  • Introduction to impact evaluation
  • Linking monitoring and evaluation to impact evaluation
  • Sinopsis de la evaluación de impacto
  • Addressing gender in impact evaluation
  • Assessing rural transformations: Piloting a qualitative impact protocol in Malawi and Ethiopia
  • Attributing development impact: The qualitative impact protocol (QuIP) case book
  • Bath social & developmental research ltd. (BSDR) website
  • Broadening the range of designs and methods for impact evaluations
  • Case study: QuIP & RCT to evaluate a cash transfer and gender training programme in Malawi
  • Cases in outcome harvesting
  • Causal link monitoring brief
  • Clearing the fog: New tools for improving the credibility of impact claims
  • Como elaborar modelo lógico:roteiro para formular programas e organizar avaliação
  • Comparative case studies
  • Comparing QuIP with thirty other approaches to impact evaluation
  • Contribution analysis in policy work: Assessing advocacy’s influence
  • Contribution analysis: A promising method for assessing advocacy's impact
  • Designing impact evaluations: Different perspectives
  • Designing quality impact evaluations under budget, time and data constraints
  • Developing and selecting measures of child well-being
  • Diseño de evaluaciones de impacto: Perspectivas diversas
  • Does our theory match your theory? Theories of change and causal maps in Ghana
  • Estudo de caso: a avaliação externa de um programa
  • Evaluability assessment for impact evaluation
  • Evaluations that make a difference
  • Evaluative criteria
  • Evaluative reasoning
  • Finding and using causal hotspots: A practice in the making
  • From narrative text to causal maps: QuIP analysis and visualisation
  • Impact evaluation in practice
  • Impact evaluation toolkit
  • Impact evaluation: A guide for commissioners and managers
  • Impact evaluation: How to institutionalize evaluation
  • Impact evaluations and development
  • Institutionalizing evaluation: A review of international experience
  • Interviewing
  • Introducción a la evaluación de impacto
  • Introduction to mixed methods in impact evaluation
  • Introduction to randomized control trials
  • Introduction à l’évaluation d’impact
  • Learning through and about contribution analysis for impact evaluation
  • Méthodologie de l’évaluation d’impact : présentation de différentes approches
  • O sistema de monitoramento e avaliação dos programas de promoção e proteção social do Brasil
  • Overview of impact evaluation
  • Overview: Data collection and analysis methods in impact evaluation
  • Overview: Strategies for causal attribution
  • Participatory approaches
  • Participatory impact assessment: A design guide
  • Process tracing and contribution analysis: A combined approach to generative causal inference for impact evaluation
  • Prosaic or profound? The adoption of systems ideas by impact evaluation
  • Présentation de l'évaluation d’impact
  • Présentation des méthodes de collecte et d'analyse de données dans l'évaluation d'impact
  • Présentation des stratégies d'attribution causale
  • QuIP and the Yin/Yang of Quant and Qual: How to navigate QuIP visualisations
  • QuIP used as part of an evaluation of the impact of the UK Government Tampon Tax Fund (TTF)
  • QuIP: Understanding clients through in-depth interviews
  • Qualitative impact assessment protocol (QuIP)
  • Quantitative and qualitative methods in impact evaluation and measuring results
  • Quasi-experimental design and methods
  • Quasi-experimental methods for impact evaluations
  • Randomised control trials for the impact evaluation of development initiatives: a statistician's point of view
  • Randomized controlled trials (RCTs)
  • Randomized controlled trials (RCTs) video guide
  • Realist impact evaluation: An introduction
  • Sinopsis: estrategias de atribución causal
  • Sinopsis: métodos de recolección y análisis de datos en la evaluación de impacto
  • Systematic reviews
  • The importance of a methodologically diverse approach to impact evaluation
  • The theory of change
  • Theory of change
  • Tools and tips for implementing contribution analysis
  • UNICEF Impact Evaluation series
  • UNICEF webinar: Comparative case studies
  • UNICEF webinar: Overview of data collection and analysis methods in Impact Evaluation
  • UNICEF webinar: Overview of impact evaluation
  • UNICEF webinar: Overview: strategies for causal inference
  • UNICEF webinar: Participatory approaches in impact evaluation
  • UNICEF webinar: Quasi-experimental design and methods
  • UNICEF webinar: Randomized controlled trials
  • UNICEF webinar: Theory of change
  • Using evidence to inform policy
  • What is impact evaluation?
  • مقدمة لتقييم الأثر
  • 关于影响评估的设计: 不同的视角

'Impact evaluation' is referenced in:

  • 52 weeks of BetterEvaluation: Week 44: How can monitoring data support impact evaluations?
  • Impact evaluation: challenges to address

Framework/Guide

  • Rainbow Framework: Investigate possible alternative explanations


Designing an impact evaluation work plan: a step-by-step guide

May 4, 2021.

This article is the second part of our 2-part series on impact evaluation. In the first article, “Impact evaluation: overview, benefits, types and planning tips”, we introduced impact evaluation and some helpful steps for planning and incorporating it into your M&E plan.

In this blog, we will walk you through the next steps in the process – from understanding the core elements of an impact evaluation work plan to designing your own impact evaluation to identify the real difference your interventions are making on the ground. Elements in the work plan include, but are not limited to, the purpose, scope and objectives of the evaluation, key evaluation questions, designs and methodologies, and more. Stay with us as we dive into each element of the impact evaluation work plan!

Key elements in an impact evaluation work plan

Developing an appropriate evaluation design and work plan is critically important in impact evaluation. Evaluation work plans are also called terms of reference (ToR) in some organisations. While the format of an evaluation design may vary on a case-by-case basis, it should always include some essential elements:

  • Background and context
  • The purpose, objectives and scope of the evaluation
  • Theory of change (ToC)
  • Key evaluation questions the evaluation aims to answer
  • Proposed designs and methodologies
  • Data collection methods 
  • Specific deliverables and timelines

1. Background and context

This section provides information on the background of the intervention to be evaluated. The description should be concise (ideally under one page) and focus only on issues pertinent to the evaluation: the intended objectives of the intervention, the timeframe and progress achieved at the time of the evaluation, the key stakeholders involved, and the organisational, social, political and economic factors that may influence the intervention’s implementation.

2. Defining impact evaluation purpose, objectives and scope

Consultation with the key stakeholders is vital to determine the purpose, objectives and scope of the evaluation and identify some of its other important parameters. 

The evaluation purpose refers to the rationale for conducting an impact evaluation. Evaluations that are being undertaken to support learning should be clear about who is intended to learn from it, how they will be engaged in the evaluation process to ensure it is seen as relevant and credible, and whether there are specific decision points around where this learning is expected to be applied. Evaluations that are being undertaken to support accountability should be clear about who is being held accountable, to whom and for what.  

The objective of impact evaluation reflects what the evaluation aims to find out. It may be to measure impact and/or to analyse the mechanisms producing the impact. It is best to have no more than two or three objectives so that the team can explore a few issues in depth rather than examine a broader set superficially.

The scope of the evaluation includes the time period, the geographical and thematic coverage of the evaluation, the target groups and the issues to be considered. The scope of the evaluation must be realistic given the time and resources available. Specifying the evaluation scope enables clear identification of the implementing organisation’s expectations and of the priorities that the evaluation team must focus on in order to avoid wasting its resources on areas of secondary interest. The central scope is usually specified in the work plan or the terms of reference (ToR) and the extended scope in the inception report.

3. Theory of change (ToC)

A theory of change (ToC), or project framework, is a vital building block for any evaluation work, and every evaluation should begin with one. A ToC may also be represented in the form of a logic model or a results framework. It illustrates project goals, objectives, outcomes and assumptions underlying the theory and explains how project activities are expected to produce a series of results that contribute to achieving the intended or observed project objectives and impacts.

A ToC also identifies which aspects of the interventions should be examined, what contextual factors should be addressed, what the likely intermediate outcomes will be and how the validity of the assumptions will be tested. Plus, a ToC explains what data should be gathered and how it will be synthesized to reach justifiable conclusions about the effectiveness of the intervention. Alternative causal paths and major external factors influencing outcomes may also be identified in a project theory. 

A ToC also helps to identify gaps in logic or evidence that the evaluation should focus on, and provides the structure for a narrative about the value and impact of an intervention. All in all, a ToC helps the project team to determine the best impact evaluation methods for their intervention. ToCs should be reviewed and revised on a regular basis and kept up to date at all stages of the project lifecycle – be this at project design, implementation, delivery, or close.

More on the theory of change, logic model and results framework.

4. Key impact evaluation questions

Impact evaluations should be focused on key evaluation questions that reflect the intended use of the evaluation. Impact evaluation will generally answer three types of questions: descriptive, causal or evaluative. Each type of question can be answered through a combination of different research designs and data collection and analysis mechanisms.  

  • Descriptive questions ask about how things were and how they are now and what changes have taken place since the intervention. 
  • Causal questions ask what produced the changes and whether or not, and to what extent, observed changes are due to the intervention rather than other factors.
  • Evaluative questions ask about the overall value of the intervention, taking into account intended and unintended impacts. They help determine whether the intervention can be considered a success, an improvement or the best option.

Examples of key evaluation questions for impact evaluation based on the OECD-DAC evaluation criteria.

5. Impact evaluation design and methodologies

Measuring direct causes and effects can be quite difficult; the choice of methods and designs for impact evaluation is therefore not straightforward and comes with its own set of challenges. There is no one right way to undertake an impact evaluation: all the potential options should be discussed, and a combination of different methods and designs that suit the particular situation should be considered.

Generally, the evaluation methodology is designed on the basis of how the key descriptive, causal and evaluative evaluation questions will be answered, how data will be collected and analysed, the nature of the intervention being evaluated, the available resources and constraints and the intended use of the evaluation. 

The choice of methods and designs also depends on how causal attribution will be addressed, including whether comparison groups are needed and how they will be constructed. In some cases, quantifying the impacts of an intervention requires estimating the counterfactual, that is, what would have happened to the beneficiaries in the absence of the intervention. In most cases, mixed-method approaches are recommended, as they build on qualitative and quantitative data and make use of several methodologies for analysis.

In all types of evaluations, it is important to dedicate sufficient time to developing a sound evaluation design before any data collection or analysis begins. The proposed design must be reviewed at the beginning of the evaluation and updated on a regular basis – this helps to manage the quality of the evaluation throughout the entire project cycle. Engaging a broad range of stakeholders, following established ethical standards, and using an evaluation reference group to review the evaluation design and draft reports all contribute to ensuring the quality of the evaluation.

Descriptive Questions

In most cases, an effective combination of quantitative and qualitative data will provide a more comprehensive picture of what changes have taken place since the intervention. Data collection options include, but are not limited to: interviews and questionnaires; structured or unstructured and participatory or non-participatory observations recorded through notes, photos or video; biophysical measurements or geographical information; and existing documents and data, including existing data sets, official statistics, project records, social media data and more.

Causal Questions

Answering causal questions requires a research design that addresses “attribution” and “contribution”. Attribution means that the observed changes were caused by the intervention; contribution means that the intervention partially caused, or contributed to, the changes. In practice, it is difficult for an organisation to fully claim attribution for a change, because changes within a community are likely to result from a mix of factors beyond the intervention itself, such as changes in the economic and social environment or in national policy.

The design for answering causal questions could be ‘experimental,’ ‘quasi-experimental’ or ‘non-experimental.’ Let’s take a look at each design separately:

Experimental: involves the construction of a control group through random assignment of participants. Experimental designs can produce highly credible impact estimates but are often expensive and, for certain interventions, difficult to implement. Examples of experimental designs include:

  • Randomized controlled trial (RCT) – In this type of experiment, two groups, a treatment group and a comparison group, are created and participants are assigned to each group at random. Before the intervention, the two groups are statistically identical in terms of both observed and unobserved factors, but the group receiving the treatment should gradually show changes as the project progresses. Outcome data for the comparison and treatment groups, together with baseline data and background variables, help to determine the change (a small numerical sketch follows).
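
A minimal sketch of the estimate behind a randomised design (using simulated data and a made-up treatment effect, purely for illustration) is the difference in mean outcomes between the randomly assigned groups:

```python
# Illustrative sketch: estimating the average treatment effect in a randomised design.
# Data are simulated; the 'true' effect of 5 points is a made-up example.
import random

random.seed(1)
participants = list(range(200))
random.shuffle(participants)
treatment, control = participants[:100], participants[100:]

def simulated_outcome(in_treatment: bool) -> float:
    """Hypothetical outcome score: baseline noise plus a 5-point treatment effect."""
    return random.gauss(50, 10) + (5 if in_treatment else 0)

treat_outcomes = [simulated_outcome(True) for _ in treatment]
control_outcomes = [simulated_outcome(False) for _ in control]

# Difference in means between randomly assigned groups estimates the average treatment effect.
ate = sum(treat_outcomes) / len(treat_outcomes) - sum(control_outcomes) / len(control_outcomes)
print(f"Estimated average treatment effect: {ate:.1f} points")
```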

Quasi-experimental: unlike an experimental design, a quasi-experimental design involves constructing a valid comparison group through matching, regression discontinuity, propensity scores or other statistical means in order to control for and measure the differences between individuals treated by the intervention being evaluated and those not treated. Examples of quasi-experimental designs include:

  • Difference-in-differences: this measures improvement or change over time of an intervention’s participants relative to the improvement or change of non-participants.
  • Propensity score matching: Individuals in the treatment group are matched with non-participants who have similar observable characteristics. The average difference in outcomes between matched individuals is the estimated impact. This method is based on the assumption that there is no unobserved difference in the treatment and comparison group.
  • Matched comparisons: this design compares the differences between participants of an intervention being evaluated with the non participants after the intervention is completed.
  • Regression discontinuity: in this design, individuals are ranked on a specific, measurable criterion, with a cut-off point determining who is eligible to participate. Impact is measured by comparing the outcomes of participants and non-participants close to the cut-off. The analysis uses outcome data, data on the ranking criterion (e.g. age or an index score) and data on socioeconomic background variables.
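As referenced in the difference-in-differences bullet above, here is a stylised sketch of that calculation. The group means are invented solely to illustrate the arithmetic; a real evaluation would estimate them from baseline and follow-up data, typically in a regression framework.

```python
# Difference-in-differences with invented group means (illustration only).
participants_before, participants_after = 40.0, 55.0            # intervention group
non_participants_before, non_participants_after = 42.0, 48.0    # comparison group

change_participants = participants_after - participants_before              # 15.0
change_non_participants = non_participants_after - non_participants_before  # 6.0

# The impact estimate is the difference between the two changes, under the
# assumption that both groups would have followed parallel trends in the
# absence of the intervention.
estimated_impact = change_participants - change_non_participants
print(f"Difference-in-differences estimate: {estimated_impact:.1f}")        # 9.0
```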

Non-experimental: when experimental and quasi-experimental designs are not possible, a non-experimental design can be used for impact evaluation. This design takes a systematic look at whether the evidence is consistent with what would be expected if the intervention were producing the impacts, and at whether other factors could provide an alternative explanation.

  • Hypothetical and logical counterfactuals: a counterfactual is an estimate of what would have happened in the absence of the intervention. This approach involves consulting key informants to identify either a hypothetical counterfactual (what they think would have happened without the intervention) or a logical counterfactual (what would logically have happened without it).
  • Qualitative comparative analysis: this design is particularly useful where there are a number of different ways of achieving positive impacts, and where data can be iteratively gathered about a number of cases to identify and test patterns of success.

Evaluative Questions

To answer these questions, one needs to identify criteria against which to judge the results and to decide how well the intervention performed overall, or how successful or unsuccessful it was. This includes determining what level of impact will count as significant. Once the appropriate data are gathered, the results are judged against the evaluative criteria.

For this type of question, you should have a clear understanding of what counts as ‘success’: is it an improvement in quality, in value, or both? One way to make this explicit is to use a rubric that defines different levels of performance for each evaluative criterion, and to decide in advance what evidence will be gathered and how it will be synthesised to reach defensible conclusions about the worth of the intervention (a simple illustration follows below).
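As a purely illustrative aid, the sketch below represents a hypothetical rubric as a small lookup structure; the criteria, levels and thresholds are invented, and in practice they would be agreed with stakeholders for the specific intervention.

```python
# A hypothetical rubric: each evaluative criterion maps to descriptions of
# what counts as poor, adequate, or excellent performance.
rubric = {
    "reach": {
        "poor": "Fewer than a quarter of intended beneficiaries were reached.",
        "adequate": "Between a quarter and three quarters were reached.",
        "excellent": "More than three quarters were reached.",
    },
    "sustainability": {
        "poor": "Benefits are unlikely to persist after funding ends.",
        "adequate": "Some benefits are expected to persist in the short term.",
        "excellent": "Benefits are embedded in local institutions or policy.",
    },
}

# Judging the evidence: the evaluator records the level reached on each
# criterion, and these ratings are then synthesised into an overall judgement.
ratings = {"reach": "adequate", "sustainability": "excellent"}
for criterion, level in ratings.items():
    print(f"{criterion}: {level} - {rubric[criterion][level]}")
```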

These are just a handful of commonly used impact evaluation methodologies in international development. To explore more methodologies, see the Australian Government’s guidelines, “Choosing Appropriate Designs and Methods for Impact Evaluation”.

6. Data collection methods for impact evaluation

According to BetterEvaluation, well-chosen and well-implemented methods for data collection and analysis are essential for all types of evaluations and must be specified during the evaluation planning stage. One should have a clear understanding of the objectives and assumptions of the intervention, what baseline data exist and are available for use, what new data need to be collected, how frequently and in what form, and what data beneficiaries are expected to provide.

Reviewing the key evaluation questions can help to determine which data collection and analysis method can be used to answer each question and which data collection tools can be leveraged to gather all the necessary information. Sources for data can be stakeholder interviews, project documents, survey data, meeting minutes, and statistics, among others. 

However, many outcomes of a development intervention are complex and multidimensional and may not be captured by a single method. Using a combination of qualitative and quantitative data collection methods, also called a mixed-methods approach, is therefore highly recommended: it combines the strengths and counteracts the weaknesses of qualitative and quantitative tools, allows for a stronger evaluation design overall, and provides a better understanding of the dynamics and results of the intervention.

But how do you know which method is right for you? 

It is a good idea to consider all possible impact evaluation methods and to weigh their advantages and disadvantages carefully before making a choice. The methods you select must be credible, useful and cost-effective in producing the information that matters for your intervention. As mentioned above, many impact evaluations use mixed methods, a combination of qualitative and quantitative approaches. Each method’s shortcomings can be offset by using it in combination with others. Combining methods also increases the credibility of evaluation findings, as information from different data sources converges, and helps the team gain a deeper understanding of the intervention, its effects and its context.

7. Impact evaluation deliverables and timelines

Deliverables include an ‘inception report’, a ‘draft report’ and the ‘final evaluation report’; for complex evaluations, ‘monthly progress reports’ might also be required. These reports contain detailed descriptions of the methodology that will be used to answer the evaluation questions, as well as the proposed sources of information and data collection procedures. They should also set out a detailed schedule of tasks, activities and deliverables, and clarify the roles and responsibilities of each member of the evaluation team.

We hope you found this article helpful. Our intention behind this two-part series was to explain impact evaluations and their key components in a simple manner so that you can plan and implement your own impact evaluation more accurately and effectively.

Before we sign off, a quick reminder that this list is not exhaustive; it covers a few key elements that many organisations choose to include in their impact evaluation work plan or ToR. If your organisation includes additional elements in its evaluation work plan, do reach out to us and we would be happy to add them here.

This article is partly based on the Methodological brief “Overview of Impact Evaluation,” by Patricia Rogers at UNICEF, 2014.

Additional Resources:

  • Outline of Principles of Impact Evaluation – OECD
  • Technical Note on Impact Evaluation – USAID
  • Impact Evaluation – BetterEvaluation

By Chandani Lopez Peralta, Content Marketing Manager at TolaData.


Tips for writing Impact Evaluation Grant Proposals

David McKenzie

Recently I’ve done more than my usual amount of reviewing of grant proposals for impact evaluation work – both for World Bank research funds and for several outside funders. Many of these have been very good, but I’ve noticed a number of common issues that have cropped up across them – so I thought I’d share some pet peeves/tips/suggestions for people preparing these types of proposals.

First, let me note that writing these types of proposals is a skill that takes work and practice. One thing I lacked as an assistant professor at Stanford was experienced senior colleagues to encourage and give advice on grant proposals – and it took a few attempts before I was successful in getting funding. Here are some things I find a lot of proposals lack:

  • Sufficient detail about the intervention – details matter both for understanding whether this is an impact evaluation that is likely to be of broader interest and for understanding which outcomes should be measured and what the likely channels of influence are. So don’t just say you are evaluating a cash transfer program – I want to know what the eligibility criteria are, what the payment levels are, the duration of the program, etc.

  • Clearly stating the main equations to be estimated – including what the main outcomes are and what your key hypotheses are.

  • Sufficient detail about measurement of key outcomes – especially if your outcomes are indices or outcomes for which multiple alternative measures are possible. E.g. if female empowerment is an outcome, you need to tell us how this will be measured. If you want to look at treatment heterogeneity by risk aversion, how will you measure this?

  • How will you know why it hasn’t worked if it doesn’t work – a.k.a. spelling out mechanisms and a means to test them. E.g. if you are looking at a business training program, you might not find an effect because i) people don’t attend; ii) they attend but don’t learn anything; iii) they learn the material but then don’t implement it in their businesses; iv) they implement the practices but doing so has no effect; etc. While we all hope our interventions have big detectable effects, we also want impact evaluations to be able to explain why an intervention didn’t work if there turns out to be no effect.

  • Discussion of timing of follow-up: are you planning multiple follow-up rounds? If only one round, why did you choose one year as the follow-up survey date – is this really the most important follow-up period of interest for policy and theory?

  • Discuss what you expect survey response rates to be, and what you will do about attrition. Do you have evidence from other similar surveys of what response rates are likely to be? Do you have administrative data you can use to provide more detail on attritors, or will you be using a variety of survey techniques to reduce attrition? If so, what will these be?

  • Power calculations: it is not enough to say “power calculations suggest a sample of 800 in each group will be sufficient” – you should provide sufficient detail on the assumed means and standard deviations, assumed autocorrelations (and, for cluster trials, intra-cluster correlations) so that a reviewer can replicate these power calculations and test their sensitivity to different assumptions (see the sketch after this list).

  • A detailed budget narrative: don’t just say survey costs are $200,000 and travel is $25,000. Price out flights, describe the per-survey costs, and explain why the budget is reasonable.

  • Tell the reviewers why it is likely you will succeed. There is a lot to do to successfully pull off all the steps in a successful impact evaluation, and even if researchers do everything they can, it is inherently a risky business trying to evaluate policies that are subject to so many external forces. So researchers who have a track record of taking previous impact evaluations through to completion and publication should make this experience clear. But if this is your first impact evaluation, you need to give reviewers some detail on what makes it likely you will succeed – have you previously done fieldwork as an RA? Have you attended a course or clinic to help you design an evaluation? Do you have senior mentors attached to your project? Are you asking first for money for a small pilot to prove you can at least carry out some key step? Make the case that you know what you are doing. Note this shouldn’t just be lines on a C.V., but some description in the proposal itself of the qualifications of your team.
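On the power-calculation point above, here is a minimal sketch of the kind of detail a reviewer needs to be able to replicate the claim. It uses the statsmodels library and invented assumptions (an outcome standard deviation of 20 and a smallest effect of interest of 5 points, i.e. a standardised effect of 0.25); cluster-randomised designs would additionally need an assumed intra-cluster correlation and a design-effect adjustment.

```python
from statsmodels.stats.power import TTestIndPower

# Invented assumptions that a proposal should state explicitly.
baseline_sd = 20.0                  # assumed standard deviation of the outcome
smallest_effect = 5.0               # smallest effect worth detecting
effect_size = smallest_effect / baseline_sd   # standardised effect (Cohen's d) = 0.25

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.80, ratio=1.0)
print(f"Required sample per arm: {n_per_arm:.0f}")

# Sensitivity check: vary the assumed effect size and see how the
# required sample size changes.
for d in (0.20, 0.25, 0.30):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"assumed effect size {d:.2f} -> about {n:.0f} per arm")
```

Spelling out the assumptions this explicitly is what allows a reviewer to test how sensitive the proposed sample size is to more or less optimistic effect sizes.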

Note I haven’t commented above about links to policy or explicit tests of theory. Obviously you should discuss both, but depending on the funder and their interests, one or the other becomes relatively more important. One concern I have with several grant agencies is how they view policy impact – I’m sympathetic to the view that what may be most useful for informing policy in many cases is not to test the policies themselves but to test underlying mechanisms behind their policies (see a post on this here). So this might involve funding researchers to conduct interventions that are never themselves going to be implemented as policies, but which tell us a lot about how certain policies might or might not work. I think such studies should be scored just as strongly on policy criteria as some studies which look at explicit programs.

For those interested in seeing some examples of good proposals, 3ie has several successful proposals in the right menu here. Anyone else got any pet-peeves they come across when reviewing proposals, or must dos? Those on the grant preparation side, any questions or puzzles you would like to see if our readership has answers for?

David McKenzie

Lead Economist, Development Research Group, World Bank


Impact Evaluation in Practice - Second Edition

The World Bank

The second edition of the Impact Evaluation in Practice handbook is a comprehensive and accessible introduction to impact evaluation for policymakers and development practitioners. First published in 2011, it has been used widely across the development and academic communities. The book incorporates real-world examples to present practical guidelines for designing and implementing impact evaluations. Readers will gain an understanding of impact evaluation and the best ways to use impact evaluations to design evidence-based policies and programs. The updated version covers the newest techniques for evaluating programs and includes state-of-the-art implementation advice, as well as an expanded set of examples and case studies that draw on recent development challenges. It also includes new material on research ethics and partnerships to conduct impact evaluation. The handbook is divided into four sections: Part One discusses what to evaluate and why; Part Two presents the main impact evaluation methods; Part Three addresses how to manage impact evaluations; Part Four reviews impact evaluation sampling and data collection. Case studies illustrate different applications of impact evaluations. The book links to complementary instructional material available online, including an applied case as well as questions and answers. The updated second edition will be a valuable resource for the international development community, universities, and policymakers looking to build better evidence around what works in development.

Editor’s Note: The PowerPoints referenced in the book are unfortunately unavailable. 

This book is a product of the World Bank Group and the Inter-American Development Bank.



Impact Evaluation Designs

ADS 201 requires that each Mission and Washington OU conduct an impact evaluation, if feasible, of any new, untested approach that is anticipated to be expanded in scale or scope through U.S. Government foreign assistance or other funding sources (i.e., a pilot intervention). Pilot interventions should be identified during project or activity design, and the impact evaluation should be integrated into the design of the project or activity. If it is not feasible to undertake an impact evaluation effectively, the Mission or Washington OU must conduct a performance evaluation and document why an impact evaluation was not feasible.

Missions initially identify in their PMP which of their evaluations during a CDCS period will be impact evaluations and which will be performance evaluations. This toolkit’s page on the decision to undertake an impact evaluation is located in that section and may be worth reviewing as Missions prepare more detailed Project MEL plans.

USAID Evaluation Policy encourages Missions to undertake prospective impact evaluations that involve identifying a comparison group or area and collecting baseline data prior to the initiation of the project intervention. This type of impact evaluation can potentially be employed whenever an intervention is delivered to some but not all members of a population, for example, some but not all firms engaged in exporting, or some but not all farms that grow a particular crop. This type of design may also be feasible when USAID projects introduce an intervention on a phased basis.

The identification of a valid comparison group is critical for impact evaluations. In principle, the group or area that receives an intervention should be equivalent to the group or area that does not. The more certain we are that groups are equivalent at the start, the more confident we can be in claiming that any post-intervention difference is due to the project being evaluated. For this reason, USAID evaluation policy prefers a method for selecting a comparison group that is called randomized assignment, as this method for constructing groups that do and do not receive an intervention is more effective than any other when it comes to ensuring that groups are equivalent on a pre-intervention basis.
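To illustrate what checking pre-intervention equivalence can look like in practice, here is a minimal sketch using made-up baseline data and the scipy library; it is only an illustration of the general idea of a baseline balance check, not a procedure prescribed by USAID.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Made-up baseline values of one covariate (e.g., a household income index)
# for the treatment and comparison groups.
treatment_baseline = rng.normal(loc=100, scale=15, size=120)
comparison_baseline = rng.normal(loc=100, scale=15, size=120)

# A two-sample t-test on baseline values: a small difference and a large
# p-value are consistent with the groups being equivalent on this
# characteristic before the intervention.
t_stat, p_value = stats.ttest_ind(treatment_baseline, comparison_baseline)
difference = treatment_baseline.mean() - comparison_baseline.mean()
print(f"Baseline difference in means: {difference:.2f} (p = {p_value:.2f})")
```

In practice such checks are run on a set of baseline characteristics rather than a single one, and balance on observables can never rule out differences on unobserved factors, which is why randomized assignment is preferred.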

Prospective impact evaluations that employ randomized assignment are classified as having experimental designs (also called randomized controlled trials, or RCTs). Any other method of assigning members of a population to treatment and comparison groups, no matter how elaborate or carefully developed, involves decisions by evaluators about the basis on which population members will be assigned to the treatment or comparison groups. Evaluator involvement in these decisions automatically steps away from the 'equal chance' proposition and introduces the possibility of bias (either deliberate or inadvertent) in the assignment process. This results in an impact evaluation that is classified as having a quasi-experimental design. Both experimental and quasi-experimental designs can produce credible impact evaluation findings, but there is a difference, and their classifications signal what that difference is. Along this continuum, the preference in USAID's Evaluation Policy is clear:

For impact evaluations, experimental methods generate the strongest evidence. Alternative methods should be utilized only when random assignment strategies are infeasible.

Whenever a prospective impact evaluation involving treatment and comparison groups is being considered, it is wise to undertake a power analysis to ensure that the number of units (people, locations) available for assignment to these groups is adequate to detect important differences between them.

In addition to prospective impact evaluations, in which equivalent or nearly equivalent groups are established prior to an intervention and followed to determine their post-intervention status on the outcome or effect measures of interest, there are other types of impact evaluation that are useful under specific conditions. These include designs that can be used when all members of a population are exposed to a treatment, such as a policy reform, which in some instances will be undertaken as retrospective impact evaluations. Other impact evaluation designs are intended for use when populations that are known to be different (such as those living above and below a poverty line) are to be compared after the poorer group receives an intervention designed to improve their circumstances. These and other impact evaluation designs used in specialized circumstances are described in Impact Evaluation in Practice and in Experimental and Quasi-Experimental Designs for Generalized Causal Inference, as well as in other publications in this field.

When considering which type of evaluation design to choose, project design teams may find that using an Evaluation Design Decision Tree helps them work their way to the most appropriate option.


FREE 6+ Impact Evaluation Proposal Samples in PDF


Impact assessments frequently include an accountability function to determine the effectiveness of a programme. Impact analyses can also help determine the most effective course of action among several possibilities when it comes to programme design. To evaluate impact effectively, an adequate evaluation design and work proposal must be created; in some organisations, evaluation work proposals are also known as terms of reference. Need some help with creating one? We’ve got you covered! In this article, we provide free and ready-made samples of impact evaluation proposals in PDF and DOC format that you can use for your benefit. Keep on reading to find out more!

An impact evaluation uses a rigorous methodology to identify the changes in outcomes that can be directly linked to a particular intervention through cause-and-effect analysis. Impact evaluations must incorporate the counterfactual – what would have happened in the absence of the intervention – by using an experimental or quasi-experimental design with comparison and treatment groups. A randomised evaluation, for example, compares the outcomes of an experimental group or groups with those of a randomly assigned control group in order to determine the mean net impact of the intervention.

The goal, scope, and objectives of the assessment, the main evaluation questions, the designs and procedures, and more are all included in the work proposal. An impact evaluation proposal template can provide the framework you need to ensure that you have a well-prepared and robust proposal on hand. To do so, you can use one of the ready-made samples mentioned above. If you prefer to write it yourself, follow the steps below:

1. Give the background and setting.

Information on the history of the intervention under evaluation is provided in this section. The description should be succinct, limited to one page, and concentrate only on the evaluation’s key issues, such as the intervention’s intended goals, its timeline and the progress made as of the evaluation, its major participants, and any organisational, social, political, and economic factors that might affect how it is carried out.

2. Identify the scope, objectives, and purpose of the impact evaluation.

The aim, objectives, and scope of the assessment, as well as some of its other crucial criteria, must be decided upon through consultation with the main stakeholders. The research and evaluation goal reflects what the evaluation is trying to find out: it may be done to assess the impact and to examine the factors driving it.

3. List the main questions for the impact evaluation.

Impact evaluations must concentrate on the main evaluation questions that correspond to the evaluation’s stated purpose. Impact evaluations typically answer three sorts of questions: descriptive, causal, and evaluative. Each sort of question may be addressed by combining different research methodologies, data gathering techniques, and analytic techniques.

4. Design and implement strategies for impact evaluation.

Choosing methodologies and designs for the impact evaluation of interventions is not simple and comes with its own set of obstacles, since measuring direct causes and effects can be challenging. There is no single correct approach to conducting an impact evaluation; instead, all available options should be discussed, and a variety of techniques and designs should be combined depending on the circumstances.

Impact analysis can also refer to analysing the effects of modifications to a deployed product or application; in that sense, it provides details on the parts of a system that may be affected by changes to a certain portion or set of an application’s features.

Impact evaluations rely on rigorous methodologies in order to identify the changes in outcomes that may be directly linked to a particular intervention based on cause-and-effect analysis.

Impact evaluation can swiftly produce customised results, supporting decisions in real time and increasing the likelihood that research findings will influence and enhance development decisions.

Overall, this approach enables you to recognise crucial business processes and foresee the effects that an interruption to one of those processes might have. It also enables you to compile the data required to create recovery plans and reduce potential losses. To help you get started, download our customisable and comprehensive samples of impact evaluation proposals today!



Evaluation of research proposals by peer review panels: broader panels for broader assessments?


Rebecca Abma-Schouten, Joey Gijbels, Wendy Reijmerink, Ingeborg Meijer, Evaluation of research proposals by peer review panels: broader panels for broader assessments?, Science and Public Policy , Volume 50, Issue 4, August 2023, Pages 619–632, https://doi.org/10.1093/scipol/scad009


Panel peer review is widely used to decide which research proposals receive funding. Through this exploratory observational study at two large biomedical and health research funders in the Netherlands, we gain insight into how scientific quality and societal relevance are discussed in panel meetings. We explore, in ten review panel meetings of biomedical and health funding programmes, how panel composition and formal assessment criteria affect the arguments used. We observe that more scientific arguments are used than arguments related to societal relevance and expected impact. Also, more diverse panels result in a wider range of arguments, largely for the benefit of arguments related to societal relevance and impact. We discuss how funders can contribute to the quality of peer review by creating a shared conceptual framework that better defines research quality and societal relevance. We also contribute to a further understanding of the role of diverse peer review panels.

Scientific biomedical and health research is often supported by project or programme grants from public funding agencies such as governmental research funders and charities. Research funders primarily rely on peer review, often a combination of independent written review and discussion in a peer review panel, to inform their funding decisions. Peer review panels have the difficult task of integrating and balancing the various assessment criteria to select and rank the eligible proposals. With the increasing emphasis on societal benefit and being responsive to societal needs, the assessment of research proposals ought to include broader assessment criteria, including both scientific quality and societal relevance, and a broader perspective on relevant peers. This results in new practices of including non-scientific peers in review panels ( Del Carmen Calatrava Moreno et al. 2019 ; Den Oudendammer et al. 2019 ; Van den Brink et al. 2016 ). Relevant peers, in the context of biomedical and health research, include, for example, health-care professionals, (healthcare) policymakers, and patients as the (end-)users of research.

Currently, in scientific and grey literature, much attention is paid to what legitimate criteria are and to deficiencies in the peer review process, for example, focusing on the role of chance and the difficulty of assessing interdisciplinary or ‘blue sky’ research ( Langfeldt 2006 ; Roumbanis 2021a ). Our research primarily builds upon the work of Lamont (2009) , Huutoniemi (2012) , and Kolarz et al. (2016) . Their work articulates how the discourse in peer review panels can be understood by giving insight into disciplinary assessment cultures and social dynamics, as well as how panel members define and value concepts such as scientific excellence, interdisciplinarity, and societal impact. At the same time, there is little empirical work on what actually is discussed in peer review meetings and to what extent this is related to the specific objectives of the research funding programme. Such observational work is especially lacking in the biomedical and health domain.

The aim of our exploratory study is to learn what arguments panel members use in a review meeting when assessing research proposals in biomedical and health research programmes. We explore how arguments used in peer review panels are affected by (1) the formal assessment criteria and (2) the inclusion of non-scientific peers in review panels, also called (end-)users of research, societal stakeholders, or societal actors. We add to the existing literature by focusing on the actual arguments used in peer review assessment in practice.

To this end, we observed ten panel meetings in a variety of eight biomedical and health research programmes at two large research funders in the Netherlands: the governmental research funder The Netherlands Organisation for Health Research and Development (ZonMw) and the charitable research funder the Dutch Heart Foundation (DHF). Our first research question focuses on what arguments panel members use when assessing research proposals in a review meeting. The second examines to what extent these arguments correspond with the formal criteria on scientific quality and societal impact creation, as described in the programme brochure and assessment form. The third question focuses on how the arguments used differ between panel members with different perspectives.

2.1 Relation between science and society

To understand the dual focus of scientific quality and societal relevance in research funding, a theoretical understanding and a practical operationalisation of the relation between science and society are needed. The conceptualisation of this relationship affects both who are perceived as relevant peers in the review process and the criteria by which research proposals are assessed.

The relationship between science and society is not constant over time nor static, yet a relation that is much debated. Scientific knowledge can have a huge impact on societies, either intended or unintended. Vice versa, the social environment and structure in which science takes place influence the rate of development, the topics of interest, and the content of science. However, the second part of this inter-relatedness between science and society generally receives less attention ( Merton 1968 ; Weingart 1999 ).

From a historical perspective, scientific and technological progress contributed to the view that science was valuable on its own account and that science and the scientist stood independent of society. While this protected science from unwarranted political influence, societal disengagement with science resulted in less authority by science and debate about its contribution to society. This interdependence and mutual influence contributed to a modern view of science in which knowledge development is valued both on its own merit and for its impact on, and interaction with, society. As such, societal factors and problems are important drivers for scientific research. This warrants that the relation and boundaries between science, society, and politics need to be organised and constantly reinforced and reiterated ( Merton 1968 ; Shapin 2008 ; Weingart 1999 ).

Glerup and Horst (2014) conceptualise the value of science to society and the role of society in science in four rationalities that reflect different justifications for their relation and thus also for who is responsible for (assessing) the societal value of science. The rationalities are arranged along two axes: one is related to the internal or external regulation of science and the other is related to either the process or the outcome of science as the object of steering. The first two rationalities of Reflexivity and Demarcation focus on internal regulation in the scientific community. Reflexivity focuses on the outcome. Central is that science, and thus, scientists should learn from societal problems and provide solutions. Demarcation focuses on the process: science should continuously question its own motives and methods. The latter two rationalities of Contribution and Integration focus on external regulation. The core of the outcome-oriented Contribution rationality is that scientists do not necessarily see themselves as ‘working for the public good’. Science should thus be regulated by society to ensure that outcomes are useful. The central idea of the process-oriented Integration rationality is that societal actors should be involved in science in order to influence the direction of research.

Research funders can be seen as external or societal regulators of science. They can focus on organising the process of science, Integration, or on scientific outcomes that function as solutions for societal challenges, Contribution. In the Contribution perspective, a funder could enhance outside (societal) involvement in science to ensure that scientists take responsibility to deliver results that are needed and used by society. From Integration follows that actors from science and society need to work together in order to produce the best results. In this perspective, there is a lack of integration between science and society and more collaboration and dialogue are needed to develop a new kind of integrative responsibility ( Glerup and Horst 2014 ). This argues for the inclusion of other types of evaluators in research assessment. In reality, these rationalities are not mutually exclusive and also not strictly separated. As a consequence, multiple rationalities can be recognised in the reasoning of scientists and in the policies of research funders today.

2.2 Criteria for research quality and societal relevance

The rationalities of Glerup and Horst have consequences for which language is used to discuss societal relevance and impact in research proposals. Even though the main ingredients are quite similar, as a consequence of the coexisting rationalities in science, societal aspects can be defined and operationalised in different ways ( Alla et al. 2017 ). In the definition of societal impact by Reed, emphasis is placed on the outcome : the contribution to society. It includes the significance for society, the size of potential impact, and the reach , the number of people or organisations benefiting from the expected outcomes ( Reed et al. 2021 ). Other models and definitions focus more on the process of science and its interaction with society. Spaapen and Van Drooge introduced productive interactions in the assessment of societal impact, highlighting a direct contact between researchers and other actors. A key idea is that the interaction in different domains leads to impact in different domains ( Meijer 2012 ; Spaapen and Van Drooge 2011 ). Definitions that focus on the process often refer to societal impact as (1) something that can take place in distinguishable societal domains, (2) something that needs to be actively pursued, and (3) something that requires interactions with societal stakeholders (or users of research) ( Hughes and Kitson 2012 ; Spaapen and Van Drooge 2011 ).

Glerup and Horst show that process and outcome-oriented aspects can be combined in the operationalisation of criteria for assessing research proposals on societal aspects. Also, the funders participating in this study include the outcome—the value created in different domains—and the process—productive interactions with stakeholders—in their formal assessment criteria for societal relevance and impact. Different labels are used for these criteria, such as societal relevance , societal quality , and societal impact ( Abma-Schouten 2017 ; Reijmerink and Oortwijn 2017 ). In this paper, we use societal relevance or societal relevance and impact .

Scientific quality in research assessment frequently refers to all aspects and activities in the study that contribute to the validity and reliability of the research results and that contribute to the integrity and quality of the research process itself. The criteria commonly include the relevance of the proposal for the funding programme, the scientific relevance, originality, innovativeness, methodology, and feasibility ( Abdoul et al. 2012 ). Several studies demonstrated that quality is seen as not only a rich concept but also a complex concept in which excellence and innovativeness, methodological aspects, engagement of stakeholders, multidisciplinary collaboration, and societal relevance all play a role ( Geurts 2016 ; Roumbanis 2019 ; Scholten et al. 2018 ). Another study showed a comprehensive definition of ‘good’ science, which includes creativity, reproducibility, perseverance, intellectual courage, and personal integrity. It demonstrated that ‘good’ science involves not only scientific excellence but also personal values and ethics, and engagement with society ( Van den Brink et al. 2016 ). Noticeable in these studies is the connection made between societal relevance and scientific quality.

In summary, the criteria for scientific quality and societal relevance are conceptualised in different ways, and perspectives on the role of societal value creation and the involvement of societal actors vary strongly. Research funders hence have to pay attention to the meaning of the criteria for the panel members they recruit to help them, and navigate and negotiate how the criteria are applied in assessing research proposals. To be able to do so, more insight is needed in which elements of scientific quality and societal relevance are discussed in practice by peer review panels.

2.3 Role of funders and societal actors in peer review

National governments and charities are important funders of biomedical and health research. How this funding is distributed varies per country. Project funding is frequently allocated based on research programming by specialised public funding organisations, such as the Dutch Research Council in the Netherlands and ZonMw for health research. The DHF, the second largest private non-profit research funder in the Netherlands, provides project funding ( Private Non-Profit Financiering 2020 ). Funders, as so-called boundary organisations, can act as key intermediaries between government, science, and society ( Jasanoff 2011 ). Their responsibility is to develop effective research policies connecting societal demands and scientific ‘supply’. This includes setting up and executing fair and balanced assessment procedures ( Sarewitz and Pielke 2007 ). Herein, the role of societal stakeholders is receiving increasing attention ( Benedictus et al. 2016 ; De Rijcke et al. 2016 ; Dijstelbloem et al. 2013 ; Scholten et al. 2018 ).

All charitable health research funders in the Netherlands have, in the last decade, included patients at different stages of the funding process, including in assessing research proposals ( Den Oudendammer et al. 2019 ). To facilitate research funders in involving patients in assessing research proposals, the federation of Dutch patient organisations set up an independent reviewer panel with (at-risk) patients and direct caregivers ( Patiëntenfederatie Nederland, n.d .). Other foundations have set up societal advisory panels including a wider range of societal actors than patients alone. The Committee Societal Quality (CSQ) of the DHF includes, for example, (at-risk) patients and a wide range of cardiovascular health-care professionals who are not active as academic researchers. This model is also applied by the Diabetes Foundation and the Princess Beatrix Muscle Foundation in the Netherlands ( Diabetesfonds, n.d .; Prinses Beatrix Spierfonds, n.d .).

In 2014, the Lancet presented a series of five papers about biomedical and health research known as the ‘increasing value, reducing waste’ series ( Macleod et al. 2014 ). The authors addressed several issues as well as potential solutions that funders can implement. They highlight, among others, the importance of improving the societal relevance of the research questions and including the burden of disease in research assessment in order to increase the value of biomedical and health science for society. A better understanding of and an increasing role of users of research are also part of the described solutions ( Chalmers et al. 2014 ; Van den Brink et al. 2016 ). This is also in line with the recommendations of the 2013 Declaration on Research Assessment (DORA) ( DORA 2013 ). These recommendations influence the way in which research funders operationalise their criteria in research assessment, how they balance the judgement of scientific and societal aspects, and how they involve societal stakeholders in peer review.

2.4 Panel peer review of research proposals

To assess research proposals, funders rely on the services of peer experts to review the thousands or perhaps millions of research proposals seeking funding each year. While often associated with scholarly publishing, peer review also includes the ex ante assessment of research grant and fellowship applications ( Abdoul et al. 2012 ). Peer review of proposals often includes a written assessment of a proposal by an anonymous peer and a peer review panel meeting to select the proposals eligible for funding. Peer review is an established component of professional academic practice, is deeply embedded in the research culture, and essentially consists of experts in a given domain appraising the professional performance, creativity, and/or quality of scientific work produced by others in their field of competence ( Demicheli and Di Pietrantonj 2007 ). The history of peer review as the default approach for scientific evaluation and accountability is, however, relatively young. While the term was unheard of in the 1960s, by 1970, it had become the standard. Since that time, peer review has become increasingly diverse and formalised, resulting in more public accountability ( Reinhart and Schendzielorz 2021 ).

While many studies have been conducted concerning peer review in scholarly publishing, peer review in grant allocation processes has been less discussed ( Demicheli and Di Pietrantonj 2007 ). The most extensive work on this topic has been conducted by Lamont (2009) . Lamont studied peer review panels in five American research funding organisations, including observing three panels. Other examples include Roumbanis’s ethnographic observations of ten review panels at the Swedish Research Council in natural and engineering sciences ( Roumbanis 2017 , 2021a ). Also, Huutoniemi was able to study, but not observe, four panels on environmental studies and social sciences of the Academy of Finland ( Huutoniemi 2012 ). Additionally, Van Arensbergen and Van den Besselaar (2012) analysed peer review through interviews and by analysing the scores and outcomes at different stages of the peer review process in a talent funding programme. In particular, interesting is the study by Luo and colleagues on 164 written panel review reports, showing that the reviews from panels that included non-scientific peers described broader and more concrete impact topics. Mixed panels also more often connected research processes and characteristics of applicants with impact creation ( Luo et al. 2021 ).

While these studies primarily focused on peer review panels in other disciplinary domains or are based on interviews or reports instead of direct observations, we believe that many of the findings are relevant to the functioning of panels in the context of biomedical and health research. From this literature, we learn to have realistic expectations of peer review. It is inherently difficult to predict in advance which research projects will provide the most important findings or breakthroughs ( Lee et al. 2013 ; Pier et al. 2018 ; Roumbanis 2021a , 2021b ). At the same time, these limitations may not substantiate the replacement of peer review by another assessment approach ( Wessely 1998 ). Many topics addressed in the literature are inter-related and relevant to our study, such as disciplinary differences and interdisciplinarity, social dynamics and their consequences for consistency and bias, and suggestions to improve panel peer review ( Lamont and Huutoniemi 2011 ; Lee et al. 2013 ; Pier et al. 2018 ; Roumbanis 2021a , b ; Wessely 1998 ).

Different scientific disciplines show different preferences and beliefs about how to build knowledge and thus have different perceptions of excellence. However, panellists are willing to respect and acknowledge other standards of excellence ( Lamont 2009 ). Evaluation cultures also differ between scientific fields. Science, technology, engineering, and mathematics panels might, in comparison with panellists from social sciences and humanities, be more concerned with the consistency of the assessment across panels and therefore with clear definitions and uses of assessment criteria ( Lamont and Huutoniemi 2011 ). However, much is still to learn about how panellists’ cognitive affiliations with particular disciplines unfold in the evaluation process. Therefore, the assessment of interdisciplinary research is much more complex than just improving the criteria or procedure because less explicit repertoires would also need to change ( Huutoniemi 2012 ).

Social dynamics play a role as panellists may differ in their motivation to engage in allocation processes, which could create bias ( Lee et al. 2013 ). Placing emphasis on meeting established standards or thoroughness in peer review may promote uncontroversial and safe projects, especially in a situation where strong competition puts pressure on experts to reach a consensus ( Langfeldt 2001 ,2006 ). Personal interest and cognitive similarity may also contribute to conservative bias, which could negatively affect controversial or frontier science ( Luukkonen 2012 ; Roumbanis 2021a ; Travis and Collins 1991 ). Central in this part of literature is that panel conclusions are the outcome of and are influenced by the group interaction ( Van Arensbergen et al. 2014a ). Differences in, for example, the status and expertise of the panel members can play an important role in group dynamics. Insights from social psychology on group dynamics can help in understanding and avoiding bias in peer review panels ( Olbrecht and Bornmann 2010 ). For example, group performance research shows that more diverse groups with complementary skills make better group decisions than homogenous groups. Yet, heterogeneity can also increase conflict within the group ( Forsyth 1999 ). Therefore, it is important to pay attention to power dynamics and maintain team spirit and good communication ( Van Arensbergen et al. 2014a ), especially in meetings that include both scientific and non-scientific peers.

The literature also provides funders with starting points to improve the peer review process. For example, the explicitness of review procedures positively influences the decision-making processes ( Langfeldt 2001 ). Strategic voting and decision-making appear to be less frequent in panels that rate than in panels that rank proposals. Also, an advisory instead of a decisional role may improve the quality of the panel assessment ( Lamont and Huutoniemi 2011 ).

Despite different disciplinary evaluative cultures, formal procedures, and criteria, panel members with different backgrounds develop shared customary rules of deliberation that facilitate agreement and help avoid situations of conflict ( Huutoniemi 2012 ; Lamont 2009 ). This is a necessary prerequisite for opening up peer review panels to include non-academic experts. When doing so, it is important to realise that panel review is a social, emotional, and interactional process. It is therefore important to also take these non-cognitive aspects into account when studying cognitive aspects ( Lamont and Guetzkow 2016 ), as we do in this study.

In summary, what we learn from the literature is that (1) the specific criteria used to operationalise the scientific quality and societal relevance of research are important, (2) the rationalities described by Glerup and Horst predict that not everyone values societal aspects, or involves non-scientists in peer review, to the same extent and in the same way, (3) this may affect the way peer review panels discuss these aspects, and (4) peer review is a challenging group process that could accommodate other rationalities in order to prevent bias towards specific scientific criteria. To disentangle these aspects, we have carried out an observational study of a diverse range of peer review panel sessions using a fixed set of criteria focusing on scientific quality and societal relevance.

3.1 Research assessment at ZonMw and the DHF

The peer review approach and the criteria used by the DHF and ZonMw are largely comparable. Funding programmes at both organisations start with a brochure describing the purposes, goals, and conditions for research applications, as well as the assessment procedure and criteria. Both organisations apply a two-stage process. In the first phase, reviewers are asked to write a peer review of the application. In the second phase, a panel reviews the application based on the advice in the written reviews and the applicants' rebuttal. The panels advise the board on the proposals eligible for funding, including a ranking of these proposals.

There are also differences between the two organisations. At ZonMw, the criteria for societal relevance and quality are operationalised in the ZonMw Framework Fostering Responsible Research Practices (Reijmerink and Oortwijn 2017). This contributes to a common operationalisation of both quality and societal relevance at the level of individual funding programmes. Important elements in the criteria for societal relevance are, for instance, stakeholder participation, (applying) holistic health concepts, and the added value of knowledge in practice, policy, and education. The framework was developed to optimise the funding process from the perspective of knowledge utilisation and includes concepts like productive interactions and Open Science. It is part of the ZonMw Impact Assessment Framework aimed at guiding the planning, monitoring, and evaluation of funding programmes (Reijmerink et al. 2020). At ZonMw, interdisciplinary panels are set up specifically for each funding programme; they include academics from a wide range of disciplines and often non-academic peers, such as policymakers, health-care professionals, and patients.

At the DHF, the criteria for scientific quality and societal relevance (at the DHF called societal impact) find their origin in the strategy report of the advisory committee CardioVascular Research Netherlands (Reneman et al. 2010). This report forms the basis of the DHF research policy, which focuses on scientific and societal impact by creating national collaborations in thematic, interdisciplinary research programmes (the so-called consortia) that connect preclinical and clinical expertise in one concerted effort. An International Scientific Advisory Committee (ISAC) was established to assess these thematic consortia. This panel consists of international scientists, primarily with expertise in the broad cardiovascular research field. The DHF criteria for societal impact were redeveloped in 2013 in collaboration with its CSQ, the panel that assesses and advises on the societal aspects of proposed studies. The societal impact criteria include the relevance of the health-care problem, the expected contribution to a solution, attention to the next step in science and towards implementation in practice, and the involvement of and interaction with (end-)users of research (R.Y. Abma-Schouten and I.M. Meijer, unpublished data). Peer review panels for consortium funding are generally composed of members of the ISAC, members of the CSQ, and ad hoc panel members relevant to the specific programme. CSQ members often hold a pre-meeting before the final panel meeting to prepare and empower the CSQ representatives participating in the peer review panel.

3.2 Selection of funding programmes

To compare and evaluate observations between the two organisations, we selected funding programmes that were relatively comparable in scope and aims. The criteria were (1) a translational and/or clinical objective and (2) a selection procedure in which review panels were responsible for the (final) relevance and quality assessment of grant applications. In total, we selected eight programmes: four at each organisation. At the DHF, two programmes were chosen in which the CSQ did not participate, to better disentangle the role of the panel composition. For each programme, we observed the selection process, which varied from a single session on one day (taking 2–8 hours) to multiple sessions over several days. Ten sessions were observed in total, of which eight were final peer review panel meetings and two were CSQ meetings preparing for the panel meeting.

After management approval for the study in both organisations, we asked the programme managers and panel chairpersons of the selected programmes for their consent for observation; none refused participation. Panel members were informed, through a passive consent procedure, about the planned observation and the anonymous analysis of the data.

To ensure the independence of this evaluation, the selection of the grant programmes and peer review panels to be observed was at the discretion of the project team of this study. The observations and the supervision of the analyses were performed by the senior author, who is not affiliated with either funder.

3.3 Observation matrix

Given the lack of a common operationalisation of scientific quality and societal relevance, we decided to use an observation matrix with a fixed set of detailed aspects as a gold standard to score the brochures, the assessment forms, and the arguments used in panel meetings. The matrix used for the observations of the review panels was based upon and adapted from a 'grant committee observation matrix' developed by Van Arensbergen. The original matrix informed a literature review on the selection of talent through peer review and on the social dynamics in grant review committees (van Arensbergen et al. 2014b). The matrix includes four categories of aspects: societal relevance, scientific quality, committee-related, and applicant-related (see Table 1). The aspects of scientific quality and societal relevance were adapted to fit the operationalisation used by the organisations involved. The aspects concerning societal relevance were derived from the CSQ criteria, and the aspects concerning scientific quality were based on the scientific criteria of the first panel observed. The four committee-related argument types were kept as they were. This committee-related category reflects statements that relate to the personal experience or preference of a panel member and can be seen as signals of bias. It also includes statements that compare a project with another project without further substantiation. The three applicant-related arguments in the original observation matrix were extended with a fourth on social skills in communication with society. We added health technology assessment (HTA) because one programme specifically focused on this aspect. We tested our version of the observation matrix in pilot observations.

Table 1. Aspects included in the observation matrix and examples of arguments.
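To make the structure of the observation matrix concrete, the sketch below shows one possible way to represent it in code. This is a minimal illustration, not the authors' instrument: the category names follow the paper, but the individual aspect labels are reconstructed from aspects mentioned in the text and are partly assumptions.

```python
# A minimal sketch of an observation matrix for scoring panel arguments.
# Category names follow the paper; the aspect labels are reconstructed from
# aspects mentioned in the text and are partly assumptions, not the authors'
# exact wording.
OBSERVATION_MATRIX = {
    "scientific_quality": [
        "feasibility_of_aims",
        "match_between_science_and_problem",
        "plan_of_work",
        "international_competitiveness",
    ],
    "societal_relevance": [
        "relevance_of_healthcare_problem",
        "contribution_to_solution",
        "next_step_in_science_and_implementation",
        "activities_towards_partners",
        "patient_participation_and_diversity",
    ],
    "committee_related": [  # statements that can signal bias
        "personal_experience_with_applicant",
        "personal_preference",
        "comparison_with_other_proposal_without_substantiation",
        "other_unsubstantiated_statement",
    ],
    "applicant_related": [
        "background_and_reputation",       # assumed label
        "track_record",                    # assumed label
        "team_and_expertise",              # assumed label
        "social_skills_in_communication_with_society",  # aspect added by the authors
    ],
    # Added because one programme focused on health technology assessment;
    # treating it as a separate category here is an assumption.
    "hta": ["health_technology_assessment"],
}

def category_of(aspect: str) -> str:
    """Return the matrix category to which an observed argument aspect belongs."""
    for category, aspects in OBSERVATION_MATRIX.items():
        if aspect in aspects:
            return category
    raise KeyError(f"Unknown aspect: {aspect}")
```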

3.4 Observations

Data were primarily collected through observations. Our observations of review panel meetings were non-participatory: the observer and the goal of the observation were introduced at the start of the meeting, without further interaction during the meeting. To aid the processing of observations, some meetings were audiotaped (sound only). Presentations and responses of applicants were not noted and were not part of the analysis. The observer made notes on the ongoing discussion and scored the arguments while listening. One meeting was not attended in person and was observed and scored solely from the audio recording. Because this made identification of the panel members unreliable, this panel meeting was excluded from the analysis of the third research question on how the arguments used differ between panel members with different perspectives.

3.5 Grant programmes and the assessment criteria

We gathered and analysed all brochures and assessment forms used by the review panels in order to answer our second research question on the correspondence of the arguments used with the formal criteria. Several programmes consisted of multiple grant calls; in those cases, the specific call brochure was gathered and analysed rather than the overall programme brochure. Additional documentation (e.g. instructional presentations at the start of the panel meeting) was not included in the document analysis. All included documents were marked using the aforementioned observation matrix. The panel-related arguments were not used, because this category reflects the personal arguments of panel members, which are not part of brochures or instructions. To avoid potential differences in scoring methods, two of the authors each independently scored half of the documents; each half was afterwards checked and validated by the other author. Differences were discussed until a consensus was reached.

3.6 Panel composition

In order to answer the third research question, background information on the panel members was collected. We categorised the panel members into five common types: scientific, clinical scientific, health-care professional/clinical, patient, and policy. First, a list of all panel members was composed, including their scientific and professional backgrounds and affiliations. The theoretical notion that reviewers represent different types of users of research, and therefore potential impact domains (academic, social, economic, and cultural), guided the categorisation (Meijer 2012; Spaapen and Van Drooge 2011). Because clinical researchers play a dual role, advancing research as fellow academics and using research output in health-care practice, we divided the academic members into two categories: non-clinical and clinical researchers. Multiple types of professional actors participated in each review panel. These were divided into two groups for the analysis: health-care professionals (without current academic activity) and policymakers in the health-care sector. No representatives of the private sector participated in the observed review panels. From the public domain, (at-risk) patients and patient representatives were part of several review panels. Only publicly available information was used to classify the panel members. Members were assigned to one category only: categorisation was based on the specific role and expertise for which they were appointed to the panel.
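As a simple illustration of this categorisation step, the sketch below maps a panel member's appointed role to one of the five categories. The example roles and mapping rules are hypothetical; in the study, categorisation was done manually from publicly available information about the role for which each member was appointed.

```python
# Five member categories used in the analysis. The example roles and the
# mapping below are illustrative assumptions, not the authors' actual
# classification procedure, which was carried out manually.
CATEGORIES = (
    "scientific",
    "clinical scientific",
    "health-care professional",
    "policy",
    "patient",
)

EXAMPLE_ROLE_TO_CATEGORY = {
    "professor of molecular biology": "scientific",
    "cardiologist leading a research group": "clinical scientific",
    "nurse practitioner": "health-care professional",
    "policy advisor at a health ministry": "policy",
    "patient representative": "patient",
}

def categorise(appointed_role: str) -> str:
    """Assign a panel member to exactly one category, based on the role
    for which they were appointed to the panel."""
    return EXAMPLE_ROLE_TO_CATEGORY.get(appointed_role, "unclassified")
```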

In two of the four DHF programmes, the assessment procedure included the CSQ. In these two programmes, representatives of the CSQ participated in the scientific panel to articulate the findings of the CSQ meeting during the final assessment meeting. Two grant programmes were assessed by a review panel consisting solely of (clinical) scientific members.

3.7 Analysis

Data were processed using ATLAS.ti 8 and Microsoft Excel 2010 to produce descriptive statistics. All observed arguments were coded, and each was given a randomised identification code for the panel member using that particular argument. The number of times an argument type was observed was used as an indicator of the relative importance of that argument in the appraisal of proposals. With this approach, we developed a practical and reproducible method for research funders to evaluate the effect of policy changes on peer review. If codes or notes were unclear, post-observation validation of the codes was carried out based on the observation matrix notes. Arguments that were noted by the observer but could not be matched with an existing code were first given a 'non-existing' code; these were resolved by listening back to the audiotapes. Arguments that could not be assigned to a panel member were given a 'missing panel member' code; 4.7 per cent of all codes were assigned this code.
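The counting logic behind these descriptive statistics can be illustrated with a short sketch. The data format and field names below are assumptions for illustration only; the actual coding was done in ATLAS.ti 8 and Microsoft Excel.

```python
from collections import Counter

# Hypothetical coded observations: (anonymised panel member ID, argument aspect).
# IDs and aspects are illustrative; "MISSING" marks arguments that could not be
# attributed to a panel member.
observations = [
    ("M-042", "feasibility_of_aims"),
    ("M-017", "contribution_to_solution"),
    ("MISSING", "plan_of_work"),
    ("M-042", "feasibility_of_aims"),
]

# Frequency of each argument type, used as a proxy for its relative importance
# in the appraisal of proposals.
argument_counts = Counter(aspect for _, aspect in observations)

# Share of codes that could not be assigned to a panel member.
missing_share = sum(1 for member, _ in observations if member == "MISSING") / len(observations)

print(argument_counts.most_common())
print(f"Share of 'missing panel member' codes: {missing_share:.1%}")
```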

After the analyses, two meetings were held to reflect on the results: one with the CSQ and the other with the programme coordinators of both organisations. The goal of these meetings was to improve our interpretation of the findings, disseminate the results derived from this project, and identify topics for further analyses or future studies.

3.8 Limitations

Our study focuses on the final phase of the peer review process of research applications in a real-life setting. Our design, a non-participant observation of peer review panels, also introduced several challenges (Liu and Maitlis 2010).

First, the independent review phase or pre-application phase was not part of our study. We therefore could not assess to what extent attention to certain aspects of scientific quality or societal relevance and impact in the review phase influenced the topics discussed during the meeting.

Second, the most important challenge of overt non-participant observation is the observer effect: the danger of causing reactivity in those under study. We believe that the consequences of this effect for our conclusions were limited, because panellists are used to external observers in the meetings of these two funders. The observer briefly explained the goal of the study in general terms during the introductory round of the panel, sat as unobtrusively as possible, and avoided reacting to the discussions. As in previous panel observations, we found that the presence of an observer faded into the background during the meeting (Roumbanis 2021a). However, a limited observer effect can never be entirely excluded.

Third, our decision to score only the arguments raised, and not the responses of the applicants or information on the content of the proposals, has both advantages and drawbacks. With this approach, we could assure the anonymity of the grant procedures reviewed, the applicants and proposals, the panels, and the individual panellists. This was an important condition for the funders involved. We took the frequency with which arguments were used as a proxy for the relative importance of each argument in decision-making, which undeniably has its caveats. Our data collection approach limits more in-depth reflection on which arguments were decisive in decision-making and on group dynamics during the interaction with the applicants, as non-verbal and non-content-related comments were not captured in this study.

Fourth, despite this being one of the largest observational studies of the peer review assessment of grant applications, covering ten panels in eight grant programmes, many variables, both within and beyond our view, might explain differences in the arguments used. Examples of 'confounding' variables are the many variations in panel composition, the differences in programme objectives, and the range of the funding programmes. Our study should therefore be seen as exploratory, which warrants caution in drawing conclusions.

4.1 Overview of observational data

The grant programmes included in this study reflected a broad range of biomedical and health funding programmes, ranging from fellowship grants to translational research and applied health research. All formal documents available to the applicants and to the review panel were retrieved for both ZonMw and the DHF. In total, eighteen documents corresponding to the eight grant programmes were studied. The number of proposals assessed per programme varied from three to thirty-three. The duration of the panel meetings varied between 2 hours and two consecutive days. Together, this resulted in a large spread in the total number of arguments used in an individual meeting and in a grant programme as a whole. In the shortest meeting, 49 arguments were observed versus 254 in the longest, with a mean of 126 arguments per meeting and on average 15 arguments per proposal.

Overall, we found consistency between how the criteria were operationalised in the grant programmes' brochures and in the assessment forms of the review panels. At the same time, because the number of elements included in the observation matrix is limited, there was considerable diversity in the arguments that fall within each aspect (see examples in Table 1). Some of these differences could possibly be explained by differences in the language used and in the level of detail in the observation matrix, the brochure, and the panel's instructions. This was especially the case for the applicant-related aspects, for which the observation matrix was more detailed than the text in the brochures and assessment forms.

In interpreting our findings, it is important to take into account that, even though our data were largely complete and the observation matrix matched well with the description of the criteria in the brochures and assessment forms, there was a large diversity in the type and number of arguments used and in the number of proposals assessed across the grant programmes included in our study.

4.2 Wide range of arguments used by panels: scientific arguments used most

For our first research question, we explored the number and type of arguments used in the panel meetings. Figure 1 provides an overview of the arguments used. Scientific quality was discussed most. The number of times the feasibility of the aims was discussed clearly stands out in comparison to all other arguments. Also, the match between the science and the problem studied and the plan of work were frequently discussed aspects of scientific quality. International competitiveness of the proposal was discussed the least of all five scientific arguments.

Figure 1. The number of arguments used in panel meetings.

Attention was paid to societal relevance and impact in the panel meetings of both organisations. Yet, the language used differed somewhat between organisations. The contribution to a solution and the next step in science were the most often used societal arguments. At ZonMw, the impact of the health-care problem studied and the activities towards partners were less frequently discussed than the other three societal arguments. At the DHF, the five societal arguments were used equally often.

With the exception of the fellowship programme meeting, applicant-related arguments were not often used. The fellowship panel used arguments related to the applicant and to scientific quality about equally often. Committee-related arguments were also rarely used in the majority of the eight grant programmes observed. In three of the ten panel meetings, one or two arguments were observed that related to personal experience with the applicant or their direct network. In seven of the ten meetings, statements were observed that were unsubstantiated or explicitly announced as reflecting a personal preference. The frequency varied between one and seven statements (sixteen in total), which is low in comparison to the other arguments used (see Fig. 1 for examples).

4.3 Use of arguments varied strongly per panel meeting

The balance between scientific and societal arguments varied strongly per grant programme, panel, and organisation. At ZonMw, two meetings showed an approximately equal balance of societal and scientific arguments. In the other two meetings, scientific arguments were used two to four times as often as societal arguments. At the DHF, three types of panels were observed, each with a different pattern in the relative use of societal and scientific arguments. In the two CSQ-only meetings, societal arguments were used approximately twice as often as scientific arguments. In the two meetings of the scientific panels, societal arguments were used infrequently (between zero and four times per argument category). In the combined societal and scientific panel meetings, the use of societal and scientific arguments was more balanced.

4.4 Match of arguments used by panels with the assessment criteria

In order to answer our second research question, we looked into the relation between the arguments used and the formal criteria. We observed that panels often used a broader range of arguments than described in the brochure and assessment instructions. However, arguments related to aspects that were consistently included in the brochure and instructions seemed to be discussed more frequently than in programmes where those aspects were included inconsistently or not at all. Although the match of the science with the health-care problem and the background and reputation of the applicant were not always made explicit in the brochure or instructions, they were discussed in many panel meetings. Supplementary Fig. S1 provides a visualisation of how the arguments used differ between programmes in which those aspects were, or were not, consistently included in the brochure and instruction forms.

4.5 Two-thirds of the assessment was driven by scientific panel members

To answer our third question, we looked into the differences in arguments used between panel members representing a scientific, clinical scientific, professional, policy, or patient perspective. In each research programme, the majority of panellists had a (clinical) scientific background: in total, thirty-five members had a scientific background, thirty-four a clinical scientific background, twenty a health professional/clinical background, eight represented a policy perspective, and fifteen a patient perspective. Of the total number of arguments (1,097), two-thirds were made by members with a scientific or clinical scientific perspective. Members with a scientific background engaged most actively in the discussion, with a mean of twelve arguments per member. Clinical scientists and health-care professionals each contributed a mean of nine arguments, and members with a policy or patient perspective put forward the fewest arguments on average, namely seven and eight, respectively. Figure 2 provides a complete overview of the total and mean number of arguments used by the different disciplines in the various panels.

Figure 2. The total and mean number of arguments displayed per subgroup of panel members.
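The kind of aggregation behind these per-subgroup totals and means can be sketched as follows. The member-to-category mapping and the argument counts in the sketch are placeholders for illustration, not the study's data.

```python
from collections import Counter

# Placeholder data: number of coded arguments per anonymised member, and the
# category each member was assigned to. Values are illustrative only.
arguments_by_member = {"M-001": 14, "M-002": 10, "M-003": 7}
member_category = {
    "M-001": "scientific",
    "M-002": "clinical scientific",
    "M-003": "patient",
}

totals: Counter = Counter()
members_per_category: Counter = Counter()
for member, n_arguments in arguments_by_member.items():
    category = member_category[member]
    totals[category] += n_arguments
    members_per_category[category] += 1

for category, total in totals.items():
    mean = total / members_per_category[category]
    print(f"{category}: total={total}, mean per member={mean:.1f}")
```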

4.6 Diverse use of arguments by panellists, but background matters

In meetings of both organisations, we observed a diverse use of arguments by the panel members. Yet, the use of arguments varied depending on the background of the panel member (see  Fig. 3 ). Those with a scientific and clinical scientific perspective used primarily scientific arguments. As could be expected, health-care professionals and patients used societal arguments more often.

Figure 3. The use of arguments differentiated by panel member background.

A further breakdown of arguments across backgrounds showed clear differences in the use of scientific arguments between the different disciplines of panellists. Scientists and clinical scientists discussed the feasibility of the aims more than twice as often as their second most frequently used element of scientific quality, the match between the science and the problem studied. Patients and members with a policy or health professional background put forward fewer but more varied scientific arguments.

Patients and health-care professionals accounted for approximately half of the societal arguments used, despite being a much smaller part of the panel’s overall composition. In other words, members with a scientific perspective were less likely to use societal arguments. The relevance of the health-care problem studied, activities towards partners , and arguments related to participation and diversity were not used often by this group. Patients often used arguments related to patient participation and diversity and activities towards partners , although the frequency of the use of the latter differed per organisation.

The majority of the applicant-related arguments were put forward by scientists, including clinical scientists. Committee-related arguments were very rare and are therefore not differentiated by panel member background, except for comments comparing a proposal with other applications; these were mainly put forward by panel members with a scientific background. HTA-related arguments were often used by panel members with a scientific perspective, whereas panel members with other perspectives scarcely used them (see Supplementary Figs S2–S4 for a visual presentation of the differences between panel members on all aspects included in the matrix).

5.1 Explanations for arguments used in panels

Our observations show that most types of scientific quality arguments were used frequently. However, except for feasibility, the frequency with which arguments were used varied strongly between the meetings and between the individual proposals discussed. That most arguments were not used consistently is not surprising, given the results of previous studies showing heterogeneity in grant application assessments and low consistency in comments and scores by independent reviewers (Abdoul et al. 2012; Pier et al. 2018). In an analysis of written assessments on nine observed dimensions, no dimension was used in more than 45 per cent of the reviews (Hartmann and Neidhardt 1990).

There are several possible explanations for this heterogeneity. Roumbanis (2021a) described how being responsive to the different challenges in the proposals and to the points of attention arising from the written assessments influenced discussion in panels. Also, when a disagreement arises, more time is spent on discussion (Roumbanis 2021a). One could infer that unambiguous, and thus undebated, aspects might remain largely undetected in our study. We believe, however, that the main points relevant to the assessment will not remain entirely unmentioned, because most panels in our study started the discussion with a short summary of the proposal, the written assessment, and the rebuttal. Lamont (2009), however, points out that opening statements serve more goals than merely decision-making. They can also increase the credibility of the panellist by showing their comprehension and balanced assessment of an application. We can therefore not entirely disentangle whether the arguments observed most often were also the most important or decisive ones, or whether they were simply the topics that led to the most disagreement.

An interesting difference with Roumbanis' study was the available discussion time per proposal. In our study, most panels handled a limited number of proposals, allowing for longer discussions in comparison with the often 2-minute time frame that Roumbanis (2021b) described, potentially contributing to a wider range of arguments being discussed. Limited time per proposal might also limit the number of panellists contributing to the discussion of each proposal (De Bont 2014).

5.2 Reducing heterogeneity by improving operationalisation and the consequent use of assessment criteria

We found that the language used for the operationalisation of the assessment criteria in programme brochures and in the observation matrix was much more detailed than in the instruction for the panel, which was often very concise. The exercise also illustrated that many terms were used interchangeably.

This was especially true for the applicant-related aspects. Several panels discussed how talent should be assessed. This confusion is understandable given the changing values in research and its assessment (Moher et al. 2018) and the fact that the funders' instructions were very concise. For example, it was not made explicit whether the individual or the team should be assessed. Van Arensbergen et al. (2014b) described how, in grant allocation processes, talent is generally assessed using a limited set of characteristics. More objective and quantifiable outputs often prevailed at the expense of recognising and rewarding a broad variety of skills and traits combining professional, social, and individual capital (DORA 2013).

In addition, committee-related arguments, like personal experiences with the applicant or their institute, were rarely used in our study. Comparisons between proposals were sometimes made without further argumentation, mainly by scientific panel members. This was especially pronounced in one (fellowship) grant programme with a high number of proposals. In this programme, the panel meeting concentrated on quickly comparing the quality of the applicants and of the proposals based on the reviewers' judgement, rather than on a more in-depth discussion of the different aspects of the proposals. Because the review phase was not part of this study, the question of which aspects were used for the assessment of the proposals in this panel remains partially unanswered. However, weighing and comparing proposals on different aspects and with different inputs is a core element of scientific peer review, both in the review of papers and in the review of grants (Hirschauer 2010). The large role of scientific panel members in comparing proposals is therefore not surprising.

One could anticipate that more consistent language in operationalising the criteria may lead to more clarity for both applicants and panellists and to more consistency in the assessment of research proposals. The trend in our observations was that arguments were used less when the related criteria were not, or not consistently, included in the brochure and panel instruction. It remains, however, challenging to disentangle the influence of the formal definitions of criteria on the arguments used. Previous studies also encountered difficulties in studying the role of the formal instruction in peer review but concluded that this role is relatively limited (Langfeldt 2001; Reinhart 2010).

The lack of a clear operationalisation of criteria can contribute to heterogeneity in peer review, as many scholars have found that assessors differ in their conceptualisation of good science and in the importance they attach to various aspects of research quality and societal relevance (Abdoul et al. 2012; Geurts 2016; Scholten et al. 2018; Van den Brink et al. 2016). The large variation and the absence of a gold standard in the interpretation of scientific quality and societal relevance affect the consistency of peer review. As a consequence, it is challenging to systematically evaluate and improve peer review in order to fund the research that contributes most to science and society. To contribute to responsible research and innovation, it is therefore important that funders invest in a more consistent and conscientious peer review process (Curry et al. 2020; DORA 2013).

A common conceptualisation of scientific quality and societal relevance and impact could improve the alignment between views on good scientific conduct, programmes’ objectives, and the peer review in practice. Such a conceptualisation could contribute to more transparency and quality in the assessment of research. By involving panel members from all relevant backgrounds, including the research community, health-care professionals, and societal actors, in a better operationalisation of criteria, more inclusive views of good science can be implemented more systematically in the peer review assessment of research proposals. The ZonMw Framework Fostering Responsible Research Practices is an example of an initiative aiming to support standardisation and integration ( Reijmerink et al. 2020 ).

Given the lack of a common definition or conceptualisation of scientific quality and societal relevance, an important decision in our study was to use a fixed set of detailed aspects of these two criteria as a gold standard to score the brochures, the panel instructions, and the arguments used by the panels. This approach proved helpful in disentangling the different components of scientific quality and societal relevance. That said, it is important not to oversimplify the causes of heterogeneity in peer review, because these substantive arguments are not independent of non-cognitive, emotional, or social aspects (Lamont and Guetzkow 2016; Reinhart 2010).

5.3 Do more diverse panels contribute to a broader use of arguments?

Both funders participating in our study have an explicit public mission that requires sufficient attention to societal aspects in assessment processes. In reality, as observed in several panels, the main focus of peer review meetings is on scientific arguments. In addition to the possible explanations above, the composition of the panel might help explain the arguments used in panel meetings. Our results show that health-care professionals and patients bring in more societal arguments than scientists, including those who are also clinicians. It is, however, not that simple: in the more diverse panels, panel members used more societal arguments, regardless of their backgrounds, than in the less diverse panels.

Observing ten panel meetings was sufficient to explore differences in the arguments used by panel members with different backgrounds. The pattern of (primarily) scientific arguments being raised by panels with mainly scientific members is not surprising. After all, their main task is to assess the scientific content of grant proposals, and this fits their competencies. As such, one could argue, depending on how one views the relationship between science and society, that health-care professionals and patients might be better suited to assess the value of research results for potential users. Scientific panel members and clinical scientists in our study used fewer arguments that reflect on opening up and connecting science directly to others who can take it further (be it industry, health-care professionals, or other stakeholders). Patients filled this gap, since these two types of arguments were the most prevalent ones they put forward. Apparently, a broader, more diverse panel is needed for scientists to direct their attention to more societal arguments and to make an active connection with society. Evident from our observations is that, in panels with patients and health-care professionals, their presence seemed to increase the attention paid to arguments beyond the scientific ones by all panel members, including scientists. This conclusion is congruent with the observation that there was a more equal balance in the use of societal and scientific arguments in the scientific panels in which the CSQ participated. This illustrates that opening up peer review panels to non-scientific members creates an opportunity to focus on both the contribution and the integrative rationality (Glerup and Horst 2014) or, in other words, to allow productive interactions between scientific and non-scientific actors. This corresponds with previous research suggesting that, with regard to societal aspects, reviews from mixed panels were broader and richer (Luo et al. 2021). In panels with non-scientific experts, more emphasis was placed on the role of the proposed research process in increasing the likelihood of societal impact than on the causal importance of scientific excellence for broader impacts. This is in line with the finding that panels with more disciplinary diversity, in range and also by including generalist experts, applied more versatile styles to reach consensus and paid more attention to relevance and pragmatic value (Huutoniemi 2012).

Our observations further illustrate that patients and health-care professionals were less vocal in panels than (clinical) scientists and were in the minority. This could reflect their social role and lower perceived authority in the panel. Several guides are available to help funders stimulate the equal participation of patients in science, and these are also applicable to their involvement in peer review panels. Measures include support and training to prepare patients for deliberations with renowned scientists, and explicitly addressing power differences (De Wit et al. 2016). Panel chairs and programme officers have to set and supervise the conditions for the functioning of both the individual panel members and the panel as a whole (Lamont 2009).

5.4 Suggestions for future studies

In future studies, it is important to further disentangle the role of the operationalisation and appraisal of assessment criteria in reducing heterogeneity in the arguments used by panels. More controlled experimental settings are a valuable addition to the current mainly observational methodologies applied to disentangle some of the cognitive and social factors that influence the functioning and argumentation of peer review panels. Reusing data from the panel observations and the data on the written reports could also provide a starting point for a bottom-up approach to create a more consistent and shared conceptualisation and operationalisation of assessment criteria.

To further understand the effects of opening up review panels to non-scientific peers, it is valuable to compare the role of diversity and interdisciplinarity in solely scientific panels versus panels that also include non-scientific experts.

In future studies, differences between domains and types of research should also be addressed. We hypothesise that biomedical and health research may be more suited to the inclusion of non-scientific peers in panels than other research domains. For example, it is valuable to better understand how potentially relevant users can be adequately identified in other research fields and to what extent non-academics can contribute to assessing the possible value of research, especially early-stage or blue-sky research.

The goal of our study was to explore in practice which arguments regarding the main criteria of scientific quality and societal relevance were used by peer review panels of biomedical and health research funding programmes. We showed that there is wide diversity in the number and range of arguments used, but three main scientific aspects were discussed most frequently: is the approach feasible, does the science match the problem, and is the work plan scientifically sound? Nevertheless, these scientific aspects were accompanied by a significant amount of discussion of societal aspects, of which the contribution to a solution was the most prominent. In comparison with scientific panellists, non-scientific panellists, such as health-care professionals, policymakers, and patients, often used a wider range of arguments and other societal arguments. Even more striking was that, even though non-scientific peers were often outnumbered and less vocal in panels, scientists also used a wider range of arguments when non-scientific peers were present.

It is relevant that two health research funders collaborated in the current study to reflect on and improve peer review in research funding. There are few studies published that describe live observations of peer review panel meetings. Many studies focus on alternatives for peer review or reflect on the outcomes of the peer review process, instead of reflecting on the practice and improvement of peer review assessment of grant proposals. Privacy and confidentiality concerns of funders also contribute to the lack of information on the functioning of peer review panels. In this study, both organisations were willing to participate because of their interest in research funding policies in relation to enhancing the societal value and impact of science. The study provided them with practical suggestions, for example, on how to improve the alignment in language used in programme brochures and instructions of review panels, and contributed to valuable knowledge exchanges between organisations. We hope that this publication stimulates more research funders to evaluate their peer review approach in research funding and share their insights.

For a long time, research funders relied solely on scientists to design and execute the peer review of research proposals, thereby delegating responsibility for the process. Although review panels have discretionary authority, it is important that funders set and supervise the process and its conditions. We argue that one of these conditions should be the diversification of peer review panels and opening them up to non-scientific peers.

Supplementary material is available at Science and Public Policy online.

Details of the data and information on how to request access are available from the first author.

Joey Gijbels and Wendy Reijmerink are employed by ZonMw. Rebecca Abma-Schouten is employed by the Dutch Heart Foundation and is an external PhD candidate affiliated with the Centre for Science and Technology Studies, Leiden University.

A special thanks to the panel chairs and programme officers of ZonMw and the DHF for their willingness to participate in this project. We thank Diny Stekelenburg, an internship student at ZonMw, for her contributions to the project. Our sincerest gratitude to Prof. Paul Wouters, Sarah Coombs, and Michiel van der Vaart for proofreading and their valuable feedback. Finally, we thank the editors and anonymous reviewers of Science and Public Policy for their thorough and insightful reviews and recommendations. Their contributions are recognisable in the final version of this paper.

Abdoul   H. , Perrey   C. , Amiel   P. , et al.  ( 2012 ) ‘ Peer Review of Grant Applications: Criteria Used and Qualitative Study of Reviewer Practices ’, PLoS One , 7 : 1 – 15 .


Abma-Schouten   R. Y. ( 2017 ) ‘ Maatschappelijke Kwaliteit van Onderzoeksvoorstellen ’, Dutch Heart Foundation .

Alla   K. , Hall   W. D. , Whiteford   H. A. , et al.  ( 2017 ) ‘ How Do We Define the Policy Impact of Public Health Research? A Systematic Review ’, Health Research Policy and Systems , 15 : 84.

Benedictus   R. , Miedema   F. , and Ferguson   M. W. J. ( 2016 ) ‘ Fewer Numbers, Better Science ’, Nature , 538 : 453 – 4 .

Chalmers   I. , Bracken   M. B. , Djulbegovic   B. , et al.  ( 2014 ) ‘ How to Increase Value and Reduce Waste When Research Priorities Are Set ’, The Lancet , 383 : 156 – 65 .

Curry   S. , De Rijcke   S. , Hatch   A. , et al.  ( 2020 ) ‘ The Changing Role of Funders in Responsible Research Assessment: Progress, Obstacles and the Way Ahead ’, RoRI Working Paper No. 3, London : Research on Research Institute (RoRI) .

De Bont   A. ( 2014 ) ‘ Beoordelen Bekeken. Reflecties op het Werk van Een Programmacommissie van ZonMw ’, ZonMw .

De Rijcke   S. , Wouters   P. F. , Rushforth   A. D. , et al.  ( 2016 ) ‘ Evaluation Practices and Effects of Indicator Use—a Literature Review ’, Research Evaluation , 25 : 161 – 9 .

De Wit   A. M. , Bloemkolk   D. , Teunissen   T. , et al.  ( 2016 ) ‘ Voorwaarden voor Succesvolle Betrokkenheid van Patiënten/cliënten bij Medisch Wetenschappelijk Onderzoek ’, Tijdschrift voor Sociale Gezondheidszorg , 94 : 91 – 100 .

Del Carmen Calatrava Moreno   M. , Warta   K. , Arnold   E. , et al.  ( 2019 ) Science Europe Study on Research Assessment Practices . Technopolis Group Austria .


Demicheli   V. and Di Pietrantonj   C. ( 2007 ) ‘ Peer Review for Improving the Quality of Grant Applications ’, Cochrane Database of Systematic Reviews , 2 : MR000003.

Den Oudendammer   W. M. , Noordhoek   J. , Abma-Schouten   R. Y. , et al.  ( 2019 ) ‘ Patient Participation in Research Funding: An Overview of When, Why and How Amongst Dutch Health Funds ’, Research Involvement and Engagement , 5 .

Diabetesfonds ( n.d. ) Maatschappelijke Adviesraad < https://www.diabetesfonds.nl/over-ons/maatschappelijke-adviesraad > accessed 18 Sept 2022 .

Dijstelbloem   H. , Huisman   F. , Miedema   F. , et al.  ( 2013 ) ‘ Science in Transition Position Paper: Waarom de Wetenschap Niet Werkt Zoals het Moet, En Wat Daar aan te Doen Is ’, Utrecht : Science in Transition .

Forsyth   D. R. ( 1999 ) Group Dynamics , 3rd edn. Belmont : Wadsworth Publishing Company .

Geurts   J. ( 2016 ) ‘ Wat Goed Is, Herken Je Meteen ’, NRC Handelsblad < https://www.nrc.nl/nieuws/2016/10/28/wat-goed-is-herken-je-meteen-4975248-a1529050 > accessed 6 Mar 2022 .

Glerup   C. and Horst   M. ( 2014 ) ‘ Mapping “Social Responsibility” in Science ’, Journal of Responsible Innovation , 1 : 31 – 50 .

Hartmann   I. and Neidhardt   F. ( 1990 ) ‘ Peer Review at the Deutsche Forschungsgemeinschaft ’, Scientometrics , 19 : 419 – 25 .

Hirschauer   S. ( 2010 ) ‘ Editorial Judgments: A Praxeology of “Voting” in Peer Review ’, Social Studies of Science , 40 : 71 – 103 .

Hughes   A. and Kitson   M. ( 2012 ) ‘ Pathways to Impact and the Strategic Role of Universities: New Evidence on the Breadth and Depth of University Knowledge Exchange in the UK and the Factors Constraining Its Development ’, Cambridge Journal of Economics , 36 : 723 – 50 .

Huutoniemi   K. ( 2012 ) ‘ Communicating and Compromising on Disciplinary Expertise in the Peer Review of Research Proposals ’, Social Studies of Science , 42 : 897 – 921 .

Jasanoff   S. ( 2011 ) ‘ Constitutional Moments in Governing Science and Technology ’, Science and Engineering Ethics , 17 : 621 – 38 .

Kolarz   P. , Arnold   E. , Farla   K. , et al.  ( 2016 ) Evaluation of the ESRC Transformative Research Scheme . Brighton : Technopolis Group .

Lamont   M. ( 2009 ) How Professors Think : Inside the Curious World of Academic Judgment . Cambridge : Harvard University Press .

Lamont   M. Guetzkow   J. ( 2016 ) ‘How Quality Is Recognized by Peer Review Panels: The Case of the Humanities’, in M.   Ochsner , S. E.   Hug , and H.-D.   Daniel (eds) Research Assessment in the Humanities , pp. 31 – 41 . Cham : Springer International Publishing .

Lamont   M. Huutoniemi   K. ( 2011 ) ‘Comparing Customary Rules of Fairness: Evaluative Practices in Various Types of Peer Review Panels’, in C.   Charles   G.   Neil and L.   Michèle (eds) Social Knowledge in the Making , pp. 209–32. Chicago : The University of Chicago Press .

Langfeldt   L. ( 2001 ) ‘ The Decision-making Constraints and Processes of Grant Peer Review, and Their Effects on the Review Outcome ’, Social Studies of Science , 31 : 820 – 41 .

——— ( 2006 ) ‘ The Policy Challenges of Peer Review: Managing Bias, Conflict of Interests and Interdisciplinary Assessments ’, Research Evaluation , 15 : 31 – 41 .

Lee   C. J. , Sugimoto   C. R. , Zhang   G. , et al.  ( 2013 ) ‘ Bias in Peer Review ’, Journal of the American Society for Information Science and Technology , 64 : 2 – 17 .

Liu   F. Maitlis   S. ( 2010 ) ‘Nonparticipant Observation’, in A. J.   Mills , G.   Durepos , and E.   Wiebe (eds) Encyclopedia of Case Study Research , pp. 609 – 11 . Los Angeles : SAGE .

Luo J., Ma L., and Shankar K. (2021) 'Does the Inclusion of Non-academic Reviewers Make Any Difference for Grant Impact Panels?', Science & Public Policy, 48: 763–75.

Luukkonen   T. ( 2012 ) ‘ Conservatism and Risk-taking in Peer Review: Emerging ERC Practices ’, Research Evaluation , 21 : 48 – 60 .

Macleod   M. R. , Michie   S. , Roberts   I. , et al.  ( 2014 ) ‘ Biomedical Research: Increasing Value, Reducing Waste ’, The Lancet , 383 : 101 – 4 .

Meijer   I. M. ( 2012 ) ‘ Societal Returns of Scientific Research. How Can We Measure It? ’, Leiden : Center for Science and Technology Studies, Leiden University .

Merton   R. K. ( 1968 ) Social Theory and Social Structure , Enlarged edn. [Nachdr.] . New York : The Free Press .

Moher   D. , Naudet   F. , Cristea   I. A. , et al.  ( 2018 ) ‘ Assessing Scientists for Hiring, Promotion, And Tenure ’, PLoS Biology , 16 : e2004089.

Olbrecht   M. and Bornmann   L. ( 2010 ) ‘ Panel Peer Review of Grant Applications: What Do We Know from Research in Social Psychology on Judgment and Decision-making in Groups? ’, Research Evaluation , 19 : 293 – 304 .

Patiëntenfederatie Nederland ( n.d. ) Ervaringsdeskundigen Referentenpanel < https://www.patientenfederatie.nl/zet-je-ervaring-in/lid-worden-van-ons-referentenpanel > accessed 18 Sept 2022.

Pier E. L., Brauer M., Filut A., et al. (2018) 'Low Agreement among Reviewers Evaluating the Same NIH Grant Applications', Proceedings of the National Academy of Sciences, 115: 2952–7.

Prinses Beatrix Spierfonds ( n.d. ) Gebruikerscommissie < https://www.spierfonds.nl/wie-wij-zijn/gebruikerscommissie > accessed 18 Sep 2022 .

Rathenau Instituut ( 2020 ) Private Non-profit Financiering van Onderzoek in Nederland < https://www.rathenau.nl/nl/wetenschap-cijfers/geld/wat-geeft-nederland-uit-aan-rd/private-non-profit-financiering-van#:∼:text=R%26D%20in%20Nederland%20wordt%20gefinancierd,aan%20wetenschappelijk%20onderzoek%20in%20Nederland > accessed 6 Mar 2022 .

Reneman   R. S. , Breimer   M. L. , Simoons   J. , et al.  ( 2010 ) ‘ De toekomst van het cardiovasculaire onderzoek in Nederland. Sturing op synergie en impact ’, Den Haag : Nederlandse Hartstichting .

Reed   M. S. , Ferré   M. , Marin-Ortega   J. , et al.  ( 2021 ) ‘ Evaluating Impact from Research: A Methodological Framework ’, Research Policy , 50 : 104147.

Reijmerink   W. and Oortwijn   W. ( 2017 ) ‘ Bevorderen van Verantwoorde Onderzoekspraktijken Door ZonMw ’, Beleidsonderzoek Online. accessed 6 Mar 2022.

Reijmerink   W. , Vianen   G. , Bink   M. , et al.  ( 2020 ) ‘ Ensuring Value in Health Research by Funders’ Implementation of EQUATOR Reporting Guidelines: The Case of ZonMw ’, Berlin : REWARD|EQUATOR .

Reinhart   M. ( 2010 ) ‘ Peer Review Practices: A Content Analysis of External Reviews in Science Funding ’, Research Evaluation , 19 : 317 – 31 .

Reinhart   M. and Schendzielorz   C. ( 2021 ) Trends in Peer Review . SocArXiv . < https://osf.io/preprints/socarxiv/nzsp5 > accessed 29 Aug 2022.

Roumbanis   L. ( 2017 ) ‘ Academic Judgments under Uncertainty: A Study of Collective Anchoring Effects in Swedish Research Council Panel Groups ’, Social Studies of Science , 47 : 95 – 116 .

——— ( 2021a ) ‘ Disagreement and Agonistic Chance in Peer Review ’, Science, Technology & Human Values , 47 : 1302 – 33 .

——— ( 2021b ) ‘ The Oracles of Science: On Grant Peer Review and Competitive Funding ’, Social Science Information , 60 : 356 – 62 .

VSNU, NFU, KNAW, NWO en ZonMw ( 2019 ) ‘ Ruimte voor ieders talent (Position Paper) ’, Den Haag . < https://www.universiteitenvannederland.nl/recognitionandrewards/wp-content/uploads/2019/11/Position-paper-Ruimte-voor-ieders-talent.pdf >.

DORA ( 2013 ) San Francisco Declaration on Research Assessment . < https://sfdora.org > accessed 2 Jan 2022 .

Sarewitz   D. and Pielke   R. A.  Jr. ( 2007 ) ‘ The Neglected Heart of Science Policy: Reconciling Supply of and Demand for Science ’, Environmental Science & Policy , 10 : 5 – 16 .

Scholten   W. , Van Drooge   L. , and Diederen   P. ( 2018 ) Excellent Is Niet Gewoon. Dertig Jaar Focus op Excellentie in het Nederlandse Wetenschapsbeleid . The Hague : Rathenau Instituut .

Shapin   S. ( 2008 ) The Scientific Life : A Moral History of a Late Modern Vocation . Chicago : University of Chicago press .

Spaapen   J. and Van Drooge   L. ( 2011 ) ‘ Introducing “Productive Interactions” in Social Impact Assessment ’, Research Evaluation , 20 : 211 – 8 .

Travis   G. D. L. and Collins   H. M. ( 1991 ) ‘ New Light on Old Boys: Cognitive and Institutional Particularism in the Peer Review System ’, Science, Technology & Human Values , 16 : 322 – 41 .

Van Arensbergen   P. and Van den Besselaar   P. ( 2012 ) ‘ The Selection of Scientific Talent in the Allocation of Research Grants ’, Higher Education Policy , 25 : 381 – 405 .

Van Arensbergen   P. , Van der Weijden   I. , and Van den Besselaar   P. V. D. ( 2014a ) ‘ The Selection of Talent as a Group Process: A Literature Review on the Social Dynamics of Decision Making in Grant Panels ’, Research Evaluation , 23 : 298 – 311 .

—— ( 2014b ) ‘ Different Views on Scholarly Talent: What Are the Talents We Are Looking for in Science? ’, Research Evaluation , 23 : 273 – 84 .

Van den Brink , G. , Scholten , W. , and Jansen , T. , eds ( 2016 ) Goed Werk voor Academici . Culemborg : Stichting Beroepseer .

Weingart   P. ( 1999 ) ‘ Scientific Expertise and Political Accountability: Paradoxes of Science in Politics ’, Science & Public Policy , 26 : 151 – 61 .

Wessely   S. ( 1998 ) ‘ Peer Review of Grant Applications: What Do We Know? ’, The Lancet , 352 : 301 – 5 .



National Research Council (US) Panel on the Evaluation of AIDS Interventions; Coyle SL, Boruch RF, Turner CF, editors. Evaluating AIDS Prevention Programs: Expanded Edition. Washington (DC): National Academies Press (US); 1991.


1 Design and Implementation of Evaluation Research

Evaluation has its roots in the social, behavioral, and statistical sciences, and it relies on their principles and methodologies of research, including experimental design, measurement, statistical tests, and direct observation. What distinguishes evaluation research from other social science is that its subjects are ongoing social action programs that are intended to produce individual or collective change. This setting usually engenders a great need for cooperation between those who conduct the program and those who evaluate it. This need for cooperation can be particularly acute in the case of AIDS prevention programs because those programs have been developed rapidly to meet the urgent demands of a changing and deadly epidemic.

Although the characteristics of AIDS intervention programs place some unique demands on evaluation, the techniques for conducting good program evaluation do not need to be invented. Two decades of evaluation research have provided a basic conceptual framework for undertaking such efforts (see, e.g., Campbell and Stanley [1966] and Cook and Campbell [1979] for discussions of outcome evaluation; see Weiss [1972] and Rossi and Freeman [1982] for process and outcome evaluations); in addition, similar programs, such as the antismoking campaigns, have been subject to evaluation, and they offer examples of the problems that have been encountered.

In this chapter the panel provides an overview of the terminology, types, designs, and management of evaluation research. The following chapter provides an overview of program objectives and the selection and measurement of appropriate outcome variables for judging the effectiveness of AIDS intervention programs. These issues are discussed in detail in the subsequent, program-specific Chapters 3-5.

  • Types of Evaluation

The term evaluation implies a variety of different things to different people. The recent report of the Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences defines the area through a series of questions (Turner, Miller, and Moses, 1989:317-318):

Evaluation is a systematic process that produces a trustworthy account of what was attempted and why; through the examination of results—the outcomes of intervention programs—it answers the questions, "What was done?" "To whom, and how?" and "What outcomes were observed?" Well-designed evaluation permits us to draw inferences from the data and addresses the difficult question: "What do the outcomes mean?"

These questions differ in how difficult they are to answer. An evaluation that tries to determine the outcomes of an intervention and what those outcomes mean is a more complicated endeavor than an evaluation that assesses the process by which the intervention was delivered. Both kinds of evaluation are necessary because they are intimately connected: to establish a project's success, an evaluator must first ask whether the project was implemented as planned and then whether its objective was achieved. Questions about a project's implementation usually fall under the rubric of process evaluation. If the investigation involves rapid feedback to the project staff or sponsors, particularly at the earliest stages of program implementation, the work is called formative evaluation. Questions about effects or effectiveness are variously called summative evaluation, impact assessment, or outcome evaluation; the panel uses the last of these terms.

Formative evaluation is a special type of early evaluation that occurs during and after a program has been designed but before it is broadly implemented. Formative evaluation is used to understand the need for the intervention and to make tentative decisions about how to implement or improve it. During formative evaluation, information is collected and then fed back to program designers and administrators to enhance program development and maximize the success of the intervention. For example, formative evaluation may be carried out through a pilot project before a program is implemented at several sites. A pilot study of a community-based organization (CBO), for example, might be used to gather data on problems involving access to and recruitment of targeted populations and the utilization and implementation of services; the findings of such a study would then be used to modify (if needed) the planned program.

Another example of formative evaluation is the use of a "story board" design of a TV message that has yet to be produced. A story board is a series of text and sketches of camera shots that are to be produced in a commercial. To evaluate the effectiveness of the message and forecast some of the consequences of actually broadcasting it to the general public, an advertising agency convenes small groups of people to react to and comment on the proposed design.

Once an intervention has been implemented, the next stage of evaluation is process evaluation, which addresses two broad questions: "What was done?" and "To whom, and how?" Ordinarily, process evaluation is carried out at some point in the life of a project to determine how and how well the delivery goals of the program are being met. When intervention programs continue over a long period of time (as is the case for some of the major AIDS prevention programs), measurements at several times are warranted to ensure that the components of the intervention continue to be delivered by the right people, to the right people, in the right manner, and at the right time. Process evaluation can also play a role in improving interventions by providing the information necessary to change delivery strategies or program objectives in a changing epidemic.

Research designs for process evaluation include direct observation of projects, surveys of service providers and clients, and the monitoring of administrative records. The panel notes that the Centers for Disease Control (CDC) is already collecting some administrative records on its counseling and testing program and community-based projects. The panel believes that this type of evaluation should be a continuing and expanded component of intervention projects to guarantee the maintenance of the projects' integrity and responsiveness to their constituencies.

The purpose of outcome evaluation is to identify consequences and to establish that consequences are, indeed, attributable to a project. This type of evaluation answers the questions, "What outcomes were observed?" and, perhaps more importantly, "What do the outcomes mean?" Like process evaluation, outcome evaluation can also be conducted at intervals during an ongoing program, and the panel believes that such periodic evaluation should be done to monitor goal achievement.

The panel believes that these stages of evaluation (i.e., formative, process, and outcome) are essential to learning how AIDS prevention programs contribute to containing the epidemic. After a body of findings has been accumulated from such evaluations, it may be fruitful to launch another stage of evaluation: cost-effectiveness analysis (see Weinstein et al., 1989). Like outcome evaluation, cost-effectiveness analysis also measures program effectiveness, but it extends the analysis by adding a measure of program cost. The panel believes that consideration of cost-effectiveness analysis should be postponed until more experience is gained with formative, process, and outcome evaluation of the CDC AIDS prevention programs.
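As a concrete, if simplified, illustration of that extension, the sketch below computes an incremental cost-effectiveness ratio. Every figure in it is invented for the example rather than taken from the report; the point is only the arithmetic of dividing incremental cost by incremental effect.

```python
# Hypothetical illustration of a cost-effectiveness ratio; all numbers are invented.
program_cost = 450_000.0          # total cost of delivering the intervention (USD)
comparison_cost = 150_000.0       # cost of the comparison condition (USD)
infections_averted_program = 30   # infections averted, as estimated by an outcome evaluation
infections_averted_comparison = 10

incremental_cost = program_cost - comparison_cost
incremental_effect = infections_averted_program - infections_averted_comparison

# Incremental cost-effectiveness ratio: dollars per additional infection averted.
icer = incremental_cost / incremental_effect
print(f"Incremental cost per infection averted: ${icer:,.0f}")
```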

  • Evaluation Research Design

Process and outcome evaluations require different types of research designs, as discussed below. Formative evaluations, which are intended to both assess implementation and forecast effects, use a mix of these designs.

Process Evaluation Designs

To conduct process evaluations on how well services are delivered, data need to be gathered on the content of interventions and on their delivery systems. Suggested methodologies include direct observation, surveys, and record keeping.

Direct observation designs include case studies, in which participant-observers unobtrusively and systematically record encounters within a program setting, and nonparticipant observation, in which long, open-ended (or "focused") interviews are conducted with program participants. 1 For example, "professional customers" at counseling and testing sites can act as project clients to monitor activities unobtrusively; 2 alternatively, nonparticipant observers can interview both staff and clients. Surveys —either censuses (of the whole population of interest) or samples—elicit information through interviews or questionnaires completed by project participants or potential users of a project. For example, surveys within community-based projects can collect basic statistical information on project objectives, what services are provided, to whom, when, how often, for how long, and in what context.

Record keeping consists of administrative or other reporting systems that monitor use of services. Standardized reporting ensures consistency in the scope and depth of data collected. To use the media campaign as an example, the panel suggests using standardized data on the use of the AIDS hotline to monitor public attentiveness to the advertisements broadcast by the media campaign.

These designs are simple to understand, but they require expertise to implement. For example, observational studies must be conducted by people who are well trained in how to carry out on-site tasks sensitively and to record their findings uniformly. Observers can either complete narrative accounts of what occurred in a service setting or they can complete some sort of data inventory to ensure that multiple aspects of service delivery are covered. These types of studies are time-consuming and benefit from corroboration among several observers. The use of surveys in research is well understood, although they, too, require expertise to be well implemented. As the program chapters reflect, survey data collection must be carefully designed to reduce problems of validity and reliability and, if samples are used, an appropriate sampling scheme must be developed. Record keeping or service inventories are probably the easiest research designs to implement, although preparing standardized internal forms requires attention to detail about salient aspects of service delivery.

Outcome Evaluation Designs

Research designs for outcome evaluations are meant to assess principal and relative effects. Ideally, to assess the effect of an intervention on program participants, one would like to know what would have happened to the same participants in the absence of the program. Because it is not possible to make this comparison directly, inference strategies that rely on proxies have to be used. Scientists use three general approaches to construct proxies for use in the comparisons required to evaluate the effects of interventions: (1) nonexperimental methods, (2) quasi-experiments, and (3) randomized experiments. The first two are discussed below, and randomized experiments are discussed in the subsequent section.

Nonexperimental and Quasi-Experimental Designs 3

The most common form of nonexperimental design is a before-and-after study. In this design, pre-intervention measurements are compared with equivalent measurements made after the intervention to detect change in the outcome variables that the intervention was designed to influence.

Although the panel finds that before-and-after studies frequently provide helpful insights, the panel believes that these studies do not provide sufficiently reliable information to be the cornerstone for evaluation research on the effectiveness of AIDS prevention programs. The panel's conclusion follows from the fact that the postintervention changes cannot usually be attributed unambiguously to the intervention. 4 Plausible competing explanations for differences between pre- and postintervention measurements will often be numerous, including not only the possible effects of other AIDS intervention programs, news stories, and local events, but also the effects that may result from the maturation of the participants and the educational or sensitizing effects of repeated measurements, among others.
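To make the inferential problem concrete, the following toy simulation (all parameters invented, not drawn from the report or from any AIDS program data) shows how a before-and-after comparison folds a secular trend—say, citywide media coverage—into the apparent program effect, whereas a concurrent comparison group separates the two.

```python
import random

random.seed(0)

BASELINE = 0.40          # invented baseline prevalence of the risk behavior
SECULAR_TREND = -0.08    # invented citywide decline unrelated to the program
TRUE_EFFECT = -0.05      # invented: the program lowers prevalence by 5 points
N = 5_000                # people surveyed at each measurement

def prevalence(p, n):
    """Simulated measured prevalence: share of n respondents reporting the behavior."""
    return sum(random.random() < p for _ in range(n)) / n

pre_program  = prevalence(BASELINE, N)
post_program = prevalence(BASELINE + SECULAR_TREND + TRUE_EFFECT, N)
pre_control  = prevalence(BASELINE, N)
post_control = prevalence(BASELINE + SECULAR_TREND, N)   # trend only, no program

before_after = post_program - pre_program                 # trend + effect, confounded
with_control = (post_program - pre_program) - (post_control - pre_control)

print(f"before-and-after estimate: {before_after:+.3f}")
print(f"with a comparison group:   {with_control:+.3f}  (true effect {TRUE_EFFECT:+.3f})")
```

In this toy setup the before-and-after figure roughly doubles the true effect because the secular decline is credited to the program.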

Quasi-experimental and matched control designs provide a separate comparison group. In these designs, the control group may be selected by matching nonparticipants to participants in the treatment group on the basis of selected characteristics. It is difficult to ensure the comparability of the two groups even when they are matched on many characteristics because other relevant factors may have been overlooked or mismatched or they may be difficult to measure (e.g., the motivation to change behavior). In some situations, it may simply be impossible to measure all of the characteristics of the units (e.g., communities) that may affect outcomes, much less demonstrate their comparability.

Matched control designs require extraordinarily comprehensive scientific knowledge about the phenomenon under investigation in order for evaluators to be confident that all of the relevant determinants of outcomes have been properly accounted for in the matching. Three types of information or knowledge are required: (1) knowledge of intervening variables that also affect the outcome of the intervention and, consequently, need adjustment to make the groups comparable; (2) measurements on all intervening variables for all subjects; and (3) knowledge of how to make the adjustments properly, which in turn requires an understanding of the functional relationship between the intervening variables and the outcome variables. Satisfying each of these information requirements is likely to be more difficult than answering the primary evaluation question, "Does this intervention produce beneficial effects?"

Given the size and the national importance of AIDS intervention programs and given the state of current knowledge about behavior change in general and AIDS prevention, in particular, the panel believes that it would be unwise to rely on matching and adjustment strategies as the primary design for evaluating AIDS intervention programs. With differently constituted groups, inferences about results are hostage to uncertainty about the extent to which the observed outcome actually results from the intervention and is not an artifact of intergroup differences that may not have been removed by matching or adjustment.
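That residual uncertainty can be made concrete with another toy simulation (again with invented parameters, purely illustrative): even when participants and nonparticipants are compared within levels of a measured characteristic, an unmeasured trait such as motivation to change behavior can drive both program enrollment and outcomes, producing an apparent effect where none exists.

```python
import random

random.seed(1)

TRUE_EFFECT = 0.0   # invented: the program actually does nothing
N = 20_000

def person():
    """One simulated individual with a measured and an unmeasured characteristic."""
    age_group = random.choice(["under25", "25plus"])   # measured, used for matching
    motivation = random.random()                       # unmeasured
    # Motivated people are more likely to join the program...
    joins = random.random() < 0.2 + 0.6 * motivation
    # ...and more likely to improve regardless of the program.
    improves = random.random() < 0.2 + 0.5 * motivation + TRUE_EFFECT * joins
    return age_group, joins, improves

people = [person() for _ in range(N)]

def improvement_rate(group):
    return sum(improves for _, _, improves in group) / len(group)

# "Match" on the measured characteristic by comparing within each age group.
for age in ("under25", "25plus"):
    treated  = [p for p in people if p[0] == age and p[1]]
    controls = [p for p in people if p[0] == age and not p[1]]
    gap = improvement_rate(treated) - improvement_rate(controls)
    print(f"{age}: apparent 'effect' after matching = {gap:+.3f} (true effect is 0)")
```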

Randomized Experiments

A remedy for the inferential uncertainties that afflict nonexperimental designs is provided by randomized experiments. In such experiments, one singly constituted group is established for study. A subset of the group is then randomly chosen to receive the intervention, with the other subset becoming the control. The two groups are not identical, but they are comparable: because they are two random samples drawn from the same population, they are not systematically different in any respect, and this holds for all variables—both known and unknown—that can influence the outcome. Dividing a singly constituted group into two random and therefore comparable subgroups cuts through the tangle of causation and establishes a basis for the valid comparison of respondents who do and do not receive the intervention. Randomized experiments provide for clear causal inference by solving the problem of group comparability, and they may be used to answer the evaluation questions "Does the intervention work?" and "What works better?"

Which question is answered depends on whether the controls receive an intervention or not. When the object is to estimate whether a given intervention has any effects, individuals are randomly assigned to the project or to a zero-treatment control group. The control group may be put on a waiting list or simply not get the treatment. This design addresses the question, "Does it work?"

When the object is to compare variations on a project—e.g., individual counseling sessions versus group counseling—then individuals are randomly assigned to these two regimens, and there is no zero-treatment control group. This design addresses the question, "What works better?" In either case, the control groups must be followed up as rigorously as the experimental groups.
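A minimal sketch of these mechanics, with invented outcome rates, is given below: one singly constituted group is split at random, and the unadjusted difference in outcomes between the two arms is an unbiased estimate of the intervention's effect.

```python
import random

random.seed(2)

N = 10_000
BASE_RATE = 0.30       # invented success rate under the zero-treatment control condition
TRUE_EFFECT = 0.10     # invented: the intervention raises the success rate by 10 points

# One singly constituted group of eligible individuals, randomly split in two.
people = list(range(N))
random.shuffle(people)
treatment, control = people[: N // 2], people[N // 2 :]

def outcome(in_treatment):
    """Simulated binary outcome for one person."""
    p = BASE_RATE + (TRUE_EFFECT if in_treatment else 0.0)
    return random.random() < p

treated_rate = sum(outcome(True) for _ in treatment) / len(treatment)
control_rate = sum(outcome(False) for _ in control) / len(control)

# "Does it work?": the unadjusted difference between arms estimates the effect.
# Comparing two program variants ("What works better?") works the same way,
# with variant B taking the place of the zero-treatment control.
print(f"estimated effect: {treated_rate - control_rate:+.3f} (true effect {TRUE_EFFECT:+.3f})")
```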

A randomized experiment requires that individuals, organizations, or other treatment units be randomly assigned to one of two or more treatments or program variations. Random assignment ensures that the estimated differences between the groups so constituted are statistically unbiased; that is, any differences in effects measured between them are a result of treatment. The absence of statistical bias in groups constituted in this fashion stems from the fact that random assignment ensures that there are no systematic differences between them, differences that can and usually do affect groups composed in ways that are not random. 5 The panel believes this approach is far superior to the nonrandomized and quasi-experimental approaches for outcome evaluations of AIDS interventions. Therefore,

To improve interventions that are already broadly implemented, the panel recommends the use of randomized field experiments of alternative or enhanced interventions.

Under certain conditions, the panel also endorses randomized field experiments with a nontreatment control group to evaluate new interventions. In the context of a deadly epidemic, ethics dictate that treatment not be withheld simply for the purpose of conducting an experiment. Nevertheless, there may be times when a randomized field test of a new treatment with a no-treatment control group is worthwhile. One such time is during the design phase of a major or national intervention.

Before a new intervention is broadly implemented, the panel recommends that it be pilot tested in a randomized field experiment.

The panel considered the use of experiments with delayed rather than no treatment. A delayed-treatment control group strategy might be pursued when resources are too scarce for an intervention to be widely distributed at one time. For example, a project site that is waiting to receive funding for an intervention would be designated as the control group. If it is possible to randomize which projects in the queue receive the intervention, an evaluator could measure and compare outcomes after the experimental group had received the new treatment but before the control group received it. The panel believes that such a design can be applied only in limited circumstances, such as when the groups involved would have access to related services in their communities and when conducting the study is likely to lead to greater access or better services. For example, a study cited in Chapter 4 used a randomized delayed-treatment experiment to measure the effects of a community-based risk reduction program. However, such a strategy may be impractical for several reasons, including:

  • sites waiting for funding for an intervention might seek resources from another source;
  • it might be difficult to enlist the nonfunded site and its clients to participate in the study;
  • there could be an appearance of favoritism toward projects whose funding was not delayed.

Although randomized experiments have many benefits, the approach is not without pitfalls. In the planning stages of evaluation, it is necessary to contemplate certain hazards, such as the Hawthorne effect 6 and differential project dropout rates. Precautions must be taken either to prevent these problems or to measure their effects. Fortunately, there is some evidence suggesting that the Hawthorne effect is usually not very large (Rossi and Freeman, 1982:175-176).

Attrition is potentially more damaging to an evaluation, and it must be limited if the experimental design is to be preserved. If sample attrition is not limited in an experimental design, it becomes necessary to account for the potentially biasing impact of the loss of subjects in the treatment and control conditions of the experiment. The statistical adjustments required to make inferences about treatment effectiveness in such circumstances can introduce uncertainties that are as worrisome as those afflicting nonexperimental and quasi-experimental designs. Thus, the panel's recommendation of the selective use of randomized design carries an implicit caveat: To realize the theoretical advantages offered by randomized experimental designs, substantial efforts will be required to ensure that the designs are not compromised by flawed execution.
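A toy simulation (invented dropout pattern, purely illustrative) shows why differential attrition is so damaging: if participants who fare poorly drop out of the treatment arm more often than they drop out of the control arm, the effect estimated from those who remain is inflated.

```python
import random

random.seed(3)

N_PER_ARM = 5_000
BASE_RATE = 0.30
TRUE_EFFECT = 0.05           # invented true improvement from the intervention

def arm(success_rate, dropout_if_failure):
    """Simulate one arm; people with poor outcomes may be lost to follow-up."""
    observed = []
    for _ in range(N_PER_ARM):
        success = random.random() < success_rate
        dropped = (not success) and (random.random() < dropout_if_failure)
        if not dropped:
            observed.append(success)
    return observed

# Differential attrition: unsuccessful treatment participants drop out more often.
treatment_observed = arm(BASE_RATE + TRUE_EFFECT, dropout_if_failure=0.30)
control_observed   = arm(BASE_RATE,               dropout_if_failure=0.05)

rate = lambda xs: sum(xs) / len(xs)
estimate = rate(treatment_observed) - rate(control_observed)
print(f"estimate among those followed up: {estimate:+.3f} (true effect {TRUE_EFFECT:+.3f})")
```

Under these invented dropout rates the surviving sample suggests an effect more than twice the true one, which is why limiting and documenting attrition is part of preserving the design.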

Another pitfall of randomization is its appearance of unfairness or unattractiveness to participants and the controversial legal and ethical issues it sometimes raises. Often, what is being criticized is the control of project assignment of participants rather than the use of randomization itself. In deciding whether random assignment is appropriate, it is important to consider the specific context of the evaluation and how participants would be assigned to projects in the absence of randomization. The Federal Judicial Center (1981) offers five threshold conditions for the use of random assignment:

  • Does present practice or policy need improvement?
  • Is there significant uncertainty about the value of the proposed regimen?
  • Are there acceptable alternatives to randomized experiments?
  • Will the results of the experiment be used to improve practice or policy?
  • Is there a reasonable protection against risk for vulnerable groups (i.e., individuals within the justice system)?

The parent committee has argued that these threshold conditions apply in the case of AIDS prevention programs (see Turner, Miller, and Moses, 1989:331-333).

Although randomization may be desirable from an evaluation and ethical standpoint, and acceptable from a legal standpoint, it may be difficult to implement from a practical or political standpoint. Again, the panel emphasizes that questions about the practical or political feasibility of the use of randomization may in fact refer to the control of program allocation rather than to the issues of randomization itself. In fact, when resources are scarce, it is often more ethical and politically palatable to randomize allocation rather than to allocate on grounds that may appear biased.

It is usually easier to defend the use of randomization when the choice has to do with assignment to groups receiving alternative services than when the choice involves assignment to groups receiving no treatment. For example, in comparing a testing and counseling intervention that offered a special "skills training" session in addition to its regular services with a counseling and testing intervention that offered no additional component, random assignment of participants to one group rather than another may be acceptable to program staff and participants because the relative values of the alternative interventions are unknown.

The more difficult issue is the introduction of new interventions that are perceived to be needed and effective in a situation in which there are no services. An argument that is sometimes offered against the use of randomization in this instance is that interventions should be assigned on the basis of need (perhaps as measured by rates of HIV incidence or of high-risk behaviors). But this argument presumes that the intervention will have a positive effect—which is unknown before evaluation—and that relative need can be established, which is a difficult task in itself.

The panel recognizes that community and political opposition to randomization to zero treatments may be strong and that enlisting participation in such experiments may be difficult. This opposition and reluctance could seriously jeopardize the production of reliable results if it is translated into noncompliance with a research design. The feasibility of randomized experiments for AIDS prevention programs has already been demonstrated, however (see the review of selected experiments in Turner, Miller, and Moses, 1989:327-329). The substantial effort involved in mounting randomized field experiments is repaid by the fact that they can provide unbiased evidence of the effects of a program.

Unit of Assignment

The unit of assignment of an experiment may be an individual person, a clinic (i.e., the clientele of the clinic), or another organizational unit (e.g., the community or city). The treatment unit is selected at the earliest stage of design. Variations of units are illustrated in the following four examples of intervention programs.

  1. Two different pamphlets (A and B) on the same subject (e.g., testing) are distributed in an alternating sequence to individuals calling an AIDS hotline. The outcome to be measured is whether the recipient returns a card asking for more information.

  2. Two instruction curricula (A and B) about AIDS and HIV infections are prepared for use in high school driver education classes. The outcome to be measured is a score on a knowledge test.

  3. Of all clinics for sexually transmitted diseases (STDs) in a large metropolitan area, some are randomly chosen to introduce a change in the fee schedule. The outcome to be measured is the change in patient load.

  4. A coordinated set of community-wide interventions—involving community leaders, social service agencies, the media, community associations, and other groups—is implemented in one area of a city. Outcomes are knowledge as assessed by testing at drug treatment centers and STD clinics and condom sales in the community's retail outlets.

In example (1), the treatment unit is an individual person who receives pamphlet A or pamphlet B. If either "treatment" is applied again, it would be applied to a person. In example (2), the high school class is the treatment unit; everyone in a given class experiences either curriculum A or curriculum B. If either treatment is applied again, it would be applied to a class. The treatment unit is the clinic in example (3), and in example (4), the treatment unit is a community.

The consistency of the effects of a particular intervention across repetitions justly carries a heavy weight in appraising the intervention. It is important to remember that the number of repetitions of a treatment or intervention is the number of treatment units to which the intervention is applied. This is a salient principle in the design and execution of intervention programs as well as in the assessment of their results.

The adequacy of the proposed sample size (number of treatment units) has to be considered in advance. Adequacy depends mainly on two factors:

  • How much variation occurs from unit to unit among units receiving a common treatment? If that variation is large, then the number of units needs to be large.
  • What is the minimum size of a possible treatment difference that, if present, would be practically important? That is, how small a treatment difference is it essential to detect if it is present? The smaller this quantity, the larger the number of units that are necessary.

Many formal methods for considering and choosing sample size exist (see, e.g., Cohen, 1988). Practical circumstances occasionally allow choosing between designs that involve units at different levels; thus, a classroom might be the unit if the treatment is applied in one way, but an entire school might be the unit if the treatment is applied in another. When both approaches are feasible, the use of a power analysis for each approach may lead to a reasoned choice.
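As one example of such a formal method, the sketch below applies a standard normal-approximation formula for comparing two proportions. The target rates, cluster size, and intraclass correlation are invented planning figures, not values from the report, and the adjustment for assigning intact groups (such as classrooms) rather than individuals uses the usual design-effect inflation of 1 + (m - 1) times the intraclass correlation.

```python
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Invented planning figures: detect a drop in risk-behavior prevalence from 40% to 30%.
individuals = n_per_arm(0.40, 0.30)

# If intact classrooms of 25 students are assigned instead of individuals, inflate by the
# design effect 1 + (m - 1) * icc, here with an assumed intraclass correlation of 0.02.
design_effect = 1 + (25 - 1) * 0.02
clustered = individuals * design_effect

print(f"individuals per arm:        {individuals:.0f}")
print(f"with classroom assignment:  {clustered:.0f}  (design effect {design_effect:.2f})")
```

In practice the result is rounded up, and larger unit-to-unit variation or a smaller detectable difference drives the required number of units up, as the two considerations above imply.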

Choice of Methods

There is some controversy about the advantages of randomized experiments in comparison with other evaluative approaches. It is the panel's belief that when a (well executed) randomized study is feasible, it is superior to alternative kinds of studies in the strength and clarity of whatever conclusions emerge, primarily because the experimental approach avoids selection biases. 7 Other evaluation approaches are sometimes unavoidable, but ordinarily the accumulation of valid information will go more slowly and less securely than in randomized approaches.

Experiments in medical research shed light on the advantages of carefully conducted randomized experiments. The Salk vaccine trials are a successful example of a large, randomized study. In a double-blind test 8 of the polio vaccine, children in various communities were randomly assigned to one of two treatments, either the vaccine or a placebo. By this method, the effectiveness of the Salk vaccine was demonstrated in one summer of research (Meier, 1957).

A sufficient accumulation of relevant, observational information, especially when collected in studies using different procedures and sample populations, may also clearly demonstrate the effectiveness of a treatment or intervention. The process of accumulating such information can be a long one, however. When a (well-executed) randomized study is feasible, it can provide evidence that is subject to less uncertainty in its interpretation, and it can often do so in a more timely fashion. In the midst of an epidemic, the panel believes it proper that randomized experiments be one of the primary strategies for evaluating the effectiveness of AIDS prevention efforts. In making this recommendation, however, the panel also wishes to emphasize that the advantages of the randomized experimental design can be squandered by poor execution (e.g., by compromised assignment of subjects, significant subject attrition rates, etc.). To achieve the advantages of the experimental design, care must be taken to ensure that the integrity of the design is not compromised by poor execution.

In proposing that randomized experiments be one of the primary strategies for evaluating the effectiveness of AIDS prevention programs, the panel also recognizes that there are situations in which randomization will be impossible or, for other reasons, cannot be used. In its next report the panel will describe at length appropriate nonexperimental strategies to be considered in situations in which an experiment is not a practical or desirable alternative.

  • The Management of Evaluation

Conscientious evaluation requires a considerable investment of funds, time, and personnel. Because the panel recognizes that resources are not unlimited, it suggests that they be concentrated on the evaluation of a subset of projects to maximize the return on investment and to enhance the likelihood of high-quality results.

Project Selection

Deciding which programs or sites to evaluate is by no means a trivial matter. Selection should be carefully weighed so that projects that are not replicable or that have little chance for success are not subjected to rigorous evaluations.

The panel recommends that any intensive evaluation of an intervention be conducted on a subset of projects selected according to explicit criteria. These criteria should include the replicability of the project, the feasibility of evaluation, and the project's potential effectiveness for prevention of HIV transmission.

If a project is replicable, it means that the particular circumstances of service delivery in that project can be duplicated. In other words, for CBOs and counseling and testing projects, the content and setting of an intervention can be duplicated across sites. Feasibility of evaluation means that, as a practical matter, the research can be done: that is, the research design is adequate to control for rival hypotheses, it is not excessively costly, and the project is acceptable to the community and the sponsor. Potential effectiveness for HIV prevention means that the intervention is at least based on a reasonable theory (or mix of theories) about behavioral change (e.g., social learning theory [Bandura, 1977], the health belief model [Janz and Becker, 1984], etc.), if it has not already been found to be effective in related circumstances.

In addition, since it is important to ensure that the results of evaluations will be broadly applicable,

The panel recommends that evaluation be conducted and replicated across major types of subgroups, programs, and settings. Attention should be paid to geographic areas with low and high AIDS prevalence, as well as to subpopulations at low and high risk for AIDS.

Research Administration

The sponsoring agency interested in evaluating an AIDS intervention should consider the mechanisms through which the research will be carried out as well as the desirability of both independent oversight and agency in-house conduct and monitoring of the research. The appropriate entities and mechanisms for conducting evaluations depend to some extent on the kinds of data being gathered and the evaluation questions being asked.

Oversight and monitoring are important to keep projects fully informed about the other evaluations relevant to their own and to render assistance when needed. Oversight and monitoring are also important because evaluation is often a sensitive issue for project and evaluation staff alike. The panel is aware that evaluation may appear threatening to practitioners and researchers because of the possibility that evaluation research will show that their projects are not as effective as they believe them to be. These needs and vulnerabilities should be taken into account as evaluation research management is developed.

Conducting the Research

To conduct some aspects of a project's evaluation, it may be appropriate to involve project administrators, especially when the data will be used to evaluate delivery systems (e.g., to determine when and which services are being delivered). To evaluate outcomes, the services of an outside evaluator 9 or evaluation team are almost always required because few practitioners have the professional experience or the time and resources needed to do evaluation. The outside evaluator must have relevant expertise in evaluation research methodology and must also be sensitive to the fears, hopes, and constraints of project administrators.

Several evaluation management schemes are possible. For example, a prospective AIDS prevention project group (the contractor) can bid on a contract for project funding that includes an intensive evaluation component. The actual evaluation can be conducted either by the contractor alone or by the contractor working in concert with an outside independent collaborator. This mechanism has the advantage of involving project practitioners in the work of evaluation as well as building separate but mutually informing communities of experts around the country. Alternatively, a contract can be let with a single evaluator or evaluation team that will collaborate with the subset of sites that is chosen for evaluation. This variation would be managerially less burdensome than awarding separate contracts, but it would require greater dependence on the expertise of a single investigator or investigative team. ( Appendix A discusses contracting options in greater depth.) Both of these approaches accord with the parent committee's recommendation that collaboration between practitioners and evaluation researchers be ensured. Finally, in the more traditional evaluation approach, independent principal investigators or investigative teams may respond to a request for proposal (RFP) issued to evaluate individual projects. Such investigators are frequently university-based or are members of a professional research organization, and they bring to the task a variety of research experiences and perspectives.

Independent Oversight

The panel believes that coordination and oversight of multisite evaluations is critical because of the variability in investigators' expertise and in the results of the projects being evaluated. Oversight can provide quality control for individual investigators and can be used to review and integrate findings across sites for developing policy. The independence of an oversight body is crucial to ensure that project evaluations do not succumb to the pressures for positive findings of effectiveness.

When evaluation is to be conducted by a number of different evaluation teams, the panel recommends establishing an independent scientific committee to oversee project selection and research efforts, corroborate the impartiality and validity of results, conduct cross-site analyses, and prepare reports on the progress of the evaluations.

The composition of such an independent oversight committee will depend on the research design of a given program. For example, the committee ought to include statisticians and other specialists in randomized field tests when that approach is being taken. Specialists in survey research and case studies should be recruited if either of those approaches is to be used. Appendix B offers a model for an independent oversight group that has been successfully implemented in other settings—a project review team, or advisory board.

Agency In-House Team

As the parent committee noted in its report, evaluations of AIDS interventions require skills that may be in short supply for agencies invested in delivering services (Turner, Miller, and Moses, 1989:349). Although this situation can be partly alleviated by recruiting professional outside evaluators and retaining an independent oversight group, the panel believes that an in-house team of professionals within the sponsoring agency is also critical. The in-house experts will interact with the outside evaluators and provide input into the selection of projects, outcome objectives, and appropriate research designs; they will also monitor the progress and costs of evaluation. These functions require not just bureaucratic oversight but appropriate scientific expertise.

This is not intended to preclude the direct involvement of CDC staff in conducting evaluations. However, given the great amount of work to be done, it is likely a considerable portion will have to be contracted out. The quality and usefulness of the evaluations done under contract can be greatly enhanced by ensuring that there are an adequate number of CDC staff trained in evaluation research methods to monitor these contracts.

The panel recommends that CDC recruit and retain behavioral, social, and statistical scientists trained in evaluation methodology to facilitate the implementation of the evaluation research recommended in this report.

Interagency Collaboration

The panel believes that the federal agencies that sponsor the design of basic research, intervention programs, and evaluation strategies would profit from greater interagency collaboration. The evaluation of AIDS intervention programs would benefit from a coherent program of studies that should provide models of efficacious and effective interventions to prevent further HIV transmission, the spread of other STDs, and unwanted pregnancies (especially among adolescents). A marriage could then be made of basic and applied science, from which the best evaluation is born. Exploring the possibility of interagency collaboration and CDC's role in such collaboration is beyond the scope of this panel's task, but it is an important issue that we suggest be addressed in the future.

Costs of Evaluation

In view of the dearth of current evaluation efforts, the panel believes that vigorous evaluation research must be undertaken over the next few years to build up a body of knowledge about what interventions can and cannot do. Dedicating no resources to evaluation will virtually guarantee that high-quality evaluations will be infrequent and the data needed for policy decisions will be sparse or absent. Yet, evaluating every project is not feasible simply because there are not enough resources and, in many cases, evaluating every project is not necessary for good science or good policy.

The panel believes that evaluating only some of a program's sites or projects, selected under the criteria noted in Chapter 4, is a sensible strategy. Although we recommend that intensive evaluation be conducted on only a subset of carefully chosen projects, we believe that high-quality evaluation will require a significant investment of time, planning, personnel, and financial support. The panel's aim is to be realistic—not discouraging—when it notes that the costs of program evaluation should not be underestimated. Many of the research strategies proposed in this report require investments that are perhaps greater than has been previously contemplated. This is particularly the case for outcome evaluations, which are ordinarily more difficult and expensive to conduct than formative or process evaluations. And those costs will be additive with each type of evaluation that is conducted.

Panel members have found that the cost of an outcome evaluation sometimes equals or even exceeds the cost of actual program delivery. For example, it was reported to the panel that randomized studies used to evaluate recent manpower training projects cost as much as the projects themselves (see Cottingham and Rodriguez, 1987). In another case, the principal investigator of an ongoing AIDS prevention project told the panel that the cost of randomized experimentation was approximately three times higher than the cost of delivering the intervention (albeit the study was quite small, involving only 104 participants) (Kelly et al., 1989). Fortunately, only a fraction of a program's projects or sites need to be intensively evaluated to produce high-quality information, and not all will require randomized studies.

Because of the variability in kinds of evaluation that will be done as well as in the costs involved, there is no set standard or rule for judging what fraction of a total program budget should be invested in evaluation. Based upon very limited data 10 and assuming that only a small sample of projects would be evaluated, the panel suspects that program managers might reasonably anticipate spending 8 to 12 percent of their intervention budgets to conduct high-quality evaluations (i.e., formative, process, and outcome evaluations). 11 Larger investments seem politically infeasible and unwise in view of the need to put resources into program delivery. Smaller investments in evaluation risk studying an inadequate sample of program types and may also invite compromises in research quality.

The nature of the HIV/AIDS epidemic mandates an unwavering commitment to prevention programs, and the prevention activities require a similar commitment to the evaluation of those programs. The magnitude of what can be learned from doing good evaluations will more than balance the magnitude of the costs required to perform them. Moreover, it should be realized that the costs of shoddy research can be substantial, both in their direct expense and in the lost opportunities to identify effective strategies for AIDS prevention. Once the investment has been made, however, and a reservoir of findings and practical experience has accumulated, subsequent evaluations should be easier and less costly to conduct.

  • Bandura, A. (1977) Self-efficacy: Toward a unifying theory of behavioral change. Psychological Review 84:191-215.
  • Campbell, D. T., and Stanley, J. C. (1966) Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
  • Centers for Disease Control (CDC) (1988) Sourcebook presented at the National Conference on the Prevention of HIV Infection and AIDS Among Racial and Ethnic Minorities in the United States (August).
  • Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum Associates.
  • Cook, T. D., and Campbell, D. T. (1979) Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin.
  • Federal Judicial Center (1981) Experimentation in the Law. Washington, D.C.: Federal Judicial Center.
  • Janz, N. K., and Becker, M. H. (1984) The health belief model: A decade later. Health Education Quarterly 11(1):1-47.
  • Kelly, J. A., St. Lawrence, J. S., Hood, H. V., and Brasfield, T. L. (1989) Behavioral intervention to reduce AIDS risk activities. Journal of Consulting and Clinical Psychology 57:60-67.
  • Meier, P. (1957) Safety testing of poliomyelitis vaccine. Science 125(3257):1067-1071.
  • Roethlisberger, F. J., and Dickson, W. J. (1939) Management and the Worker. Cambridge, Mass.: Harvard University Press.
  • Rossi, P. H., and Freeman, H. E. (1982) Evaluation: A Systematic Approach. 2nd ed. Beverly Hills, Calif.: Sage Publications.
  • Turner, C. F., Miller, H. G., and Moses, L. E., eds. (1989) AIDS, Sexual Behavior, and Intravenous Drug Use. Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press.
  • Weinstein, M. C., Graham, J. D., Siegel, J. E., and Fineberg, H. V. (1989) Cost-effectiveness analysis of AIDS prevention programs: Concepts, complications, and illustrations. In C. F. Turner, H. G. Miller, and L. E. Moses, eds., AIDS, Sexual Behavior, and Intravenous Drug Use. Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press.
  • Weiss, C. H. (1972) Evaluation Research. Englewood Cliffs, N.J.: Prentice-Hall.

  • Notes

  1. On occasion, nonparticipants observe behavior during or after an intervention. Chapter 3 introduces this option in the context of formative evaluation.

  2. The use of professional customers can raise serious concerns in the eyes of project administrators at counseling and testing sites. The panel believes that site administrators should receive advance notification that professional customers may visit their sites for testing and counseling services and should provide their consent before this method of data collection is used.

  3. Parts of this section are adapted from Turner, Miller, and Moses (1989:324-326).

  4. This weakness has been noted by CDC in a sourcebook provided to its HIV intervention project grantees (CDC, 1988:F-14).

  5. The significance tests applied to experimental outcomes calculate the probability that any observed differences between the sample estimates might result from random variations between the groups.

  6. Research participants' knowledge that they were being observed had a positive effect on their responses in a series of famous studies conducted at Western Electric's Hawthorne Works in Chicago (Roethlisberger and Dickson, 1939); the phenomenon is referred to as the Hawthorne effect.

  7. Participants who self-select into a program are likely to differ from nonrandom comparison groups in interests, motivations, values, abilities, and other attributes that can bias the outcomes.

  8. A double-blind test is one in which neither the person receiving the treatment nor the person administering it knows which treatment (or no treatment) is being given.

  9. As discussed under "Agency In-House Team," the outside evaluator might be one of CDC's personnel. However, given the large amount of research to be done, it is likely that non-CDC evaluators will also need to be used.

  10. See, for example, Chapter 3, which presents cost estimates for evaluations of media campaigns. Similar estimates are not readily available for other program types.

  11. For example, the U.K. Health Education Authority (that country's primary agency for AIDS education and prevention programs) allocates 10 percent of its AIDS budget for research and evaluation of its AIDS programs (D. McVey, Health Education Authority, personal communication, June 1990). This allocation covers both process and outcome evaluation.
