Research Papers in Education
Vol. 22, No. 2, June 2007, pp. 213–228
ISSN 0267-1522 (print)/ISSN 1470-1146 (online)/07/020213–16
© 2007 Taylor & Francis
DOI: 10.1080/02671520701296189
Weight of Evidence: a framework for the
appraisal of the quality and relevance of
evidence
David Gough*
Institute of Education, University of London, UK
Knowledge use and production are complex, and so too are attempts to judge their quality. Research synthesis is a set of formal processes for determining what is known from research in relation to different research questions, and this process requires judgements of the quality and relevance of the research evidence considered. Such judgements can be made according to generic standards or be specific to the review question. They interact with other judgements in the review process, such as inclusion criteria and search strategies, and can be absolute or weighted judgements combined in a Weight of Evidence framework. Judgements also vary depending upon the type of review, which can range from statistical meta-analysis to meta-ethnography. Empirical study of the ways that quality and relevance judgements are made can illuminate the nature of such decisions and their impact on epistemic and other domains of knowledge. Greater clarity about the underlying ideological and theoretical differences can enable more participative debate about them.
Keywords: Knowledge use; Research quality; Research synthesis; Systematic reviews
Introduction
Furlong and Oancea (2006) suggest that there are a number of different domains that
need to be considered when assessing quality in applied and practice-based research;
these domains they describe as the epistemic, the phronetic, and the technical and
economic. For them, quality in applied and practice-based research needs to be
conceptualized more broadly than has conventionally been the case.
However, if research is to be of value in applied contexts, then these issues of
quality cannot be judged only according to abstract generic criteria but must also
include notions of fitness for purpose and relevance of research in answering different
*Institute of Education, University of London, 20 Bedford Way, London WC1H 0AL, UK. Email:
D.Gough@ioe.ac.uk
conceptual or empirical questions. In other words, question-specific quality and relevance
criteria are used to determine how much ‘Weight of Evidence’ should be given
to the findings of a research study in answering a particular research question.
This contribution addresses these issues with reference to systematic reviews and
systematic research synthesis where a number of studies are considered individually
to see how they collectively can answer a research question. The contribution is principally
concerned with the quality and relevance appraisal of this epistemic knowledge.
Providing greater clarity on how epistemic knowledge is developed and used
can make its role more transparent in relation to the other domains of knowledge
described by Oancea and Furlong.
Systematic synthesis is a set of formal processes for bringing together different types
of evidence so that we can be clear about what we know from research and how we
know it (Gough & Elbourne, 2002; Gough, 2004). These processes include making
judgements about the quality and relevance assessment of that evidence. The contribution
focuses particularly on the systematic methods of research synthesis but
systematic methods of synthesis and arguments of Weight of Evidence can be applied
to the (epistemic) evaluation of all types of knowledge.
Being specific about what we know and how we know it requires us to become
clearer about the nature of the evaluative judgements we are making about the questions
that we are asking, the evidence we select, and the manner in which we appraise
and use it. This then can contribute to our theoretical and empirical understanding
of quality and relevance assessment. The questions that we ask of research are many
and come from a variety of different individual and group perspectives with differing
ideological and theoretical assumptions. In essence, the appraisal of evidence is an
issue of examining and making explicit the plurality of what we know and can know.
The contribution first sets the scene with a brief sketch of how research is just one
of many activities concerned with knowledge production and its appraisal and use.
Second, it introduces evidence synthesis and the crucial role of quality and relevance
assessment in that process to judge how much ‘weight’ should be given to the findings
of a research study in answering a review question. Finally, it is argued that we should
study how judgements of quality are made in practice and thus develop our sophistication
in quality appraisal and synthesis of research and other evidence.
Action, research, knowledge and quality appraisal
We all act on and in the world in different ways and in doing so create different types
of knowledge. The knowledge produced may be relatively tacit or explicit, it can be
used to develop ways of understanding or more directly to inform action with varying
effects, and it can produce ‘capacity for use’ or more direct technological value and
economic and other impacts (Furlong & Oancea, 2006). Particular groups of people
tend to focus on particular activities and produce particular types of products. So
researchers, for example, undertake research to produce knowledge and understanding
and in doing so they probably also produce many other sorts of knowledge.
Working as a researcher can provide experiences ranging from team working with
colleagues and participants of research to the use of computer software and lead to
organizational and practice knowledge about research (Pawson et al., 2003). All
these different types of knowledge can be used in different ways leading to different
intended and unintended and direct and indirect effects. When there is overt use of
knowledge, this use may include an appraisal of its fitness for purpose.
Figure 1 lists some of these main dimensions of the flow between action and
knowledge production. These can be complex and interactive processes that involve
different psychological and social mechanisms and rely on varying ideological and
theoretical standpoints. These different ideologies and theories may be mutually
exclusive or even premised on the need to actively critique the assumptions and
understandings of other perspectives.
Figure 1. Examples of knowledge production and use across different ideological and theoretical standpoints. The figure links actors (researcher, service user, practitioner, policy-maker, organization) to the knowledge types they produce (research, use, practice, policy, organizational), which vary in explicitness (tacit to declarative) and in use (understanding to action), are subject to knowledge appraisal, and lead to physical, social and economic impacts. Note: informed by Pawson et al. (2003) and Furlong and Oancea (2006).

The quality and relevance of all this knowledge can be based on generic criteria or
in relation to some specific criteria and purpose. In relation to generic criteria, any
object might be thought of as high quality because of the materials being used, the
manner in which they have been put together, the beauty of the resulting object, its
fitness for purpose or how form and function combine. For research knowledge, the
research design and its execution is often considered important.
Use-specific criteria may be even more varied. The processes of knowledge creation
and use listed in Figure 1 can be so complex and based on so many different theories
and assumptions that it is difficult to independently determine what the use-specific
criteria should be for assessing quality and relevance of that knowledge. For example,
a policy-maker may have different assumptions about and criteria for evaluating
policy, organizational and research knowledge and may apply knowledge developed
and interpreted within these world views to achieve different physical, social and
economic impacts. They may also use research knowledge to evaluate between policy
choices or to support choices already made (Weiss, 1979).
This complexity provides the background for the focus of this contribution which
is the quality and relevance appraisal of research knowledge. The concern is with the
evaluation of studies in the context of research synthesis that considers all the research
addressing the research questions being asked. Users of research (ranging from
policy-makers to service providers to members of the public) often want to ask what
we know from all research as a whole rather than just considering one individual
study.
Evidence synthesis
Much of our use of knowledge is to answer such questions as ‘How do we conceptualize
and understand this?’ or ‘What do we know about that?’. We can use what we
know from different sorts of knowledge collected and interpreted in different ways
to develop theories, test theories, and make statements about (socially constructed)
facts.
So how do we bring together these different types of knowledge? Just as there are
many methods of primary research there are a myriad of methods for synthesizing
research which have different implications for quality and relevance criteria. A plurality
of perspectives and approaches can be a strength if it is a result of many differing
views contributing to a creative discussion of what we know and how we know it and
what we could know and how we could know it. The challenge is to develop a
language to represent this plurality to enable debate at the level of synthesis of
knowledge rather than at the level of individual studies.
Systematic evidence synthesis reviews
Before discussing these approaches to reviewing literature, it may be helpful to clarify
two confusing aspects of terminology about research reviews. The first issue is the use
of the term ‘systematic’. With both primary qualitative and quantitative research
there is a common expectation that the research is undertaken with rigour according
to some explicit method and with purpose, method and results being clearly
described. All research is in a sense biased by its assumptions and methods but
research using explicit rigorous methods is attempting to minimize bias and make
hidden bias explicit and thus provide a basis for assessing the quality and relevance of
research findings. For some reason, this expectation of being explicit about purpose
and method has not been so prevalent in traditional literature reviews and so there is
a greater need to specify that a review is or is not systematic. In practice, there is a
range of systematic and non systematic reviews including:
● Explicit systematic. Explicit use of rigorous method that can vary at least as much as
the range of methods in primary research.
● Implicit systematic. Rigorous method but not explicitly stated.
● False systematic. Described as systematic but with little evidence of explicit rigorous
method.
● Argument/thematic. A review that aims to explore and usually support a particular
argument or theme with no pretension to use an explicit rigorous method (though
thematic reviews can be systematic).
● Expert or ad hoc review. Informed by the skill and experience of the reviewer but no
clear method so open to hidden bias.
● Rapid evidence assessment. A rapid review that may or may not be rigorous and
systematic. If it is systematic then in order to be rapid it is likely to be limited in
some explicit aspect of scope.
The second term requiring clarification is ‘meta-analysis’ which refers to the
combination of results into a new product. Theoretically, meta-analysis can refer to
all types of review but in practice the term has become associated with statistical
meta-analysis of quantitative data. This approach is common in reviews of
controlled trials of the efficacy of treatments in health care. Statistical meta-analysis
is only one form of synthesis with its own particular aims and assumptions. Primary
research varies considerably in aims, methods and assumptions from randomized
controlled trials to ethnographies and single case studies. Similarly, synthesis can
range from statistical meta-analysis to various forms of narrative synthesis which
may aim to synthesize facts or conceptual understandings (as in meta-ethnography)
or both empirical and conceptual as in some mixed methods reviews (Harden &
Thomas, 2005). In this way, the rich diversity of research traditions in primary
research is reflected in research reviews that can vary on such basic dimensions as
(Gough, 2007):
● The nature of the questions being asked.
● A priori or emergent methods of review.
● Numerical or narrative evidence and analysis (confusingly, some use the term
narrative to refer to traditional ad hoc reviews).
● Purposive or exhaustive strategies for obtaining evidence for inclusion.
● Homogeneity and heterogeneity of the evidence considered.
● ‘Empirical’ or ‘conceptual’ data and analysis.
● Integrative or interpretative synthesis of evidence.
To date, systematic reviews have addressed relatively few types of research question.
Current work by the Methods for Research Synthesis Node of the ESRC
National Centre for Research Methods (see www.ncrm.ac.uk/nodes/mrs/) is examining
the extent of variation in questions posed in primary research across the social
sciences. It is then using this to create a matrix of review questions to consider possible
review methods for each of these questions in order to assist the further development
of synthesis methods.
Stages of a review
The variation in aims and methods of synthesis means that there is not one standard
process but many approaches to reviewing. Many of these include several of the stages
of reviews shown in Figure 2.

Figure 2. Stages of a review. (i) Systematic map of research activity: formulate review question and develop protocol; define studies to be considered (inclusion criteria); search for studies (search strategy); screen studies (check that they meet inclusion criteria); describe studies (systematic map of research). (ii) Systematic synthesis of research evidence: all the stages of a map plus appraise study quality and relevance; synthesise findings (answering the review question); communicate and engage.

This list of stages oversimplifies the diversity of approaches to reviews and does not
apply in reviews with emergent iterative methods, but a brief description of each of
these stages is provided here to allow some understanding of what can be involved in
a review and thus the role of quality and relevance appraisal in this process:
● Review question. Determining the question being asked and its scope and implicit
assumptions and conceptual framework and thus informing the methods to be
used in the review (sometimes known as the protocol). For example, a review
asking the question ‘What do we know about the effects of travel on children?’
needs to specify what is meant by children, travel and the effects of travel. It also
needs to be clear about the conceptual assumptions implicit in the question that
will drive the methods of the review and the way that it answers the question.
● Inclusion and exclusion criteria. The definition of the evidence to be considered in
addressing the question being asked. This might include, for example, the specification
of the topic and focus, the types of research method, and the time and place
that the research was undertaken. In a review with an emergent iterative method
the inclusion criteria may not become fully clear until the later stages of the review.
● Search strategy. The methods used to identify evidence meeting the inclusion and
exclusion criteria. This might include, for example, methods of searching such as
electronic and hand searching and sources to search such as bibliographic databases,
websites, and books and journals. Searching also varies in whether it is
aiming to be exhaustive. Other strategies include sampling studies meeting the
inclusion criteria, searching until saturation where no extra information is being
provided by further studies, or for the search to be more iterative and explorative.
● Screening. Checking that the evidence found does meet the definitional criteria
for inclusion. In searching electronic bibliographic databases, the majority of
papers identified may not be on the topic or meet the other inclusion criteria for
the review. For example, a search strategy on children and travel may identify
studies on adult issues concerning travel with children rather than the effects of
travel on children.
● Mapping. Describing the evidence found and thus mapping the research activity.
Such maps are an important review product in their own right in describing a
research field. They can also inform a synthesis review by allowing a consideration
of whether all or part of the map best answers the review question and should be
synthesized in a two-stage review. For example, a map of research on the effect of
travel on children may include all types of travel and all types of effect but the in-depth
review and synthesis might narrow down to examine the effect of different
modes of travel to school on exercise, food intake, cognition, social mixing, and
knowledge of local environments. This would exclude the effects of most long
distance travel, non-school travel and many other effects of travel such as safety,
pollution, and self-determination in travel (Gough et al., 2001). The synthesis
might also be limited to the types of research method thought to best address the
review question.
● Data extraction. More detailed description of each piece of evidence to inform quality
and relevance assessment and synthesis. The data extracted may include basic
descriptive information on the research, the research results and other detailed
information on the study to inform quality and relevance appraisal to judge the
usefulness of the results for answering the review question.
● Quality and relevance appraisal. Evaluating the extent that each piece of the
evidence contributes to answering the review question. Even if a study has met the
initial inclusion criteria for the review it may not meet the quality and relevance
standards for the review.
● Synthesis. Aggregation, integration and interpretation of all of the evidence considered
to answer the review question.
● Communication, interpretation and application of review findings.
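As a purely illustrative aid (the stage names follow Figure 2; the Python representation is an assumption added here for illustration and is not part of the article or of any review tool), the two-stage structure can be sketched as follows:

# Illustrative sketch of the two-stage structure in Figure 2. The stage names
# are taken from the figure; the code itself is an added illustration, not
# part of any actual review software.

MAP_STAGES = [
    "formulate review question and develop protocol",
    "define studies to be considered (inclusion criteria)",
    "search for studies (search strategy)",
    "screen studies (check that they meet inclusion criteria)",
    "describe studies (systematic map of research)",
]

# A systematic map is a review product in its own right; a synthesis review
# carries out all of the map stages plus appraisal and synthesis.
SYNTHESIS_STAGES = MAP_STAGES + [
    "appraise study quality and relevance",
    "synthesise findings (answering the review question)",
    "communicate and engage",
]

for number, stage in enumerate(SYNTHESIS_STAGES, start=1):
    print(f"{number}. {stage}")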
The processes of systematic reviewing are explicit methods for bringing together
what we know and how we know it. This not only provides accessibility for all users
of research to research findings, it also provides an opportunity for users of research
to be involved in the ways in which the reviews were undertaken, including the
conceptual and ideological assumptions and the questions being asked, and so
provides a way for these users to become actively involved in the research agenda.
This approach provides a means for greater democratic participation in a research process that is otherwise largely under the control of research funders and academics. Users can also be explicitly involved in deliberative processes that bring other factors and knowledge to bear in interpreting and applying the research findings
(Gough, in press).
Quality and relevance assessment
Stage of review for study appraisal
In order to synthesize what we know from research, we need to ensure that the
evidence is of sufficient and appropriate quality and relevance. In the stages of a
review described in Figure 2, quality and relevance assessment occurs between
mapping and synthesis. In some approaches to synthesis the type of evidence to be
included or excluded in the review might be considered to be an issue of quality and
thus part of the application of inclusion and exclusion criteria in the early stages of a
review.
Even if the actual process of assessment occurs at a later stage of the review process
it can be considered a form of inclusion criteria. The reason for occurring later in the
process may simply be because it is only after mapping or data extraction that there
is sufficient information available to make the assessment. Also, when the assessment
is made, it may not be an all or none decision of inclusion but one of weighting studies
in terms of quality and relevance and thus the extent that their results contribute to
the synthesis.
In other cases, the quality and relevance assessment can only occur later in the
process because they occur at the same time as synthesis. One example is a technique
called sensitivity analysis (Higgins & Green, 2006). This is a process where the effect of
including or excluding lower quality studies is assessed. If the effect is minimal the
studies may be included in the final results of the review.
Another example of quality assessment at the synthesis stage occurs in some of the
more interpretative types of synthesis. In this case, quality and relevance assessment
is an integral part of the process of synthesis, where the value of a piece of evidence is
assessed according to what extra it contributes to the synthesis (for example in realist
synthesis, Pawson, 2006).
Taking all these issues together, there are at least the following ways in which
the assessment of quality and relevance can occur in the process of a synthesis
review:
1. Initial exclusion criteria. Exclusion of types of evidence at the start of the review
process: the exclusion of certain studies on the basis of their evidence type or
very basic aspects of quality of the study. For example, the inclusion of only
ethnographic studies or only randomized controlled trials. This narrow approach to the research designs included may exclude studies whose designs are not ideal for addressing the review question but which might still contain useful information.
2. Mapping stage narrowing of criteria. In a two-stage review it is possible to at first
include a wider group of designs and then to use the mapping stage as an opportunity
to examine the whole field of research and then to maybe narrow down to
a sub-set of the studies. An alternative strategy is to include a wide group of
designs all the way through to the synthesis but to use methods of quality and
relevance to deal with this heterogeneity (as in 3, below).
3. Detailed appraisal. Detailed appraisal of the quality or relevance of the study prior
to synthesis often undertaken after detailed data extraction as this provides the
necessary detailed information for the assessment of studies. This can be: (i)
exclusion of studies not meeting the criteria and so similar to inclusion/exclusion
criteria; or (ii) weighted inclusion of studies assessed as non-optimal on a priori quality or relevance criteria, so that the weighting determines the extent to which such studies influence the conclusions of the review.
4. Emergent criteria. Inclusion, weighted inclusion, or exclusion of studies on the basis of emergent criteria about how well the studies answer the review question. This is similar to
a priori criteria for assessing studies (as in 1, 2 or 3) but based on emergent assessment
of the contribution to answering the review question (just as relevance of
different types of data might only emerge during the process of some qualitative
process studies) (Pawson, 2006).
5. Sensitivity analysis. Studies included or excluded on the basis of quality and
relevance appraisal and the impact on the conclusions of the synthesis. Studies
considered problematic may be included as long as they do not change the
conclusions provided by other studies.
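As a concrete, hypothetical illustration of option 5, the sketch below pools study effects with and without the lower-quality studies and checks whether the conclusion changes. The study data, the simple fixed-effect inverse-variance pooling and the 10% threshold are all invented for the example; they are not drawn from the article, and real reviews would use purpose-built meta-analysis software.

# Hypothetical sketch of a sensitivity analysis (option 5 above): pool study
# effects with and without lower-quality studies and see whether the overall
# conclusion changes. The data and the 10% threshold are invented for
# illustration only.

studies = [
    # (effect size, variance, Weight of Evidence D rating)
    (0.30, 0.010, "high"),
    (0.25, 0.020, "high"),
    (0.45, 0.050, "low"),   # lower-quality study
    (0.10, 0.040, "low"),   # lower-quality study
]

def pooled_effect(data):
    """Fixed-effect, inverse-variance weighted mean of the effect sizes."""
    weights = [1.0 / var for _, var, _ in data]
    return sum(w * eff for w, (eff, _, _) in zip(weights, data)) / sum(weights)

all_studies = pooled_effect(studies)
high_only = pooled_effect([s for s in studies if s[2] == "high"])

print(f"pooled effect, all studies:   {all_studies:.3f}")
print(f"pooled effect, high WoE only: {high_only:.3f}")

# If excluding the lower-quality studies barely changes the estimate, they
# might be retained in the final synthesis; a large change would suggest the
# conclusions are sensitive to study quality.
if abs(all_studies - high_only) < 0.1 * abs(high_only):
    print("conclusions are not sensitive to the lower-quality studies")
else:
    print("conclusions are sensitive to the lower-quality studies")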
Weight of Evidence framework
In addition to the variation in where quality and relevance assessments fit within the
stages of reviews there is the issue of whether generic or review-specific quality
appraisal judgements are being made.
As discussed in the first section of this contribution, the concept of quality is
complex and whatever the nature of the criteria applied, these can refer to more
generic (or intrinsic) or more narrowly purpose- and context-specific judgements. A
research study can therefore be assessed against more generic criteria of quality and/
or against some more purpose-specific criteria.
In a systematic review, a research study judged as high quality against generic criteria
may not necessarily be a good study in the sense of being fit for purpose in answering
the review question. The authors of the original primary study may have executed
the study perfectly, but they undertook the study before the review took place and
could not be expected to know the particular focus of any potential future review.
The generic form of appraisal thus considers whether a study (included in the
review as meeting the inclusion criteria) is well executed, whether or not it is useful
in answering the review question. Such appraisal is likely to be based on whether the study is fit for purpose in a generic sense, that is, whether the results of studies performed in such ways can be trusted, but it does not require any consideration of the quality or relevance of a research study for a particular research review. It is thus a ‘non-review-specific’ judgement.
Review question-specific judgements consider the extent that a study is fit for
purpose as a piece of evidence in addressing the question being considered by a
specific review. In other words, however well executed, does the study help to answer
the review question?
A first dimension of review-specific quality and relevance is the type of research
evidence being employed. A study may be very good of its kind but use a research
design that is not powerful at answering the review question. For example, a randomized
controlled trial is very appropriate for answering questions about the extent of
the efficacy of interventions, but unless it has also included process data, it will not
be so good at answering questions of process or of the prevalence or extent of a
phenomenon. Pre-post non-controlled designs are not as efficient as controlled trials
at addressing questions of the extent of the efficacy of interventions but are often
undertaken to answer such questions due to resource constraints or other reasons
(even if in the long run it may be more expensive to undertake cheaper but inconclusive
studies). On the other hand, descriptive analytic studies can be very powerful for
addressing issues of process but not of extent of effect.
In some reviews there are very narrow inclusion criteria about research design so
only some specific designs will be included in the review. For example, meta-ethnographic
reviews are only likely to include ethnographic primary research studies,
whilst statistical meta-analytic studies of effect may only include controlled quantitative
experimental studies.
If there are not narrow inclusion criteria on research design and a wider range of
designs is included for consideration, then there are issues about the relative extent
that the designs of each of these studies are of sufficient fitness for purpose to be
included in the synthesis in a full or weighted form.
This distinction between generic quality of execution and the appropriateness of
the research design for addressing the review question avoids the confounding of
these different concepts found in many available schemas and checklists for addressing
research study quality. For example, a review asking a ‘what works?’ question
about the efficacy of an intervention may only include randomized controlled trials.
However, Slavin (1984, 1995) has criticized some reviews for including poorly
executed randomized controlled designs whilst omitting good quality non-random
designs. We need a framework that allows the reviewer to make explicit decisions on
these two separate dimensions of quality of execution and appropriateness of design
to answer the review question. Reviewers can thus take a broader approach and
include all designs and, if they wish, give less emphasis to the results of some designs
over others.
A second dimension of review-specific quality and relevance assessment is the
topic focus or context of the evidence. Topic and context can (just like research
design) be an inclusion/exclusion criterion for a review and they can also be part of
the quality and relevance appraisal later in the review process. If they are part of
quality and relevance appraisal later in the review, then it can be a weighted judgement
allowing for a broad range of evidence to be considered that varies depending
on how directly it addresses the focus of the review question. For example, a review
might only include studies from the UK because studies in other countries may be
undertaken in different contexts. Alternatively, the review might include studies
from other countries and treat them equally or might include them and weight them
lower due to the different context. Similar judgements can be made about many
aspects of the studies such as the sample, the definition of what is being studied, the
context and the study measures. For example, a review might want to include all the
research on a topic whatever the research design being used even though those
different designs may differ in their ability to answer the review question and may
require different types of issues to be considered in rating their quality and relevance.
As already discussed in respect of study design, a system of weighting allows
for a review to employ a broader question and thus broader inclusion criteria in the
knowledge that weighted judgements can be applied to the broader range of
evidence identified.
Weight of Evidence is a concept used in several fields (including law and statistics)
referring to the preponderance of evidence to inform decision-making. It is a useful
heuristic for considering how to make separate judgements on different generic and
review-specific criteria and then to combine them to make an overall judgement of
what a study contributes to answering a review question. These can create a Weight of Evidence framework of one generic judgement (Weight of Evidence A), one review-specific judgement of research design (Weight of Evidence B), one review-specific judgement of evidence focus (Weight of Evidence C) and an overall judgement (Weight of Evidence D) (Gough, 2004):
● Weight of Evidence A. This is a generic and thus non-review-specific judgement about the coherence and integrity of the evidence in its own terms. It may apply the criteria generally accepted for evaluating the quality of this type of evidence by those who use and produce it.
● Weight of Evidence B. This is a review-specific judgement about the appropriateness
of that form of evidence for answering the review question, that is the fitness for
purpose of that form of evidence. For example, the relevance of certain research
designs such as experimental studies for answering questions about process.
● Weight of Evidence C. This is a review-specific judgement about the relevance of the
focus of the evidence for the review question. For example, a research study may
not have the type of sample, the type of evidence gathering or analysis that is
central to the review question or it may not have been undertaken in an appropriate
context from which results can be generalized to answer the review question. There
may also be issues of propriety of how the research was undertaken such as the
ethics of the research that could impact on its inclusion and interpretation in a
review (Pawson et al., 2003).
These three sets of judgements can then be combined to form an overall assessment, Weight of Evidence D, of the extent to which a study contributes evidence to answering a review question.
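The framework itself does not prescribe a single rule for combining the three judgements; as reported later in this contribution, some EPPI-Centre review teams averaged the three ratings with equal weights while others prioritized WoE A or WoE C. Purely as an illustrative sketch of such a rule (the three-point scale, the Python representation and the rounding are assumptions added here, not part of the framework):

# Illustrative sketch only: one possible way of recording the three Weight of
# Evidence judgements and combining them into an overall judgement (D). The
# three-point scale and the weighted-average rule are assumptions for
# illustration; the framework leaves the combination rule to the review team.

from dataclasses import dataclass

SCALE = {"low": 1, "medium": 2, "high": 3}
LABELS = {1: "low", 2: "medium", 3: "high"}

@dataclass
class WeightOfEvidence:
    woe_a: str  # generic: coherence and integrity of the evidence in its own terms
    woe_b: str  # review-specific: appropriateness of the form of evidence (design)
    woe_c: str  # review-specific: relevance of the focus of the evidence

    def overall(self, weights=(1, 1, 1)):
        """Weight of Evidence D: a weighted average of A, B and C,
        rounded to the nearest point on the scale."""
        scores = (SCALE[self.woe_a], SCALE[self.woe_b], SCALE[self.woe_c])
        mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
        return LABELS[min(3, max(1, round(mean)))]

# A well-executed study (A = high) whose design suits the review question
# (B = high) but whose sample and context are only partly relevant (C = medium):
study = WeightOfEvidence(woe_a="high", woe_b="high", woe_c="medium")
print(study.overall())                   # equal weights -> high
print(study.overall(weights=(1, 1, 3)))  # prioritizing focus (C) -> medium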
The literature contains a number of other frameworks for assessing quality of
research that can be used for systematic reviews, many of which can be incorporated
within the Weight of Evidence framework (see Harden, in press). One example is
TAPUPAS, which lists seven dimensions for assessing research: transparency, accuracy, purposivity, utility, propriety, accessibility and specificity (Pawson et al., 2003). The way in which
TAPUPAS overlaps with and draws attention to issues that can be included within
the Weight of Evidence framework is shown in Figure 3.
Figure 3. Fit between TAPUPAS dimensions and the Weight of Evidence framework:
● Weight of Evidence A (generic: quality of execution of the study): transparency (clarity of purpose); accuracy; accessibility (understandable); specificity (method-specific quality).
● Weight of Evidence B (review-specific: appropriateness of method): purposivity (fit for purpose method).
● Weight of Evidence C (review-specific: focus/approach of the study to the review question): utility (provides relevant answers); propriety (legal and ethical research).
Quality and relevance appraisal of reviews
The discussion so far has focused on the appraisal of individual primary research
studies for inclusion in reviews. There is also the issue of the appraisal of reviews. The
same Weight of Evidence framework can be used for appraising reviews as for
appraising individual studies but the specific issues and criteria will vary with the aims
and methods of the review. An increasing diversity of reviews is emerging and with
this a range of accepted practices that will inform judgements about:
● WoE A: generic issues about quality of the execution of a review such as being
explicit and transparent.
● WoE B: review-specific issues about the particular review design employed and its
relevance to the review question. For example, a statistical meta-analysis might
not provide much useful information about the processes of an educational
intervention.
● WoE C: review-specific issues about the focus of the review. For example, a
narrowly focused review might not provide much breadth about the research
knowledge relevant to answering a review question.
Such Weight of Evidence appraisals can be used for appraising an individual review
or a number of reviews for a review of reviews.
Empirical study of quality
These distinctions on quality, relevance, and weight of evidence provide a structure
for making judgements but do not explain how the specific judgements should be
made. In order to be systematic, a review needs to specify how the different judgements
of quality and relevance were made about each study and how these generic
and review-specific components have been combined to provide an overall judgement
of what each study can or cannot contribute to answering the overall review question.
One strategy is a priori to define how these judgements should be made across the
social sciences. Some progress could be made using this strategy but judgements of
evidence quality and relevance are highly contested and progress on developing
agreement might be slow.
Another, complementary strategy is to examine how people make these judgements
in practice thus making explicit the often implicit ideas about quality and relevance
so that these can be shared, debated and refined. This is the strategy that has been
applied in the EPPI-Centre at the Institute of Education, University of London
(http://eppi.ioe.ac.uk), which has supported well over twenty review groups in undertaking over fifty reviews for the Department for Education and Skills and the Teacher Training and School Development Agency (see also Oakley, 2003).
The majority of the reviews undertaken to date have concerned issues of effectiveness and considered experimental evidence to have most weight, although several teams did not distinguish between randomized controlled trials and quasi-experimental and non-controlled trials. Some review teams have stated that they gave equal
weight to the generic and two review-specific ratings and then took an average score
to rate overall Weight of Evidence. On the other hand, other teams have stated that
they prioritized generic research quality (WoE A), and others have stated they prioritized
focus of the study over other considerations (WoE C). In examining a group of
reviews, Oakley (2003) reports that the review authors rated many studies to be of
low overall quality.
These judgements can be considered in more detail by examining 518 primary
research studies included in these reviews. For most studies (363 studies = 70%) the
rating of execution of study (WoE A) and overall rating (WoE D) were the same. This
suggests that the choice of method, its execution and the focus of methods were
equally important to the review authors.
For nearly a third of studies (155 studies = 30%) they were different, indicating that review-specific issues (WoE B and C) had influenced the overall rating (WoE D). Table 1 shows that for the majority of these cases (116 studies = 75% of the 155 studies), the review-specific ratings (WoE B and C) had lowered the overall rating. For the remaining studies (39 = 25% of the 155 studies) the review-specific ratings (WoE B and C) had resulted in higher overall ratings (WoE D). This suggests that when review-specific issues are important they are more likely to reduce than increase the overall ratings of studies. Table 1 also shows that when review-specific criteria affected the overall score this was more likely (30% compared to 14%) to be due to the effect of the relevance of the focus of the study (WoE C) rather than the choice of study design (WoE B), for both lowering (32% compared to 13%) and raising (26% compared to 15%) the overall rating.

Table 1. Weight of Evidence judgements on the 155 of the 518 studies where different ratings were given to WoE A and WoE D

                              B = C         Due mainly to B   Due mainly to C   Total
A > D (overall lowered)       64 (55%)      15 (13%)          37 (32%)          116 (100%)
A < D (overall raised)        23 (59%)       6 (15%)          10 (26%)           39 (100%)
Total                         87 (56%)      21 (14%)          47 (30%)          155 (100%)
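The percentages reported above and in Table 1 follow directly from the study counts; the short sketch below simply reproduces the arithmetic (the counts are those reported in the text, and only the calculation is added here):

# Reproducing the percentages reported for the EPPI-Centre reviews from the
# counts given in the text and Table 1; only the arithmetic is added here.

total_studies = 518
same_rating = 363          # WoE A equal to WoE D
different_rating = 155     # WoE A different from WoE D
lowered = 116              # review-specific ratings lowered the overall rating
raised = 39                # review-specific ratings raised the overall rating

print(f"same A and D:      {same_rating / total_studies:.0%}")       # ~70%
print(f"different A and D: {different_rating / total_studies:.0%}")  # ~30%
print(f"lowered (A > D):   {lowered / different_rating:.0%}")        # ~75%
print(f"raised (A < D):    {raised / different_rating:.0%}")         # ~25%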
The next stage is to examine in detail the processes by which these review teams
made and justified these assessments. This information is too detailed to be included
in most summary reports but can be included in full technical reports. At the EPPI-Centre,
for example, full technical reports contain specific headings to ensure that the
main methodological issues in undertaking a review are addressed including the
manner in which the review teams justified their Weight of Evidence judgements.
This strategy of making explicit the ways in which review questions relate to
appraisal of evidence enables conscious consideration of methodological decision-making
and fit for purpose evaluation of quality of studies in answering different
review questions.
Conclusion
This contribution opened with a brief discussion of how all different types of human
activity produce different types of tacit and explicit knowledge, which is understood and
used in different ways by people with very differing ideological and conceptual standpoints
to develop theories and empirical statements about the world. This variation
creates immense complexity for the evaluation of the quality of different types of
knowledge but this diversity can be managed and understood by reference to the
world views of those creating and evaluating this knowledge and their reasons for
undertaking such judgements.
The contribution then introduced evidence synthesis as a means of bringing
together what is known in relation to any conceptual or empirical question and
enabling the full range of users of research to be involved in this process. This can
involve quality and relevance assessment of the research studies at various stages of a
review. Despite variations in how such assessments are made there is a distinction
between generic judgements of evidence quality according to generally accepted criteria
(within that approach to evidence) and review-specific evaluations based on the
fitness for purpose of the review. The Weight of Evidence framework helps to clarify
the judgements that are being used in evaluating evidence by enabling explicit
decisions to be made on three dimensions of generic method, review-specific method,
and review-specific focus and context of the study. This approach can be applied to
individual studies, whole reviews, reviews of reviews, and to any quality and relevance
appraisal process. Being explicit about these quality and relevance judgements then
allows the empirical study of how these decisions are being made in practice so that
we can assess and develop how we make these judgements. Ultimately this could
provide the focus for the development of fit for purpose research methods.
In a sense, this approach is an epistemic strategy for making explicit how we identify,
appraise for quality and relevance, and synthesize evidence. This should allow more
open debate about how we make these decisions. This is not a strategy to mechanize
or to take the value out of or in any way constrain these judgements beyond asking
them to be made explicit. On the contrary, the purpose is to make the judgements
more transparent so that they can be considered and debated by all. This can clarify
what decisions are made on the basis of research evidence rather than other important
factors such as values and resources. It can make explicit and open for debate the often
implicit values and other assumptions on which the research was based and on which
its quality is being appraised. It can highlight the values and perspectives behind
research and encourage a greater range of people in society to engage in determining
what questions are asked, the evidence used to answer those questions, and the values
implicit in these processes. It can enable a democratic process both in terms of access
to knowledge and also participation in its creation and use. The more that we involve
a full range of users and potential beneficiaries of research in this process, the more
that we will develop a plurality of knowledge creation and use (Gough, in press).
Acknowledgements
I wish to thank the editors of this volume, John Furlong and Alis Oancea, and Ann
Oakley and Angela Harden for their very helpful comments on earlier drafts of this
contribution.
References
Furlong, J. & Oancea, A. (2006) Assessing quality in applied and practice-based research in education:
a framework for discussion, Review of Australian Research in Education: Counterpoints on
the Quality and Impact of Educational Research, 6, 89–104.
Gough, D. A. (2004) Systematic research synthesis, in: G. Thomas & R. Pring (Eds) Evidence-based
practice in education (Buckingham, Open University Press), 44–62.
Gough, D. (2007) Typology of reviews, poster presented at the NCRM Conference, Manchester.
Gough, D. (in press) Giving voice: evidence-informed policy and practice as a democratizing
process, in: M. Reiss, R. Depalma & E. Atkinson (Eds) Marginality and difference in education
and beyond (London, Trentham Books).
Gough, D. & Elbourne, D. (2002) Systematic research synthesis to inform policy, practice and
democratic debate, Social Policy and Society, 1(3), 225–236.
Gough, D. A., Oliver, S., Brunton, G., Selai, C. & Schaumberg, H. (2001) The effect of travel modes
on children’s mental health, cognitive and social development; a systematic review. Report for
DETR. EPPI Centre, Social Science Research Unit.
Harden, A. (in press) ‘Qualitative’ research, systematic reviews and evidence-informed policy and
practice. Unpublished Ph.D. thesis, University of London.
Harden, A. & Thomas, J. (2005) Methodological issues in combining diverse study types in
systematic reviews, International Journal of Social Research Methodology, 8, 257–271.
Higgins, J. P. T. & Green, S. (2006) Cochrane handbook for systematic reviews of interventions 4.2.6
[updated September 2006]. Section 8.10. Available online at: www.cochrane.org/resources/
handbook/hbook.htm (accessed 22 January 2007).
Oakley, A. (2003) Research evidence, knowledge management and educational practice: early
lessons from a systematic approach, London Review of Education, 1, 21–33.
Pawson, R. (2006) Evidence-based policy: a realist perspective (London, Sage).
Pawson, R., Boaz, A., Grayson, L., Long, A. & Barnes, C. (2003) Types and quality of knowledge in
social care. Knowledge review 3 (London, Social Care Institute of Excellence).
Slavin, R. E. (1984) Meta-analysis in education. How has it been used?, Educational Researcher,
13(8), 6–15.
Slavin, R. E. (1995) Best evidence synthesis: an intelligent alternative to meta-analysis, Journal of
Clinical Epidemiology, 48(1), 9–18.
Weiss, C. H. (1979) The many meanings of research utilization, Public Administration Review, 39,
426–431.


Evidence & Policy • vol 1 • no 1 • 2005 • 5-31
The politics of evidence and methodology:
lessons from the EPPI-Centre
Ann Oakley, David Gough, Sandy Oliver and James Thomas
The ‘evidence movement’ poses particular challenges to social science. This article describes and reflects on some of these challenges, using as a case study the development, over the period from 1993, of the Evidence for Policy and Practice Information and Coordinating Centre (EPPI-Centre) at the Institute of Education, University of London. The article describes the development of EPPI-Centre resources for supporting evidence-informed policy in the fields of education and health
promotion, and in so doing explores many of the challenges that confront research reviewers in the
social sciences. These include technical and methodological issues affecting the quality and reporting
of primary research, and the retrieval quality of bibliographic databases; and wider factors such as
the culture of academia, and research funding practices that militate against the building of a cumulative
evidence base.
Key words evidence-informed policy • social science • methodology • systematic reviews
Key dates final submission 12 April 2004 • acceptance 12 May 2004
The current ethos of ‘evidence-based’ or ‘evidence-informed’ public policy in the
UK poses many challenges. This article describes how these challenges have been
met using as a case study the evolution of one particular academic resource for
promoting evidence-informed policy, that of the Evidence for Policy and Practice
Information and Coordinating (EPPI) Centre at the University of London Institute
of Education. The aim of the article is to describe the genesis of the EPPI-Centre
within the broad social context of the ‘evidence movement’, to examine the micropolitics
of factors influencing the formation of social science and policy initiatives
in the field of evidence-informed policy, and to extrapolate some of the main ongoing
challenges.
The EPPI-Centre’s main business is conducting and supporting systematic reviews
of social science research relevant to different areas of public policy. “Targeted databases
of systematic literature reviews” (Smith, 1996, p 379) are the key methodology of
an evidence-informed society, since what is required is the development of open,
transparent, quality-assured and systematic methods for applying knowledge to the
production of knowledge (CERI, 2000). The EPPI-Centre provides a system for
conducting, storing and disseminating systematic reviews relevant to public policy;
searchable registers of reviews and primary research; training programmes to build
capacity for undertaking systematic reviews; and web-based systems for national
and international collaboration. The centre has formal links to the Cochrane and
Campbell Collaborations, the two major international networks for conducting
systematic research synthesis in health care and social and educational policy.
The EPPI-Centre’s work started in 1993 with a very modest project conducted
in a context where there were few other key players. The climate then in the UK
was largely one of ignorance about the importance of systematic, quality-assured
social science resources for informing public policy. In 2004, by contrast, there is
substantial activity in this area, and much more awareness of the contribution
systematically collected and synthesised social science research evidence can make
to policy and practice across a range of topics.
The background to the ‘evidence movement’
The ‘evidence movement’ is the outcome of a complex web of factors involving
practitioners, the policy-making process, the general global shift to a ‘knowledge
economy’, and changes in the balance of power relations between the creators and
recipients of public policy and related services. The notion that social science ought
to be useful to public policy in the sense of scientifically planned social change is as
old as social science itself (Abrams, 1968; Finch, 1986; Hammersley, 1995). However,
the relationship between research and policy has long been recognised as complex:
variations on Janowitz’s (1972) distinction between ‘engineering’ and ‘enlightenment’
models abound, most of these admitting that research findings are likely to be only
one factor on the policy maker’s horizon (Blume, 1979; Weiss with Bucuvalas, 1980;
Heinemann et al, 1990). Keynes’s famous remark to the effect that government’s
task would be much easier without the complication of research evidence (Skidelsky,
1992, p 630) is another way of saying that the barriers to using research in policy
making are not irrational but, rather, reasonable responses to multiple levels of
accountability.
Late 20th and early 21st-century definitions of ‘evidence’ fall into this untidy
space. They may even encourage a spurious tidiness since, as Nutley and Webb (2000,
p 34) note, “The idea of evidence-based policy and practice fits most naturally with
rational models of the policy-making process”. ‘Evidence-based everything’ has led
to government proclamations of the need for evidence in quarters where it has
scarcely been glimpsed before, such as policing and even defence (Oughton, 2002).
Evidence is, broadly, research findings: the results of “systematic investigation towards
increasing the sum of knowledge” (Davies et al, 2000a, p 3). The Oxford English
Dictionary’s definition of ‘evidence’ is “the available body of facts or information
indicating whether a belief or proposition is true or valid”; validity, as well as availability,
is thus a key issue. Whose beliefs and propositions, and to whom the results of testing
these are available, are central questions that have driven much of the EPPI-Centre’s
work. The background to the recent growth of a concern about evidence includes
the modernising, anti-ideology tone of government, especially in the UK, which,
since the late 1990s, has emphasised the need for policy makers to know what
works in a context of resource-efficiency (Strategic Policy Making Team, 1999;
Young et al, 2002). There is also generally increased scepticism about the value of
professional expertise; evidence-informed practice is, in this sense, not the bedfellow,
but the driver of evidence-informed policy. The enormous growth of information
technologies has made public access to very large amounts of data quick, easy and
virtually free at the point of use. Another factor is that changes in the structure and
function of higher education institutions are transforming these into business
enterprises, dependent for survival on their ability to do the kinds of work that
government, among other agencies, wants done. There are new incentives to move
away from the old ideas of scholarship and ‘basic’ research towards more ‘applied’
policy-relevant activities.
The healthy origins of evidence-informed policy
The present era of evidence-informed policy in the UK begins with a paradox.
Despite a respectable social science history of systematic research synthesis and
experimental evaluation of policy initiatives (see, for example, Rivlin, 1971; Rossi
and Williams, 1972; Oakley, 2000), the current interest in, and antipathy towards,
evidence-informed public policy has its roots in health care. In 1972, Archie Cochrane,
the British epidemiologist, laid down a fundamental challenge to medicine in his
book Effectiveness and efficiency. Cochrane’s criticism was that medicine had not
organised its knowledge in any systematic, reliable and cumulative way; the result
was chaotic, individualistic, often ineffective and sometimes harmful, patterns of
care. This challenge resulted in the setting up in October 1993 of an international
collaboration named after him, devoted to preparing, maintaining and promoting
the accessibility of systematic reviews of the effects of health care interventions.
The two main methodological principles of the Cochrane Collaboration are the
need for unbiased comparisons of different interventions (‘randomised controlled
trials’) and the importance of collating, and constantly updating, evidence from
different studies in order to arrive at the most reliable estimates of effect. These
principles build on substantial bodies of empirical research demonstrating that
misleading policy and practice conclusions may be drawn from evaluations of
interventions that do not take steps to minimise bias, and from reviews of research
evidence that do not attempt to locate as many relevant studies as possible, and/or
which draw conclusions on the basis of samples which are too small to be reliable
(see, for example, Antman et al, 1992; Chalmers and Altman, 1995; Schulz et al,
1995; Moynihan, 2004).
The terminology varies: ‘systematic reviews’, ‘research synthesis’ and ‘meta-analysis’
are all used to refer to the activity of searching for, critically appraising and collating
all available relevant studies on a given topic. ‘Systematic review’ is the most commonly
used term today, although ‘research synthesis’ was favoured by the social scientists
who pioneered the practice (McCall, 1923; Glass et al, 1981; Light and Pillemer,
1984; see also Chalmers et al, 2002, for a discussion). In its social science origins, the
methods of research synthesis were very much an academic response to the same
sort of puzzlement as in health care about what to do in the face of conflicting
results from different studies (Hunt, 1997, p 3). Systematic reviews bring together
the findings of many different research studies in a way that is explicit, transparent,
replicable, accountable and updateable. They differ from traditional literature reviews
which commonly focus on the range and diversity of primary research using a selective,
opportunistic and discursive approach with no clear audit trail leading from the
primary research to the conclusions of the review (Jackson, 1980). The principles of
systematic reviews are simple: specifying particular, answerable research questions,
and criteria about what kinds of studies to include in the domain of literature to be
surveyed; clarity about which literatures will be searched for relevant studies, and
how; making explicit, justifiable decisions about the methodological quality of studies
regarded as generating reliable findings; and some method of integrating the findings
of individual, good quality studies (Cooper and Hedges, 1994a; Chalmers and Altman,
1995; Badger et al, 2000; Davies, 2000).
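These principles can be made concrete with a small sketch. The Python fragment below is illustrative only (the field names, criteria and studies are invented and do not reproduce any EPPI-Centre protocol), but it shows the essential point: when the question, inclusion criteria and quality threshold are stated explicitly, every exclusion can be traced back to a stated rule, which is what gives a systematic review its audit trail.

from dataclasses import dataclass

@dataclass
class Study:
    title: str
    design: str    # e.g. 'outcome evaluation (RCT)', 'survey'
    topic: str
    quality: str   # reviewer-assessed: 'high', 'medium' or 'low'

# The protocol states the question and the explicit criteria up front.
PROTOCOL = {
    "question": "Does intervention X improve outcome Y?",
    "include_designs": {"outcome evaluation (RCT)", "controlled trial"},
    "include_topic": "intervention X",
    "minimum_quality": {"high", "medium"},
}

def include(study):
    """Return True only if the study meets every stated criterion."""
    return (study.design in PROTOCOL["include_designs"]
            and study.topic == PROTOCOL["include_topic"]
            and study.quality in PROTOCOL["minimum_quality"])

studies = [
    Study("Trial of X in primary schools", "outcome evaluation (RCT)", "intervention X", "high"),
    Study("Teachers' views of X", "interview study", "intervention X", "medium"),
    Study("Trial of Z", "outcome evaluation (RCT)", "intervention Z", "high"),
]

included = [s.title for s in studies if include(s)]
print(included)   # only the first study meets all the stated criteria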
The National Perinatal Epidemiology Unit (NPEU), under the directorship of
Iain Chalmers, was the nursery of the Cochrane Collaboration and operated as a
multidisciplinary unit with input from medically trained epidemiologists, economists,
information specialists, social scientists and others (Chalmers, 1991a). This meant
that the applicability of the principles underlying systematic reviews of well-conducted
evaluations to other areas of policy and practice beyond health care was
an obvious subject for debate. One of us (Ann Oakley) was employed as a social
scientist at the NPEU from 1979 to 1984 and so had an opportunity to take part in
these debates. They provided a forum matching the description offered by the well-known
epistemologist, Donald Campbell, of a “disputatious community of ‘truth-seekers’”
as the context for establishing the goals and methods of effective and policy-relevant
science (Campbell, 1988, p 290).
Social science and the Cochrane model: the beginnings
of the EPPI-Centre
By 2000, Donald Campbell’s name was attached, at Iain Chalmers’ suggestion
(Chalmers, 2003, p 34), to a sibling organisation to Cochrane, the Campbell
Collaboration, whose remit is preparing, maintaining and promoting access to
systematic reviews of studies on the effects of interventions in the social, behavioural
and educational arenas. The intervening years are a story of multiple, often poorly
linked, attempts to examine the ways in which other public policy areas might
benefit from what were widely conceived of as versions of ‘the Cochrane model’.
This is the context in which the EPPI-Centre’s own efforts were born.
In 1990, Ann Oakley and colleagues set up the Social Science Research Unit
(SSRU) at the Institute of Education with an explicit remit to undertake policy and
practice-relevant research across the domains of education and health, and a
commitment to the evaluation of interventions in these and related fields. Establishing
accessible and systematically collected and classified information about existing
relevant research was core to the vision of the unit from the very beginning. As part
of its new programme of work, a small (£25,000) grant proposal was submitted to
the Economic and Social Research Council (ESRC) in 1992 to establish what was
described as “a social science database of controlled interventions in education and
social welfare” which would “parallel similar databases established for medical
interventions” and would provide for the social science community a resource of
evidence on the effectiveness of interventions and future research needs. The aims of
the work were to carry out systematic searches of the social science research and
policy literatures from 1940 to date, entering data on well-designed evaluations
(prospective experimental studies with comparison groups) in the fields of education
and social welfare into an ongoing, updateable electronic database, to produce
summaries of the evidence for policy makers and researchers, and to combine
qualitative and quantitative data in evaluation studies. The ESRC at first rejected the
application for what became known as ‘the database project’, on the grounds that
there was no way of knowing whether there would be sufficient material to justify
the specialised computer database, and that the focus on prospective experimental
studies was inappropriate: “randomised control trials [sic] are not a sufficient condition
for sound conclusions”, since, despite their use in health care research, “with double
bind [sic] and other safeguards generally impossible in social science research … it is
not evident that randomised control trials are invariably preferable to, for example,
retrospective case [sic] control methods”. These comments drew energetic and
detailed responses from the applicant, after which the ESRC agreed to fund the
proposal.
The database project, undertaken in the first nine months of 1993, resulted in the
design of a specialist computer database and software for listing, classifying and
describing the scope and range of social science interventions; establishing effective
search strategies for locating over 1,000 reports of evaluations of education and
social welfare interventions; writing detailed methodological guidelines for reviewing
studies, including systematic keyword codes and standardised procedures for extracting
data from primary studies; and completing three systematic reviews. All this work
was infused with a pioneer mentality, and the very large workload was enthusiastically
borne by a small group of contract researchers labouring long hours in an airless
basement room. A second and larger application was made for funding later in 1993,
jointly with Geraldine Macdonald, an academic specialist in social welfare, then
Senior Lecturer in Social Studies at Royal Holloway, University of London, to build
on the first phase of the work. This proposed to develop the software, increasing its
accessibility; expand the literature searches to other fields; produce syntheses of
evidence for academics and policy makers; and develop collaborations with other
social scientists. This time the ESRC rejected the application outright, forwarding
copies of seven peer reviewers’ reports, most of which were positive, but three of
which returned to previous arguments about the inapplicability of ‘health care
evaluation methodology’ to the social policy field.
By this time, 1994, interest in a more systematic approach in the social sciences to
research evidence was growing in various academic and policy circles in the UK.
Within the Cochrane Collaboration, a critical mass of people had raised the need to
expand the collaboration’s remit by undertaking systematic reviews of non-clinical
issues, specifically of interventions aimed at behaviour change among professionals
and the public. A meeting to discuss how to take this forward was held at the King’s
Fund in March 1994; over 170 people from 14 countries expressed an interest. One
suggestion was a health promotion Review Group focussing on sexual health
interventions which “could be built on the data and experience gained at the Social
Science Research Unit” (SSRU) and would be closely linked to the Cochrane
Collaboration. It was recognised that, since much health promotion research came
from the social sciences, there was a need to develop the methodology of systematic
reviews to allow “inclusion of data from qualitative and process investigations”
(Reviews of Behavioural Research, 1994, p 6). The SSRU group offered itself at this
1994 meeting as a reference point for anyone interested in systematic reviews of
health promotion/behavioural change topics. Two subsequent international meetings
held in October 1994 at the Mario Negri Sud Centre in S. Maria Imbaro, Italy, put
firmly on the agenda the need for concerted review work in the area of the
behavioural prevention of sexually transmitted diseases, including and especially
AIDS, and the establishment of an infrastructure to support systematic reviews in
health promotion.
Meanwhile, the ‘database project’ was sustained and expanded with a series of
grants for systematic reviews of health promotion research (France-Dawson et al,
1994; Holland et al, 1994; Oakley and Fullerton, 1994, 1995; Oakley et al, 1998a,
1998b). Various unsuccessful applications during this period signalled the continuing
resistance of social science funding bodies to the whole enterprise of establishing a
resource of reliable policy-relevant review-based knowledge. Doubts were expressed
about whether it was sensible to commit resources to support systems for
accumulating and assessing knowledge on the ‘medical model’ rather than to carrying
out further primary research or quick, cheaper literature reviews.
Consolidation, growth and some uncomfortable revelations,
1995-2000
The two international meetings in Italy in October 1994 were instrumental in
putting the need for a resource to review the evidence from social interventions
high on the health research funding agenda. In their aftermath, the SSRU database
project successfully secured two three-year grants from the North Thames Regional
Health Authority to support initiatives in the move towards evidence-based sexual
health promotion, and from the Department of Health (DH) for the field coordination
of health promotion for the Cochrane Collaboration. These grants enabled the
project to expand its work, although each was relatively modest (one full-time research
fellow, plus computing and administrative support). In 1995, the project was formally
renamed the EPI-Centre (Centre for the Evaluation of Health Promotion and Social
Interventions).
A major plank of the early EPI-Centre work concerned methodology and tools.
The database of research references was linked to the database of information from
systematic reviews, and the software that was used to review primary studies was
developed further. What began as a single-user system located on one computer
developed into a ‘networked’ database that could be used by several people at a time.
Sharing files on a network was relatively new in 1993 and the EPI-Centre invested
in its own server – a Pentium 75 with 32MB of memory – which was placed on the
floor underneath a desk. (As a measure of how much has changed, the SSRU now
has an entire air-conditioned room devoted to its servers.) The priority for database
development was the creation of a quick and user-friendly system that automated
some reviewing tasks, and which could generate structured summaries of studies
and graphs for use in writing up reviews.
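As a rough illustration of what ‘automating some reviewing tasks’ can mean in practice, the sketch below (in Python, with invented field names; it does not reproduce the EPI-Centre software) turns a keyworded study record into the kind of structured summary that can be pasted into a review write-up.

# Invented record fields; the real coding guidelines were far more detailed.
record = {
    "title": "A school-based healthy eating programme",
    "population": "children aged 9-11",
    "design": "outcome evaluation (RCT)",
    "outcomes": ["fruit and vegetable intake", "knowledge of nutrition"],
    "quality": "medium",
}

def structured_summary(rec):
    """Render a coded study record as a short structured summary."""
    return "\n".join([
        "Title: " + rec["title"],
        "Population: " + rec["population"],
        "Design: " + rec["design"],
        "Outcomes: " + "; ".join(rec["outcomes"]),
        "Reviewer-assessed quality: " + rec["quality"],
    ])

print(structured_summary(record))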
Inquiries for help (finding research, designing evaluations) were a constant feature
of the EPI-Centre’s work during this period, running at about 35 per month. Other
significant developments in the period 1995-98 included workshops on evaluation
techniques for the National Audit Office and the UK Evaluation Society, and a
series of workshops on critical appraisal skills for 125 health promotion purchasers,
providers and researchers from both the statutory and voluntary sectors, which was
particularly instructive on the matter of policy makers’ impatience with the slow
pace and inaccessibility of much social science research evidence (Oliver et al, 1996).
More work was done on the guidelines for keywording and assessing studies, building
on the experience of early systematic searches of electronic databases, which revealed
the inconsistent and incomplete indexing of research (Harden et al, 1999). Some
5,000 references were systematically coded according to guidelines that were generic
across different topic areas and linked to the review software. The earlier reviews
were updated and new ones were undertaken. An important development was a
‘descriptive map’ commissioned by the DH of health promotion research relating to
young people. This identified and systematically keyworded 3,500 references related
to key areas listed in the ‘Health of the Nation’ government reports (Peersman,
1996). The map provided the DH with detailed information about which kinds of
studies had been done in the different topic areas; for example, of the 491 studies
found in the area of alcohol use/abuse, 87% were descriptive studies and 13% evaluated
some type of health promotion intervention. Intervention evaluation was generally
a small part of the research picture. During this period, the use of EPI-Centre
systems by external groups was piloted, and researchers at Southampton University
used them to review the evidence for strategies to prevent cervical cancer (Shepherd
et al, 2004).
One major point about systematic reviews is that they permit us to study our own
enterprise (Cooper and Hedges, 1994b, p 523). The early reviews produced a number
of clear lessons about social science research and systematic reviews. The first was
the ‘needle in the haystack’ syndrome; that is, large amounts of literature had to be
searched, often with great difficulty, in order to find small subsets of reliably designed
studies. For example, an analysis of EPI-Centre data in 1996 found that only 2.5%
(78) of the 3,180 references located for five reviews could be classified as
methodologically sound evaluations of interventions providing clear outcome data.
An even smaller proportion (1.2%) comprised ‘sound’ evaluations of effective interventions
(Oakley and Fullerton, 1996). Much of the literature located through electronic
searches is discarded, not because it is of poor quality, but because it is usually
impossible to design searches so that only reports of empirical research or reviews
relevant to the research question are found. Nonetheless, the effort required to find
relevant and reliable studies was a potentially distressing message for policy makers,
as well as for research funders and practitioners. These social science reviews also
confirmed those of health research in showing that better designed studies were
generally less likely to demonstrate effectiveness. In other words, poor study design
and over-optimistic results tended to be linked (Oakley and Fullerton, 1996; see
also Gough, 1993; MacLehose et al, 2000).
A second major finding echoed the conclusions of health care researchers: that
systematic searches usually uncovered many more studies in a particular area than
had previously been suspected. There was considerable duplication of effort, with
research funders themselves often ignorant of multiple efforts commissioned in the
same or adjacent areas. The EPI-Centre’s ‘review of reviews’, a survey completed in
1999 of all available (systematic and non-systematic) reviews of effectiveness in health
promotion, found 401 reviews (Peersman et al, 1999a). These included 156 covering
substance abuse, 72 in the area of sexual health, and 31 relating to accidents – many
more than practitioners, policy makers or researchers expected. Most (326) were
non-systematic reviews, although this did not prevent them making policy and
practice recommendations. The task of interpreting the evidence was complicated
by different reviews in the same topic area drawing on different literatures and
therefore often coming to different conclusions (Oliver et al, 1999; Peersman et al
1999b).
Third, it was clear that the potential usefulness of much research was reduced by
reporting deficiencies. Many studies might have been well conducted, but this could
not be surmised from the published papers. Applicability and generalisability were
limited by scant information about research samples, and there were ethical questions
about information and consent. Again, to take an example from the early EPI-Centre
reviews, of 215 sexual health and workplace intervention studies on the
database in 1997, 12% gave no information about the sex of research participants,
39% omitted details of age, 51% gave no information about social class, and ethnicity
was a missing variable in 56% of the studies; 41% gave no data on participation rates,
and 77% lacked information or were unclear about the informed consent of research
participants (Oakley et al, 1998).
The birth of evidence-informed education and other difficult
issues, 2000-05
In 1998, the ESRC commissioned a scoping exercise to examine the need for
investment in social science resources to support evidence-informed policy. Trevor
Sheldon of the York Institute for Research in the Social Sciences conducted this. It
concluded that the ESRC’s future strategy should include funding both a sustainable
infrastructure for the conduct and maintenance of systematic reviews in socially
useful areas, and work on a methodological framework for social science research
synthesis (Sheldon, 1998). By 2000, many other groups and individuals in the UK
and elsewhere had become involved in exploring issues of social science methodology
and evidence. The Campbell Collaboration, which was formally established at a
meeting in Pennsylvania in February 2000, had its origins in two meetings organised
by the School of Public Policy in London in 1999, attended by many of those active
in promoting systematic synthesis of social science research evidence (papers available
via www.campbellcollaboration.org). This took place against a background of growing
concern among policy makers about the usefulness and accessibility of social science
research (Blunkett, 2000).
In 2003, the Commission on the Social Sciences referred to the continuing poor
quality and inaccessibility of much social science research. The prevailing view in a
large part of central government, according to this report, was that “too much of the
research and analysis produced by academic social scientists is at best unexciting, at
worst simply not up to standard” (Commission on the Social Sciences, 2003, p 72).
A spate of reports on educational research was particularly damning (Hargreaves,
1996; Hillage et al, 1998; Tooley and Darby, 1998; McIntyre and McIntyre, 1999).
These criticisms led directly to the setting up in 1999 by the then Department for
Education and Employment of the National Educational Research Forum (NERF,
www.nerf-uk.org), a body charged with establishing a national framework and
strategy for high quality policy-relevant educational research in England. The forum
is chaired by Michael Peckham, who earlier headed the NHS research and
development programme which helped to fund the Cochrane Collaboration, and
who also set up the School of Public Policy in London, which hosted the start-up
meetings for the Campbell Collaboration. The forum’s Strategy Consultation Paper
(NERF, 2000), produced in 2000, called for a more coherent approach to educational
research and, like the ESRC scoping exercise, suggested a need for infrastructure
funding in this area.
Two major funding initiatives by the Department for Education and Skills (DfES)
and the ESRC resulted from these shifts in focus. The underlying rationale for both
was similar. As the initial tender document for the ESRC initiative put it, “The key
objective is that economic and social research … should provide an improved
knowledge base for users and academics alike that is relevant, accessible and quality-assured
(in terms of the design and methodologies of the research)”. The 61
submissions received by the ESRC in response to this tender were an indication of
the increased social science interest in this area. However, the DfES and ESRC
proceeded with different models. The ESRC decided to fund a national coordinating
centre (the ESRC UK Centre for Evidence-based Policy and Practice at Queen
Mary, University of London) and a number of associated, topic-specific nodes. The
DfES encouraged the different approach of a central training, methodological,
editorial and quality-assurance resource, with external groups of academics,
practitioners, policy makers and users reviewing the research evidence.
The EPI-Centre, which tendered for both initiatives, was awarded a five-year
grant by the DfES to establish a centre for evidence-informed education. The arrival
of ‘EPPI education’ enabled the EPI-Centre to achieve some of the spread envisaged
in the first small ESRC grant proposal in 1993. It was at this point that the second
‘P’ was inserted into the name (the Centre for the Evaluation of Health Promotion
and Social Interventions became the Evidence for Policy and Practice Information
and Coordinating Centre). The objective of the (continuing) DfES-funded initiative
is to provide an infrastructure for groups who want to undertake systematic reviews
of educational research. The EPPI-Centre provides training, support, methodological
tools and guidelines, quality assurance and a free web-based library – the Research
Evidence in Education Library (REEL) – of review protocols, finished reviews and
other databases (www.eppi.ioe.ac.uk). In 2004, the EPPI-Centre is working with
some 25 Review Groups, and some 70 completed systematic reviews are expected
by early 2005.
The development of the EP(P)I-Centre’s work into a broader supportive and
enabling role led to further changes in the infrastructure of the database. This was
moved onto the Internet, so that people can use it wherever they have web access.
The software (EPPI-Reviewer) was adapted to work with the different sets of
guidelines for describing studies and assessing study quality developed in different
topic domains (mainly education and health promotion). Classifications which were
common between guidelines (for example, ‘children’) needed to be identified as
such, so that searches carried out using these common terms would return all relevant
studies. Mirroring the Centre’s interest in developing methods for synthesis, the
software now also supports statistical meta-analysis and integrates with NVivo software
for more ‘qualitative’ analysis.
Challenges of social science research synthesis for informing
public policy
The two main streams of the EPPI-Centre’s work since 2000, in health promotion
and public health and in education, have proceeded on similar but also divergent
tracks. Each programme of work has generated particular challenges, and these in
turn illuminate the factors that promote or obstruct the development of academic
resources for supporting evidence-informed policy.
The main challenges are technical, methodological and epistemological (Oakley,
2003). They all relate to the application of the technology of systematic research
synthesis to the existing domain of social science research.
The primary technical challenge is the relatively undeveloped state of electronic
social science research databases. Many, for example ERIC and PsycLIT, lack
comprehensive keywording, thesauri of standardised search terms and sophisticated
search facilities. Many either classify study design inaccurately or not at all. This
makes literature searching more difficult and time consuming, and calls for the
invention of specialised strategies for each database. The deficiencies of the electronic
databases reflect the non-standardised way in which much social science research is
presented in journals; few social science/policy journals require structured abstracts
or systematic keywords that adequately describe the focus, design and content of
research. All of this hugely expands the ‘needle in a haystack’ syndrome, leading to
the contention that the cost of the effort to search systematically for social science
research literature may be quite disproportionate to the benefit (White, 2001).
A related issue is the way in which much social science research is described. The
reporting problem that was clear in the early health promotion reviews of intervention
studies extends to qualitative research and to education. Table 1 shows methodological
data for 48 ‘qualitative’ mainly school-based studies of children’s and young people’s
views relating to mental health, physical activity and healthy eating; only about half
described their samples or methods. Uncertainty about reliability is highlighted in
the low proportion (25%) of studies that make any explicit attempt to establish this.
Two particular problems concern study design and methodological soundness.
The question asked in a piece of research needs to be matched by a suitable research
design. For example, a question about whether a particular curriculum is effective
in increasing literacy is best answered with an experimental study that includes an
unbiased comparison group. A qualitative study of teachers’ or pupils’ views, while
yielding interesting information, will not answer an impact question, and therefore
cannot inform policy decisions about the general value of such a curriculum in
improving learning outcomes. This mismatch between research question and study
design, which is quite common in the educational research field, would, if applied
strictly, account for a high proportion of studies being assessed as ‘not sound’.
A considerable degree of conceptual confusion abounds in the educational research
community about how to classify different types of study design (this is witnessed in
the appendices to some of the reports on REEL). Despite the fact that most of the
review questions are about impact, evaluation studies have featured much less
prominently in the education than health promotion reviews. A major finding of
many reviews is thus that most studies addressed to answer impact questions are inappropriately designed.
Table 1: ‘Quality’ of qualitative research: studies of children and young
people’s views about mental health, physical activity and healthy eating
(n=48)
Aims clearly stated 42 (88%)
Context of study clearly described 39 (81%)
Sample clearly described 23 (48%)
Methods clearly described 25 (52%)
Attempts made to establish reliability and/or validity of data analysis 12 (25%)
Sources: Harden et al (2001); Rees et al (2001); Shepherd et al (2001); Brunton et al (2003);
Thomas et al (2003)
Early in the DfES-funded programme of education
systematic reviews, it became clear that many of the researchers involved also had
difficulty with the prominence given in existing EPPI-Centre review guidelines to
randomised controlled trials (RCTs) as a key tool of evidence-informed policy.
The many discussions on this point led to an adaptation of the
terminology used in the guidelines for keywording educational study design. The
question about study design was put later in the education guidelines, in order to
reduce sensitivity to the issue, and a terminology of ‘naturally occurring’ versus
‘researcher-manipulated’ evaluation of interventions was adopted. Table 2 compares
the existing keywords for study design used in EPPI-Centre health promotion and
education reviews. One result of these changes in keywording strategy was that it
became impossible to identify RCTs on the database of educational research
references, so an additional question had to be inserted at the end to remedy this.
Without this modification, it would be impossible to answer the important question
as to whether RCTs of policy interventions compared to other study designs lead to different results.
Table 2: EPPI-Centre guidelines for keywording health promotion and education study design

Health promotion: Bibliography; Book review; Case control study; Cohort study; Commentary; Economic evaluation; Glossary; Instrument design; Intervention; Meta analysis; Methodology; Need assessment; Outcome evaluation (RCT / trial / not stated / other design); Process evaluation; Review; Secondary analysis; Secondary report – intervention; Secondary report – outcome evaluation; Secondary report – process evaluation; Survey; Systematic review

Education: A. Description; B. Exploration of relationships; C. Evaluation (a. Naturally occurring; b. Researcher-manipulated); D. Development of methodology; E. Review (a. Systematic review; b. Other review)

Sources: Peersman and Oliver (1997); EPPI-Centre (2002)
This is the focus of a project currently exploiting EPPI-Centre
data funded by the NHS methodology programme.
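The practical effect of the change in keywording, and of the remedial question added at the end of the education guidelines, can be shown with a toy example (invented field names; not the EPPI-Reviewer data model): without the supplementary flag, nothing in the education codes distinguishes an RCT from any other researcher-manipulated evaluation.

# Toy records keyworded under the education scheme. The 'randomised' flag
# stands in for the additional question inserted at the end of the guidelines.
education_records = [
    {"id": 1, "design": "evaluation: researcher-manipulated", "randomised": True},
    {"id": 2, "design": "evaluation: researcher-manipulated", "randomised": False},
    {"id": 3, "design": "evaluation: naturally occurring", "randomised": False},
    {"id": 4, "design": "description", "randomised": False},
]

def find_rcts(records):
    """RCTs can only be retrieved via the supplementary flag."""
    return [r["id"] for r in records
            if r["design"] == "evaluation: researcher-manipulated" and r["randomised"]]

print(find_rcts(education_records))   # [1]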
The aim of systematic reviews is to avoid drawing misleading conclusions either
from biases in primary research or from biases in the review process itself, so it
makes little sense to include all research; the point is to derive conclusions from
reliable and relevant research only. Some of the education Review Groups supported
by the EPPI-Centre have been unhappy with the general notion of methodological
soundness, and this has led to further adaptations of the guidelines for assessing
studies. The EPPI-Centre guidelines developed for assessing the methodological
soundness of health promotion effectiveness research, in common with many such
guidelines, contain questions designed to judge bias in the design or reporting of
evaluation studies; parallel criteria for qualitative studies are being developed. The
reluctance of some of the researchers involved in ‘EPPI-education’ to follow this
approach in part reflected the lessons of the early reviews, that much educational
research might turn out to be not sound if this approach were followed, but there is
also real difficulty in judging one’s own and colleagues’ work in what are often
small topic-focused fields. This discomfort led to the suggestion that studies be
assessed on soundness of method (that is, the extent to which a study has been carried
out according to accepted best practice within the terms of the method used); on
appropriateness of study type to answer the particular research question; and on relevance
of the topic focus of the study to the question being posed in the review. These
separate judgements can then be combined in an overall judgement of ‘weight of
evidence’ that determines which studies contribute most to the conclusions of the
review.
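The way the three separate judgements might be brought together can be sketched as follows. The combination rule shown here, taking the lowest of the three ratings, is purely an assumption made for the purposes of illustration; in practice review groups combine the judgements according to their own explicit reasoning.

# Illustrative only: soundness of method, appropriateness of study type and
# topic relevance each rated high/medium/low, then combined. Taking the
# minimum is an assumed rule for this sketch, not EPPI-Centre policy.
RANK = {"low": 0, "medium": 1, "high": 2}

def overall_weight(soundness, appropriateness, relevance):
    """Combine the three judgements into an overall weight of evidence."""
    return min((soundness, appropriateness, relevance), key=RANK.get)

print(overall_weight("high", "medium", "high"))   # 'medium'
print(overall_weight("high", "high", "high"))     # 'high'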
Interestingly, use of these more flexible criteria still revealed a considerable problem
with the quality of social science research evidence. Table 3 shows the conclusions
on methodological quality of the first 22 reviews on REEL using this approach. Of
the 291 studies reviewed in depth, a minority – 22% (63) – were assessed by review
groups as yielding ‘high quality’ overall weight of evidence. A further 48% (141)
were considered to have produced ‘medium’, and 30% (88) ‘low’ quality evidence.
The figures for soundness of method are slightly more encouraging: 25% (73) ‘high’,
49% (141) ‘medium’ and 26% (77) ‘low’ quality studies.
This is a considerable indictment of the state of social science research.
Table 3: Overall weight of evidence and soundness of study design: studies included in the in-depth stages of 22 education reviews

           Overall weight of evidence      Soundness of study design
           n        %                      n        %
High       63       22                     73       25
Medium     140      48                     141      49
Low        88       30                     77       26
Total      291      100                    291      100

Source: Research Evidence in Education Library, June 2004
An example of the way these issues concerning the search for sound literature play
out within particular reviews is the review undertaken by Cordingley and colleagues
(2003) of the impact on teaching and learning of sustained, collaborative continuing
professional development (CPD) for teachers. The review, which was partly funded
by the National Union of Teachers, focused on teachers of the 5-16 age group and
on two aspects of the review question: whether CPD has an impact on teaching or
learning, and, if so, how this impact is realised. Searches of electronic databases and
handsearches identified 13,479 citations. Fifty-two studies – less than 1% of citations
– were relevant to the first stage of the review, the systematic map of research.
Seventeen of these 52 studies were included in the in-depth stage of the review.
Criteria for studies included in this stage were providing information about the
impact on teacher practice and student learning, and also meeting a set of ‘quality’
criteria for study design and reporting (clearly stated aims, clearly identified learning
objectives for teachers, clear description of context and methods, evidence of attempts
made to establish the reliability and validity of data analysis). The authors of each of
these studies reported at least some positive impact on teaching and/or learning
outcomes and/or processes. However, there were no studies which were considered
to contribute high quality evidence to the ‘whether’ question and only one was
judged to provide high quality evidence for the ‘how’ question. In view of this, the
evidence base for reaching reliable conclusions about the impact of CPD on teaching
and learning was considered to be thin.
Some solutions
These challenges of synthesising social science research have led over time to a
number of pragmatic adaptations in the technology of systematic reviews. Building
on the mapping report commissioned by the DH in 1996 (Peersman, 1996), EPPI-Centre
reviews increasingly use a two-stage model of systematic reviews. In stage
one, the relevant literature is located and described in order to provide a ‘map’ of
research activity in the area. ‘Mapping’ the literature is a useful product in itself, and
it also helps to counter the objection that too much literature is found and discarded.
It also helps researchers and policy makers to see what kinds of questions the research
can be used to answer. One implication of a two-stage model is that some reviews
may consist simply of a mapping stage; for example, a map of research on the effects
of travel on children as a scoping study for further research on children’s travel to
school (Gough et al, 2001). In the second stage of a review, a smaller subset of
studies is used to answer a more focused question. Criteria used to select the smaller
group of studies are commonly based on study design and topic as well as specifying
countries and time periods. For example, a systematic map of research on personal
development planning as a strategy for improving student outcomes showed that
most UK research in this field had been descriptive, with most experimental studies
of impact being conducted in the US (Gough et al, 2003). With a two-stage review
process, groups of users also have the opportunity to help define in-depth review
questions (Oliver, 2001). Where research is commissioned directly by policy makers,
this user consultation process can help to make systematic reviews more policy-relevant,
since policy makers are involved in a process of joint ownership of the
knowledge product. This is one of the many ways in which the methods used to
conduct research synthesis can directly influence implementation, since with a process
of joint ownership policy makers are more likely to use the results.
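Put schematically, a two-stage review is two successive filters over the same pool of records, as in the toy sketch below (the records and criteria are invented): stage one yields the descriptive map, and stage two applies narrower criteria (here, arbitrarily, study design, country and time period) to select the studies for in-depth review.

# Toy two-stage review: a broad descriptive map, then a narrower in-depth stage.
records = [
    {"id": 1, "relevant": True,  "design": "RCT",        "country": "UK", "year": 1998},
    {"id": 2, "relevant": True,  "design": "survey",     "country": "UK", "year": 2001},
    {"id": 3, "relevant": True,  "design": "RCT",        "country": "US", "year": 1995},
    {"id": 4, "relevant": False, "design": "commentary", "country": "UK", "year": 2000},
]

# Stage 1: the systematic map describes everything relevant to the broad question.
systematic_map = [r for r in records if r["relevant"]]

# Stage 2: the in-depth review answers a more focused question using a subset,
# selected here by design, country and time period.
in_depth = [r for r in systematic_map
            if r["design"] == "RCT" and r["country"] == "UK" and r["year"] >= 1996]

print(len(systematic_map))            # 3 studies appear in the map
print([r["id"] for r in in_depth])    # [1] goes forward to in-depth review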
The scant evidence available to inform the implementation of effective
interventions invites two responses: improve the standards of effectiveness research,
or synthesise the evidence for ‘promising’ interventions that may ultimately be shown
to be effective. This second option was the prompt for methodological developments
to establish ways of including a wider range of research designs in systematic reviews,
and for making more use of research on the processes involved in intervention
development and impact. These moves have recently been developed considerably
in a series of five reviews carried out by the EPPI-Centre’s health promotion team
(Harden et al, 2001; Rees et al, 2001; Shepherd et al, 2001; Brunton et al, 2003;
Thomas et al, 2003). These reviews are concerned with barriers and facilitators
research in the area of children and young people’s healthy eating and physical
activity and young people’s mental health. What has emerged is an innovative model
in which the findings of well-designed intervention research are set alongside data
from ‘qualitative’ studies that are more likely to represent the views of target
populations. The resultant synthesis provides statistical information about effect sizes
and discriminates between those interventions that appear to have roots in people’s
expressed needs and experiences and those that lack this feature. Policy makers’
‘what works’ questions are thus set in the much broader social context of public
expectations and experiences. From another direction, the ‘weight of evidence’
approach used in many of the EPPI-Centre-supported education reviews also
recognises the relevance of different types of study design.
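The logic of this model can be sketched crudely (the data, themes and field names below are all invented): each evaluated intervention is checked against the needs and barriers voiced in the views studies, so that effect sizes are read alongside whether an intervention has any roots in what target populations say they want.

# Invented example of juxtaposing intervention findings with themes drawn
# from 'qualitative' views studies of children and young people.
views_themes = {"affordable healthy food", "safe places to exercise", "peer support"}

interventions = [
    {"name": "school fruit scheme", "effect_size": 0.30,
     "addresses": {"affordable healthy food"}},
    {"name": "information leaflets", "effect_size": 0.05,
     "addresses": set()},
]

for trial in interventions:
    rooted = bool(trial["addresses"] & views_themes)
    label = "matches" if rooted else "does not match"
    print(trial["name"] + ": effect size " + format(trial["effect_size"], ".2f")
          + ", " + label + " needs expressed in views studies")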
Lessons about barriers and facilitators
The work of the EPPI-Centre has not been conducted in isolation from the work
of many other individuals and groups, all of which, particularly in the period since
the late 1990s, have made enormous strides in addressing the usefulness of social
science research to public policy. The story told in this paper is only part of a much
bigger picture, but certain lessons can be drawn from it.
Some key facilitators
Earlier eras during which evidence-informed policy has been on the agenda have
been marked by some explicit articulation of the need for evidence on the part of
policy makers. This was the case in the US, where policy-relevant social science was
prominent in the period from the 1940s to the 1970s, and included the evaluation
of many large-scale social experiments (Haveman, 1987; Oakley, 1998a). The general
principle was that of ‘systematic thinking for social action’ (Rivlin, 1971). Roosevelt’s
New Deal gave a formal place to social scientists in federal government, although
the term ‘evidence-informed (or evidence-based) policy’ was not explicitly used.
Accounts of barriers to effective utilisation written at the time included the inward-looking
focus of many social scientists, with research literature tending to be written
from a narrow disciplinary perspective and using much technical language, and
policy processes being “treated almost as incidental aspects of the activity” (Scott
and Shore, 1979, p x).
The same driver of government interest is clear in the more recent history of
evidence-informed policy in the UK described in this article. A policy demand for
reliable knowledge about effectiveness in a context of scarce resources was one
factor stimulating the early systematic reviews of health care. Others included public
concern about preventable death and disease, the fallibility of experts, and the chaotic
and possibly damaging use of health care resources in the specific fields of cancer,
heart disease and stroke, and maternity care. A similar concern about the implication
of scarce resources in the social policy field and the importance of problem-prevention
has been instrumental in encouraging social science resources to support evidence.
However, the move towards the systematic review and evaluation of social
interventions appears to have been made more by academic researchers, through a
network in which ideas about policy-relevant research synthesis and rigorous
evaluation have crossed over from one policy domain to others. This is partly a
process of reinventing history, as social scientists have unearthed their own forgotten
disciplinary history of research synthesis and social experimentation, but it is also a
matter of abandoning territoriality in a joint commitment to the goals of an evidence-informed
society, where evidence is democratically produced and owned.
Despite the initial all-embracing vision of the database project, the main funders
of the EPPI-Centre’s activities in social science research synthesis before 2000 were
in the health promotion arena, and their specific interest was in answering questions
about effectiveness. This reflected the growing demand for reviews of non-clinical
research relevant to health, especially as the focus of government promulgations
during the 1990s was increasingly on prevention as the key to better public health
(DH, 1998, 1999). Especially critical to the EPPI-Centre’s early funding was the
HIV/AIDS ‘epidemic’, which signalled the need for serious work evaluating ‘social
approaches’ to HIV prevention, since there were no known effective clinical
treatments at that time (Oakley and Darrow, 1998). However, the underlying
methodological point in the focus on health promotion is that health promotion is
an example of a social intervention, so the work involved in developing a health
promotion evidence-base also serves other areas of social intervention and policy,
including social welfare, crime and justice, education and transport.
One conclusion to be drawn from the story in this paper supports other case
studies in the sociology of science and institutional development in highlighting the
importance of interactional networks and ‘opinion leaders’ in generating and
sustaining the move towards academic resources for evidence-based policy. It takes
people (academics, practitioners, policy makers, service users, research funders) with
unswerving commitment and energy to promulgate ‘new’ ideas and to champion
these in the face of considerable resistance, including the resistance embedded in the
shared consensus of ‘experts’ or ‘tacit knowledge’ (Nutley et al, 2003). People learn
most from their friends and workmates; these are the interactional settings that
foster the lateral transfer of ideas. The early involvement of social scientists in evidence-based
health care was crucial in reminding social scientists that the field of social
policy is just as much in need of attempts to build on the cumulative lessons of well-conducted
research, and the public has the same need to be protected from the
adverse consequences of the well-intentioned but ill-founded efforts of practitioners
and policy makers to improve national well-being. People may rarely die directly as
the result of social interventions, but there is plenty of evidence of unanticipated
harm (see, for example, Christopher and Roosa, 1990; Petrosino et al, 2000; Achara
et al, 2001).
... and some barriers
A major barrier to the production of systematic evidence for policy lies in the
culture of academia and the funding practices of commissioners of research. Academic
culture supports the proliferation of small-scale research projects that are often
strikingly parochial and lack international vision and any sense of connectedness
with one another. This pattern is in many ways supported and enhanced by the
thinking behind the system current in the UK of funding universities through the
cycle of Research Assessment Exercises. An ethos of competition rather than
collaboration impedes the development of a cumulative knowledge base (DfES,
2002). A recent example in the education field is two reviews of the evidence
relating to the impact of summative assessment (formal testing leading to final grading
decisions) on students’ motivation for learning. Two reviews, published within two
years of one another (Harlen and Deakin Crick, 2002; Torrance and Coultas, 2004),
address essentially the same question using different methods and drawing on different
literatures. A systematic review (Harlen and Deakin Crick, 2002), focusing on students
aged 4-19, found eight previous reviews (one of these systematic) and included 19
primary studies, 12 of which were assessed as high quality. These suggested that
summative assessment impacts negatively on the self-esteem of low-achieving
students. A non-systematic review published two years later (Torrance and Coultas,
2004), focusing on post-16 learners, concluded that there was no material of direct
relevance to the review question, but that evidence suggests that learners across all
sectors prefer coursework and practical competence-oriented assessment to
summative models. Only one of the eight previous reviews and one of the 19 studies
included in the systematic review was referenced in the non-systematic review. The
difference in material included is partly explained by the different age focus. However,
Harlen and Deakin Crick (2002, pp 50-1) include an analysis of data by age in their
review, identifying evidence that the negative impact of summative assessment may
be greater in older low-achieving students.
Methodological ‘paradigm’ warfare and confused ideas about systematic reviews
inhibit the development of an accessible, systematic and cumulative evidence base
that can serve the interests of different research user groups. The major theme in
debates about social research and evidence-informed policy today concerns what is
seen as the unwarranted transfer from the medical to the public policy domain of a
‘positivistic’ model of knowledge dominated by a ‘hierarchy of evidence’, limited
questions about ‘what works’, outdated notions about the role of ‘procedural
objectivity’, and an unhelpfully dismissive view of practitioners’ ‘craft’ knowledge
(see, for example, Hammersley, 1997, 2001; Hargreaves, 1997; Atkinson, 2000; Ball,
2001; Elliott, 2001; Hulme, 2002). Like RCTs, with which they are often semantically
confused, systematic reviews may be seen as part of a trend towards ‘instrumentalism’
which puts “nails in the coffin of academic freedom” (Atkinson, 2000, p 318), and
the attempt to devise quality criteria, particularly for ‘qualitative’ studies, is seen as a
regulative mechanism endorsed by the state rather than an effort by the academic
community to put its methodological house in order (Torrance, 2004).
Although the methodological crossover from health care has been a factor aiding
the development of social science resources for evidence-informed policy, it has also
acted as an impediment. Most of the methodological tools which are critical to
today’s evidence movement, including systematic reviews, quality markers and robustly
designed experimental evaluations, are associated with the ‘quantitative’ paradigm of
medicine and not with ‘qualitative’ approaches to knowledge which are more
common in education or other areas of public policy. Educational researchers are
broadly more committed to resisting than defending ‘the Cochrane model’; they
draw more centrally on social science theories of postmodernism to support this
position (Oakley, 2001). Echoes of this in health promotion reflect the marginality
of the discipline in relation to health care (Oakley, 1998b), but have lessened over
time, as a consensus has built up that systematic reviews, suitably adapted for a social
science climate, can be a useful adjunct to primary studies.
Fuelling the rejection by some academics of research synthesis to support evidence-informed
policy is the emergence within social science of poststructuralist and
postmodernist positions, according to which social reality is permanently under
construction and therefore an irremediably fugitive subject for policy makers; science
itself is part of the problem (Beck, 1992). Bailey (2001) has termed this ‘veriphobia’
(the fear of truth), in the light of which all appeals to the need for objectivity and
the pursuit of truth are viewed as either naive or sinister, and the truth-seeking
character of inquiry is seen as the expression of a misguided approach inevitably
destined for failure. As Bailey points out, there is something ‘inescapably self-contradictory’
about veriphobia, since statements expressing contingent points of
view are inherently objective in intent, and it is impossible to make one’s way
effectively round the world while clinging to the notion that truth is simply the
consensus of particular social groups.
Research synthesis, as the main tool of evidence-informed policy, is a serious and
time-consuming activity, equivalent to primary research, but it is often seen as “second-class,
scientifically derivative work, unworthy of mention in reports and documents
intended to confirm the scientific credentials of individuals and institutions”
(Chalmers et al, 2002, p 21). Inadequacies in research training, particularly with
respect to quantitative and experimental methods among social scientists, support a
lack of capacity for carrying out systematic reviews. Systematic literature searches
and classification, keywording and data extraction for a review, call for a high level
of numeracy and attention to detail as well as general analytic and interpretative
skills. Since they may involve looking at all kinds of studies, systematic reviews
require expertise in research methods as diverse as in-depth qualitative interviews,
and the statistical meta-analysis of data from large-scale RCTs. However, in the UK
there are few, if any, opportunities in social science for a generic research training
that equips researchers with a broad set of methodological skills. Systematic research
synthesis, despite its long tradition in the social sciences, is poorly covered in most
research methods courses. These deficiencies may help to explain British social
scientists’ generally lower commitment to the tools of the evidence movement
compared with their colleagues in the US (Jowell, 2003). A feature of the earlier
period in the US of ‘systematic thinking for social action’ was the establishment in
universities of training programmes for evaluation and policy studies (Haveman,
1987).
Conclusion
The story told in this paper is one about the history and sociology of science and
methodology, but it is also part of what Solesbury (2002, p 91) has termed “longstanding
debates about knowledge and power, rationality and politics, democracy
and technocracy”. These debates take different forms in different disciplinary and
professional circles, but a central issue concerns the ownership of, and access to,
knowledge. Collating the results of research in free electronically held libraries of
systematic reviews is a significant move in the direction of a democratic knowledge
economy. The resistances to this move therefore have much to say about the role of
experts in the postmodern world, and about the ‘mystique of judgement’ that has
ruled much social science, and has been a major factor rendering it opaque as a tool
for policy makers. Authors of systematic reviews are called on to make their
judgements explicit: in the choice of question, the selection and appraisal of literature,
and the interpretation of the findings, and in their efforts to involve others in making
these judgements. Not coincidentally, the work described in this paper has crucially
depended on the ability to cross (or at least ignore) the traditional boundaries between
academic disciplines, and to view the production of policy-useful evidence as a
constantly evolving, rather than fixed, process. It is not only the culture of academia
that poses a barrier to the production of systematic research syntheses. Where policy
makers and practitioners often rely on the opinions of a small circle of professionals
rather than on published information about effectiveness, either because of their
attitudes to this type of knowledge, or lack of easy access, there is little demand for
systematic research syntheses (Shadish and Epstein, 1987; Bonell, 1996; Oliver et al,
2001).
In her autobiography, In a world I never made, published in 1967, the distinguished
British social scientist Barbara Wootton discussed contemporary failures in the ‘science’
component of social science. She argued (1967, pp 216-17) that:
The potential significance of the ... adoption of a scientific attitude to social questions
is ... incalculable. For the developments that we are now witnessing may well prove to
be the harbingers of a massive invasion of science into areas previously ruled by hunch
and guess-work. Inevitably these invaders, with their computers and their calculations,
threaten our sentimental attachment to the conventional wisdom and to the mystique
of judgment, no less than the new factories of an earlier century threatened the
handloom weavers.
Like the machinery that upset the handloom weavers, the new technologies of
social science evidence will probably be with us for some time. Their fate is uncertain.
Important questions concern the willingness of social science funders to invest in
what continues to be viewed as a risky enterprise, the readiness of social scientists to
confront deficiencies in the conduct, reporting and classification of social science
research, and the commitment of policy makers to pursue meaningful notions of
evidence-informed policy in the face of apparently viable and easier alternatives.
Acknowledgement
Thanks to Iain Chalmers and to the journal’s anonymous referees for their helpful
comments and suggestions on the draft of this article.
References
Abrams, P. (1968) The origins of British sociology 1834-1914: An essay, Chicago, IL: Chicago
University Press.
Achara, S., Adeyemi, B., Dosekun, E., Kelleher, S., Lansley, M., Male, I., Muhialdin, N.,
Reynolds, L., Roberts, I., Smailbegovic, M. and van der Spek, N. (2001) ‘Evidence-based
road safety: the Driving Standards Agency’s schools programme’, Lancet, vol 358,
no 9277, pp 230-2.
Antman, E., Lau, J., Kupelnick, B., Mosteller, F. and Chalmers, T. (1992) ‘A comparison of
results of meta-analyses of randomised controlled trials and recommendations of
clinical experts: treatments for myocardial infarction’, Journal of the American Medical
Association, vol 268, no 2, pp 240-8.
Atkinson, E. (2000) ‘In defence of ideas, or why “what works” is not enough’, British
Journal of Sociology of Education, vol 21, no 3, pp 317-30.
Badger, D., Nursten, J., Williams, P. and Woodward, M. (2000) ‘Should all literature
reviews be systematic?’, Evaluation and Research in Education, vol 14, nos 3-4, pp 220-30.
Bailey, R. (2001) ‘Overcoming veriphobia – learning to love truth again’, British Journal
of Educational Studies, vol 49, no 2, pp 159-72.
Ball, S.J. (2001) ‘“You’ve been NERFed!” Dumbing down the academy. National
Educational Research Forum: “A national strategy – consultation paper”: a brief and
bilious response’, Journal of Education Policy, vol 16, no 3, pp 265-8.
Beck, U. (1992) Risk society: Towards a new modernity, London: Sage Publications.
Blume, S. (1979) ‘Policy studies and social policy in Britain’, Journal of Social Policy, vol 8,
no 3, pp 311-34.
Blunkett, D. (2000) ‘Influence or irrelevance: can social science improve government?’,
Research Intelligence, no 71, pp 12-21 (Speech given at ESRC, Swindon, 25 February,
www.bera.ac.uk/ri/no71/index.html).
Bonell, C. (1996) Outcomes in HIV prevention: Report of a research project, London: HIV
Project.
Brunton, G., Harden, A., Rees, R., Kavanagh, J., Oliver, S. and Oakley, A. (2003) Children
and physical activity: A systematic review of barriers and facilitators, University of London:
EPPI-Centre, Social Science Research Unit, Institute of Education
(www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?&page=/hp/reviews.htm).
Bulmer, M. (1986) Social science and social policy, London: Allen and Unwin.
Campbell, D.T. (1988) ‘The experimenting society’, in D.T. Campbell and E.S. Overman
(eds) Methodology and epistemology for social science: Selected papers (1st edn 1971),
Chicago, IL: Chicago University Press, pp 290-314.
CERI (Centre for Educational Research and Innovation) (2000) Knowledge management
in the learning society, Paris: OECD.
Chalmers, I. (1991a) ‘The work of the National Perinatal Epidemiology Unit: one
example of technology assessment in perinatal care’, International Journal of Technology
Assessment in Health Care, vol 7, no 4, pp 430-59.
Chalmers, I. (ed) (1991b) Oxford database of perinatal trials, version 1.2 (Disk issue 6),
Oxford: Oxford University Press.
Chalmers, I. (2003) ‘Trying to do more good than harm in policy and practice: the role
of rigorous, transparent, up-to-date evaluations’, Annals of the American Academy of
Political and Social Science, vol 589, pp 22-40.
Chalmers, I. and Altman, D. (eds) (1995) Systematic reviews, London: BMJ Books.
Chalmers, I., Enkin, M. and Keirse, M.J.N.C. (eds) (1989) Effective care in pregnancy and
childbirth, Oxford: Oxford University Press.
Chalmers, I., Hedges, L. and Cooper, H. (2002) ‘A brief history of research synthesis’,
Evaluation and the Health Professions, vol 25, no 9, pp 12-37.
Christopher, F.S. and Roosa, M.W. (1990) ‘An evaluation of an adolescent pregnancy
prevention program: is “Just say no” enough?’, Family Relations, vol 39, no 1, pp 68-72.
Clarke, M. and Langhorne, P. (2001) ‘Editorial: revisiting the Cochrane Collaboration:
meeting the challenge of Archie Cochrane – and facing up to some new ones’, British
Medical Journal, vol 323, no 7317, p 821.
Cochrane, A.L. (1972) Effectiveness and efficiency. Random reflections on health services,
London: Nuffield Provincial Hospitals Trust.
Commission on the Social Sciences (2003) Great expectations: The social sciences in Britain,
London: Commission on the Social Sciences (www.the-academy.org.uk).
Cooper, H. and Hedges, L. (eds) (1994a) The handbook of research synthesis, New York, NY:
Russell Sage Foundation.
Cooper, H. and Hedges, L. (1994b) ‘Potentials and limitations of research synthesis’, in H.
Cooper and L. Hedges (eds) The handbook of research synthesis, New York, NY: Russell
Sage Foundation, pp 521-9.
Cordingley, P., Bell, M., Rundell, B. and Evans, D. (2003) The impact of collaborative CPD
on classroom teaching and learning (EPPI-Centre Review), London: EPPI-Centre, Social
Science Research Unit, Institute of Education, University of London
(www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/reviews.htm).
Davies, H.T.O. and Nutley, S. (2002) Evidence-based policy and practice: Moving from rhetoric
to reality (Discussion Paper 2), St Andrews: Research Unit for Research Utilisation,
Department of Management, University of St Andrews (www.st-andrews.ac.uk/
~ruru/publications.htm).
Davies, H.T.O., Nutley, S. and Smith, P. (2000a) ‘Introducing evidence-based policy and
practice in public services’, in H.T.O. Davies, S.M. Nutley and P.C. Smith (eds) What
works? Evidence-based policy and practice in public services, Bristol: The Policy Press, pp 1-11.
Davies, H.T.O., Nutley, S.M. and Smith, P.C. (eds) (2000b) What works? Evidence-based
policy and practice in public services, Bristol: The Policy Press.
Davies, P. (2000) ‘The relevance of systematic reviews to educational policy and practice’,
Oxford Review of Education, vol 26, nos 3-4, pp 365-78.
DfES (Department for Education and Skills) (2002) Educational research and development in
England: Background report (CERI/CD(2002)11), Paris: OECD (www.oecd.org/
dataoecd/17/58/1837582.pdf).
DH (Department of Health) (1998) Our healthier nation. A contract for health: A consultation
paper (Cm 3852), London: The Stationery Office.
DH (1999) Saving lives: Our healthier nation (Cm 4386), London: The Stationery Office.
Elliott, J. (2001) ‘Making evidence-based practice educational’, British Educational Research
Journal, vol 27, no 5, pp 555-74.
Enkin, M., Keirse, M.J.N.C. and Chalmers, I. (1989) A guide to effective care in pregnancy
and childbirth, Oxford: Oxford University Press.
EPPI-Centre (2002) Core keywording strategy for classifying educational research (Version
0.9.7), London: EPPI-Centre, Social Science Research Unit, Institute of Education,
University of London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?&page=/reel/
tools.htm).
Finch, J. (1986) Research and policy: The use of qualitative methods in social and educational
research, London: Falmer Press.
Fletcher, M. and Lockhart, I. (2003) The impact of financial circumstances on engagement with
post-16 learning: A systematic map of research (EPPI-Centre Review), London: EPPI-Centre,
Social Science Research Unit, Institute of Education, University of London
(www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?&page=/reel/reviews.htm).
France-Dawson, M., Holland, J., Fullerton, D., Kelley, P., Arnold, S. and Oakley, A. (1994)
Review of effectiveness of workplace health promotion interventions (Final report for the
Health Education Authority), London: Social Science Research Unit, Institute of
Education, University of London.
Glass, G.V., McGaw, B. and Smith, M.L. (1981) Meta-analysis in social research, Beverly
Hills, CA: Sage Publications.
Gough, D. (1993) Child abuse interventions: A review of the research literature, London:
HMSO.
Gough, D. (2004) ‘Systematic research synthesis’, in G. Thomas and R. Pring (eds)
Evidence-based practice in education, Buckingham: Open University Press, pp 44-64.
Gough, D.A., Kiwan, D., Sutcliffe, K., Simpson, D. and Houghton, N. (2003) A systematic
map and synthesis review of the effectiveness of personal development planning for improving
student learning, London: EPPI-Centre, Social Science Research Unit, Institute of
Education, University of London (www.eppi.ioe.ac.uk/EPPIWeb/
home.aspx?&page=/reel/reviews.htm).
Gough, D. A., Oliver, S., Brunton, G., Selai, C. and Schaumberg, H. (2001) The effect of
travel modes on children’s mental health, cognitive and social development: A systematic review
(Final report for Department of the Environment, Transport and the Regions),
London: EPPI-Centre, Social Science Research Unit, Institute of Education,
University of London.
Hammersley, M. (1995) The politics of social research, London: Sage Publications.
Hammersley, M. (1997) ‘Educational research and teaching: a response to David
Hargreaves’ TTA lecture’, British Educational Research Journal, vol 23, no 2, pp 141-61.
Hammersley, M. (2001) ‘On systematic reviews of research literatures: a “narrative”
response to Evans and Benefield’, British Educational Research Journal, vol 27, no 5, pp
543-54.
Harden, A., Peersman, G., Oliver, S. and Oakley, A. (1999) ‘Identifying primary research
on electronic databases to inform decision-making in health promotion: the case of
sexual health promotion’, Health Education Journal, vol 58, no 3, pp 290-301.
Harden, A., Rees, R., Shepherd, J., Brunton, G., Oliver, S. and Oakley, A. (2001) Young
people and mental health: A systematic review of research on barriers and facilitators, London:
EPPI-Centre, Social Science Research Unit, Institute of Education, University of
London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/hp/reviews.htm).
Hargreaves, D. (1996) Teaching as a research-based profession: Possibilities and prospects (TTA
Annual Lecture), London: Teacher Training Agency.
Hargreaves, D. (1997) ‘In defence of research for evidence-based teaching: a rejoinder to
Martyn Hammersley’, British Educational Research Journal, vol 23, no 4, pp 405-19.
Harlen, W. and Deakin Crick, R. (2002) A systematic review of the impact of summative
assessment and tests on students’ motivation for learning (EPPI-Centre Review, Version
1.1), London: EPPI-Centre, Social Science Research Unit, Institute of Education,
University of London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?&page=/reel/
reviews.htm).
Haveman, R.H. (1987) ‘Policy analysis and evaluation research after twenty years’, Policy
Studies Journal, vol 16, no 2, pp 191-218.
Health Committee (1992) Maternity services. Volume 1: Report – Second report, session 1991-
92 (House of Commons Paper 1991-92 29-I), London: HMSO.
Heinemann, R.A., Bluhm, W.T., Peterson, S.A. and Kearney, E.D. (1990) The world of the
policy analyst: Rationality, values and politics, Chatham, NJ: Chatham House.
Hillage, J., Pearson, R., Anderson, A. and Tamkin, P. (1998) Excellence in research on schools
(Research Report 74), London: Department for Education and Employment.
Holland, J., Arnold, S., Fullerton, D. and Oakley, A. with Hart, G. (1994) Review of
effectiveness of health promotion interventions for men who have sex with men (Report for
the Health Education Authority), London: Social Science Research Unit, University
of London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/hp/reviews.htm).
Hulme, B. (2002) ‘Policy transfer in evidence-based practice in education’, British
Educational Research Association Annual Conference (Abstract), University of Exeter,
12-14 September.
Hunt, M. (1997) How science takes stock: The story of meta-analysis, New York, NY: Russell
Sage Foundation.
Jackson, G.B. (1980) ‘Methods for integrative reviews’, Review of Educational Research, vol
50, no 3, pp 438-60.
Janowitz, M. (1972) Sociological models and social policy, Morristown, NJ: General Learning
Systems.
Jowell, R. (2003) Trying it out. The role of ‘pilots’ in policy-making: Report of a review of
government pilots, London: Cabinet Office Strategy Unit (www.policyhub.gov.uk/
evalpolicy/gcsro_papers.asp).
Light, R.J. and Pillemer, D. (1984) Summing up: The science of reviewing research, Cambridge,
MA: Harvard University Press.
McCall, W.A. (1923) How to experiment in education, New York, NY: Macmillan.
McIntyre, D. and McIntyre, A. (1999) Capacity for research into teaching and learning: Final
report (Final Report to ESRC Teaching and Learning Programme), Cambridge:
School of Education, University of Cambridge (www.tlrp.org/pub/acadpub.html).
MacLehose, R.R., Reeves, B.C., Harvey, I.M., Sheldon, T.A., Russell, I.T.Y. and Black,
A.M.S. (2000) ‘A systematic review of comparisons of effect sizes derived from
randomised and non-randomised studies’, Health Technology Assessment, vol 4, no 34, pp
1-154.
Moynihan, R. (2004) Evaluating health services: A reporter covers the science of research
synthesis, New York, NY: Milbank Memorial Fund (www.milbank.org/reports/
reportstest.html).
NERF (National Education Research Forum) (2000) Research and development for
education: A national strategy consultation paper, Nottingham: NERF Publications.
Nutley, S., Walter, I. and Davies, H.T.O. (2003) ‘From knowing to doing: a framework for
understanding the evidence-into-practice agenda’, Evaluation, vol 9, no 2, pp 125-48.
Nutley, S. and Webb, J. (2000) ‘Evidence and the policy process’, in H.T.O. Davies, S.M.
Nutley and P.C. Smith (eds) What works? Evidence-based policy and practice in public
services, Bristol: The Policy Press, pp 13-41.
Oakley, A. (1998a) ‘Public policy experimentation: lessons from America’, Policy Studies,
vol 19, no 2, pp 93-114.
Oakley, A. (1998b) ‘Experimentation in social science: the case of health promotion’,
Social Sciences in Health, vol 4, no 2, pp 73-89.
Oakley, A. (2000) Experiments in knowing: Gender and method in the social sciences,
Cambridge: Polity Press.
Oakley, A. (2001) ‘Making evidence-based practice educational: a rejoinder to John
Elliott’, British Educational Research Journal, vol 27, no 5, pp 575-6.
Oakley, A. (2003) ‘Research evidence, knowledge management and educational practice:
early lessons from a systematic approach’, London Review of Education, vol 1, no 1, pp
21-33.
Oakley, A. and Darrow, W.W. (1998) ‘AIDS 1998: social, cultural and political aspects:
overview’, AIDS, vol 12, Supplement A, pp S189-90.
Oakley, A. and Fullerton, D. (1986) ‘The lamp-post of research: support or illumination?’,
in A. Oakley and H. Roberts (eds) (1996) Evaluating social interventions: A report of two
workshops funded by the Economic and Social Research Council, Ilford: Barnardos, pp 4-38.
Oakley, A. and Fullerton, D. (1994) Risk, knowledge and behaviour: HIV/AIDS education
programmes and young people (Report for North Thames Regional Health Authority),
London: Social Science Research Unit, Institute of Education, University of London.
Oakley, A. and Fullerton, D. (1995) Young people and smoking (Report for North Thames
Regional Health Authority), London: EPI-Centre, Social Science Research Unit,
Institute of Education, University of London.
Oakley, A., Fullerton, D. and Holland, J. (1995a) ‘Behavioural interventions for HIV/
AIDS prevention’, AIDS, vol 9, no 5, pp 479-86.
Oakley, A., Fullerton, D., Holland, J., Arnold, S., France-Dawson, M., Kelley, P. and
McGrellis, S. (1995b) ‘Sexual health interventions for young people: a methodological
review’, British Medical Journal, vol 310, no 6973, pp 158-62.
Oakley, A., Peersman, G. and Oliver, S. (1998) ‘Social characteristics of participants in
health promotion effectiveness research: trial and error?’, Education for Health, vol 11,
no 3, pp 305-17.
Oliver, S. (2001) ‘Making research more useful: integrating different perspectives and
different methods’, in S. Oliver and G. Peersman (eds) Using research for effective health
promotion, Buckingham: Open University Press, pp 167-79.
Oliver, S., Nicholas, A. and Oakley, A. (1996) Promoting health after sifting the evidence
(PHASE) – Workshop report, London: EPPI-Centre, Social Science Research Unit,
Institute of Education, University of London (www.eppi.ioe.ac.uk/EPPIWeb/
home.aspx?&page=/hp/reviews.htm).
Oliver, S. and Peersman, G. (eds) (2001) Using research for effective health promotion,
Buckingham: Open University Press.
Oliver, S., Peersman, G., Harden, A. and Oakley, A. (1999) ‘Discrepancies in findings from
effectiveness reviews: the case of health promotion for older people in accident and
injury prevention’, Health Education Journal, vol 58, no 1, pp 66-77.
Oliver, S., Peersman, G., Oakley, A. and Nicholas, A. (2001) ‘Using research: challenges in
evidence-informed service planning’, in S. Oliver and G. Peersman (eds) Using research
for effective health promotion, Buckingham: Open University Press, pp 157-66.
Oughton, J. (2002) ‘Evidence-based defence’, Public Management Policy Association Review,
vol 9, pp 12-13.
Peersman, G. (1996) A descriptive mapping of health promotion studies in young people,
London: EPI-Centre, Social Science Research Unit, Institute of Education, University
of London.
Peersman, G., Harden, A., Oliver, S. and Oakley, A. (1999a) Effectiveness reviews in health
promotion, London: EPI-Centre, Social Science Research Unit, Institute of Education,
University of London.
Peersman, G., Harden, A., Oliver, A. and Oakley, A. (1999b) ‘Discrepancies in findings
from effectiveness reviews: the case of health promotion interventions to change
cholesterol levels’, Health Education Journal, vol 58, no 2, pp 192-202.
Peersman, G. and Oliver, S. (1997) EPPI-Centre keywording strategy, London: EPPI-Centre,
Institute of Education, University of London.
Petrosino, A., Turpin-Petrosino, C. and Finckenauer, J.O. (2000) ‘Well-meaning programs
can have harmful effects! Lessons from experiments such as Scared Straight’, Crime
and Delinquency, vol 46, no 3, pp 354-79.
Rees, R., Harden, A., Shepherd, J., Brunton, G., Oliver, S. and Oakley, A. (2001) Young
people and physical activity: A systematic review of research on barriers and facilitators,
London: EPPI-Centre, Social Science Research Unit, Institute of Education,
University of London (eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/hp/reviews.htm).
Reviews of Behavioural Research within the Cochrane Collaboration (1994) Newsletter,
August.
Rivlin, A.M. (1971) Systematic thinking for social action, Washington, DC: The Brookings
Institution.
Rosenberg, W. and Donald, A. (1995) ‘Evidence-based medicine: an approach to clinical
problem solving’, British Medical Journal, vol 310, no 6987, pp 1122-6.
Rossi, P.H. and Williams, W. (1972) Evaluating social programs: Theory, practice and politics,
New York, NY: Seminar Press.
Schulz, K. F., Chalmers, I., Hayes, R. J. and Altman, D. (1995) ‘Empirical evidence of bias:
dimensions of methodological quality associated with estimates of treatment effects
in controlled trials,’ Journal of the American Medical Association, vol 273, no 5, pp 408-
12.
Scott, R.A. and Shore, A.R. (1979) Why sociology does not apply: A study of the use of
sociology in public policy, New York, NY: Elsevier.
Shadish, W.R. and Epstein, R. (1987) ‘Patterns of program evaluation practice among
members of the Evaluation Research Society and Evaluation Network’, Evaluation
Review, vol 11, pp 555-90.
Sheldon, T. (1998) ‘An evidence-based resource in the social sciences. Report of a
scoping study undertaken on behalf of the ESRC’, Unpublished report, York: Institute
for Research in the Social Sciences, University of York.
Shepherd, J., Harden, A., Rees, R., Brunton, G., Garcia, J., Oliver, S. and Oakley, A.
(2001) Young people and healthy eating: A systematic review of research on barriers and
facilitators, London: EPPI-Centre, Social Science Research Unit, Institute of
Education, University of London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/
hp/reviews.htm).
Shepherd, J., Weston, R., Peersman, G. and Napuli, I. Z. (2004) ‘Interventions for
encouraging sexual lifestyles and behaviours intended to prevent cervical cancer’
(Cochrane Review), in The Cochrane Library, issue 2, Chichester: John Wiley & Sons
Ltd.
Skidelsky, R. (1992) John Maynard Keynes: A biography. Vol 2: The economist as saviour, 1920-
1937, London: Macmillan.
Smith, A.F. (1996) ‘Mad cows and ecstasy: chance and choice in an evidence-based
society’, Journal of the Royal Statistical Society A, vol 159, no 3, pp 367-83.
Solesbury, W. (2002) ‘The ascendancy of evidence’, Planning Theory and Practice, vol 3, no
1, pp 90-6.
Stocking, B. (1993) ‘Implementing the findings of Effective Care in Pregnancy and
Childbirth in the United Kingdom’, Milbank Quarterly, vol 71, no 3, pp 497-522.
Strategic Policy Making Team (1999) Professional policy making for the twenty first century,
London: Cabinet Office (www.policyhub.gov.uk/bpmaking/index.asp).
Thomas, G. and Pring, R. (eds) (2004) Evidence-based practice in education, Maidenhead:
Open University Press.
Thomas, J., Harden, A., Oakley, A., Oliver, S., Sutcliffe, K., Rees, R., Brunton, G. and
Kavanagh, J. (2004) ‘Integrating qualitative research with trials in systematic reviews’,
British Medical Journal, vol 328, no 7446, pp 1010-12.
Thomas, J., Sutcliffe, K., Harden, A., Oakley, A., Oliver, S., Rees, R., Brunton, G. and
Kavanagh, J. (2003) Children and healthy eating: A systematic review of barriers and
facilitators, London: EPPI-Centre, Social Science Research Unit, Institute of
Education, University of London (www.eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/
hp/reviews.htm).
Tooley, J. and Darby, D. (1998) Educational research: A critique. A survey of published
educational research, London: Office for Standards in Education.
Torrance, H. (2004) ‘Quality in qualitative evaluation – a (very) critical response’,
Building Research Capacity, May, pp 8-10.
Torrance, H. and Coultas, J. (2004) Do summative assessment and testing have a positive or
negative effect on post-16 learners’ motivation for learning in the learning and skills sector? A
review of the research literature on the assessment of post-compulsory education in the UK,
London: Learning and Skills Research Centre (www.lsrc.ac.uk/publications/
index.asp).
Weiss, C. with Bucuvalas, M. (1980) Social science research and decision-making, New York,
NY: Columbia University Press.
White, D.G. (2001) ‘Evaluating evidence and making judgements of study quality: loss of
evidence and risks to policy and practice decisions’, Critical Public Health, vol 11, no 1,
pp 3-17.
Wilson, B., Thornton, J.G., Hewison, J., Lilford, R.J., Watt, I., Braunholtz, D. and
Robinson, M. (2002) ‘The Leeds University maternity audit project’, International
Journal for Quality in Health Care, vol 14, no 3, pp 175-81.
Wootton, B. (1967) In a world I never made: Autobiographical reflections, London: Allen and
Unwin.
Young, K., Ashby, D., Boaz, A. and Grayson, L. (2002) ‘Social science and the evidence-based
policy movement’, Social Policy and Society, vol 1, no 3, pp 215-24.

















Realist Synthesis: Supplementary reading 2:

‘In search of a method’
Evidence Based Policy:
I. In Search of a Method

Ray Pawson

ESRC Centre for Evidence Based Policy
Queen Mary
University of London


















Submitted to Evaluation: May 2001. Revised October 2001.

Evidence Based Policy:

I. In Search of a Method


Abstract


Evaluation research is tortured by time constraints. The policy cycle revolves more quickly than the research cycle, with the result that ‘real time’ evaluations often have little influence on policy making. As a result, the quest for Evidence Based Policy (EBP) has turned increasingly to systematic reviews of the results of previous inquiries in the relevant policy domain. However, this shifting of the temporal frame for evaluation is in itself no guarantee of success. Evidence, whether new or old, never speaks for itself. Accordingly, there is debate about the best strategy of marshalling bygone research results into the policy process. This paper joins the imbroglio by examining the logic of the two main strategies of systematic review, namely ‘meta-analysis’ and ‘narrative review’. Whilst they are often presented as diametrically opposed perspectives, this paper argues that they share common limitations in their understanding of how to provide a template for impending policy decisions. This review provides the background for Part II of the paper (to be published in the next edition, Evaluation xxxx), which considers the merits of a new model for EBP, namely ‘realist synthesis’.

Key words: Evidence-Based Policy, Methodology, Systematic Review, Meta-analysis, Narrative Review, Realism
Introduction

My topic here is ‘learning from the past’ and the contribution that empirical research can make to that cause. Whether policy-makers actually learn from or simply repeat past mistakes is something of a moot point. What is clear is that policy research has much to gain by following the sequence whereby social interventions are mounted in trying, trying, and then in trying again to tackle the stubborn problems that confront modern society. This is the raison d’être behind the current explosion of interest in evidence based policy (EBP). Few major public initiatives these days are mounted without a sustained attempt to evaluate them. Rival policy ideas are thus run through endless trials with plenty of error and it is often difficult to know which of them really have withstood the test of time. It is arguable, therefore, that the prime function of evaluation research should be to take the longer view. By building a systematic evidence base that captures the ebb and flow of programme ideas we might be able to adjudicate between contending policy claims and so capture a progressive understanding of ‘what works’. Such ‘resolutions of knowledge disputes’, to use Donald Campbell’s (1974) phrase, are the aim of the research strategies variously known as ‘meta-analysis’, ‘review’ and ‘synthesis’.

Standing between this proud objective and its actual accomplishment lies the minefield of social science methodology. Evaluation research has come to understand that there is no one ‘gold standard’ method for evaluating single social programmes. My starting assumption is that exactly the same maxim will come to apply to the more global ambitions of systematic review. It seems highly likely that EBP will evolve a variety of methodological strategies (Davies, 2000). This paper seeks to extend that range by offering a critique of the existing orthodoxy, showing that the lessons currently learnt from the past are somewhat limited. These omissions are then gathered together in a second part of the paper to form the objectives of a new model of EBP, which I call ‘realist synthesis’.

Section 1.1 of the present paper offers a brief recapitulation of the overwhelming case for the evaluation movement to dwell on and ponder evidence from previous research.

Sections 1.2 and 1.3 consider the main current methods for doing so, which I split into two perspectives – namely, ‘numerical meta-analysis’ and ‘narrative review’. A detailed critique of these extant approaches is offered. The two perspectives operate with different philosophies and methods and are often presented in counterpoint. There is a heavy whiff of the ‘paradigm wars’ in the literature surrounding them and much mud slinging on behalf of the ‘quantitative’ versus the ‘qualitative’, ‘positivism’ versus ‘phenomenology’, ‘outcomes’ versus ‘process’ and so on. This paper eschews any interest in these old skirmishes and my critique is, in fact, aimed at a common ambition of the two approaches. In their different ways, both aim to review a ‘family of programmes’ with the goal of selecting out the ‘best buys’ to be developed in future policy-making. I want to argue that, despite the arithmetic clarity and narrative plenitude of their analysis, they do not achieve decisive results. The former approach makes no effort to understand how programmes work and so any generalisations issued are insensitive to differences in programme theories, subjects and circumstances that will crop up in future applications of the favoured interventions. The latter approach is much more attuned to the contingent conditions that make for programme success but has no formal method of abstracting beyond these specifics of time and place, and so is weak in its ability to deliver transferable lessons. Both approaches thus struggle to offer congruent advice to the policy architect, who will always be looking to develop new twists to a body of initiatives, as well as seeking their application in fresh fields and to different populations.

1.1 ‘Meta’ is better


Whence EBP? The case for using systematic review in policy research rests on a stunningly obvious point about the timing of research vis-à-vis policy – namely, that in order to inform policy, the research must come before the policy. To figure this out does not require an ESRC-approved methodological training, nor a chair in public policy, nor years of experience in Whitehall. Yet, curiously, this proposition does not correspond to the sequence employed in the procurement of most evaluation research. I depict the running order of a standard piece of evaluation (if there be such a thing) in the upper portion of figure one.

FIGURE ONE ABOUT HERE

The most significant manifestation of public policy in the modern era is the ‘programme’, the ‘intervention’, or the ‘initiative’. The gestation of such schemes follows the familiar design-implementation-impact sequence, each phase being associated with a different group of stakeholders, namely programme architects, practitioners and participants. As a rule of thumb, one can say that evaluation researchers are normally invited to enter the fray in the early phases of programme implementation. Another iron law of research timing is that the researchers are usually asked to report on programme impact somewhat before the intervention has run its course. Evaluations, therefore, operate with a rather narrow bandwidth and the standard ‘letting and reporting’ interval is depicted in figure one.

Much ink and vast amounts of frustration have flowed in response to this sequencing. There is no need for me to repeat all the epithets about ‘quick and dirty’, ‘breathless’ and ‘brownie point’ evaluations. The key point I wish to underscore here is that, under the traditional running order, programme design is often a research-free zone. Furthermore, even if an evaluation manages to be ‘painstaking and clean’ under present conditions, it is still often difficult to translate the research results into policy action. The surrounding realpolitik means that, within the duration of an evaluation, the direction of the political wind may well change, with the fundamental programme philosophy being (temporarily) discredited, and thus not deemed worthy of further research funding. Moreover, given the turnover in and the career ambitions of policy-makers and practitioners, there is always a burgeoning new wave of programme ideas waiting their turn for development and evaluation. Under such a regime, we never get to a Campbellian ‘resolution of knowledge disputes’ because there is rarely a complete revolution of the ‘policy-into-research-into-policy’ cycle.

Such is the case for the prosecution. Let us now turn to a key manoeuvre in the defence of using the evidence-base in programme design. The remedy often suggested for all this misplaced, misspent research effort is to put research in its appropriate station (at the end of the line) and to push many more scholars back where they belong (in the library). This strategy is illustrated in the lower portion of figure one. The information flow therein draws out the basic logic of systematic review, which takes as its starting point the idea that there is nothing entirely new in the world of policy-making and programme-architecture. In the era of global social policy, international programmes and cross-continental evaluation societies, one can find few policy initiatives that have not been tried and tried again, and researched and researched again. Thus, the argument goes, if we begin inquiry at that point where many similar programmes have run their course and the ink has well and truly dried on all of the research reports thereupon, we may then be in a better position to offer evidence-based wisdom on what works and what does not.

On this model, the key driver of research application is thus the feedback loop (see fig one) from past to present programming. To be sure, this bygone evidence might not quite correspond to any current intervention. But since policy initiatives are by nature mutable and bend according to the local circumstances of implementation, even real-time research (as we have just noted) has trouble keeping pace.

Like all of the best ideas, the big idea here is a simple one - that research should attempt to pass on collective wisdom about the successes and failures of previous initiatives in particular policy domains. The prize is also a big one in that such an endeavour could provide the antidote to policy-making’s frequent lapses into crowd-pleasing, political-pandering, window-dressing and god-acting. I should add that the apparatus for carrying out systematic reviews is also by now a big one, the recent mushrooming of national centres and international consortia for EBP being the biggest single change on the applied research horizon for many a year1.

Our scene is thus set. I turn now to an examination of the methodology of EBP and enter a field with rather a lot of aliases – meta-analysis, narrative review, research synthesis, evaluation synthesis, overview, systematic review, pooling, structured review and so on. A complete review of review methodology is impossible other than at book length, so I restrict my attention here entirely to the analytical matter of how the evidence is synthesised. The present paper thus pays no heed to the vital issue of the quality of the primary research materials, nor to the way that the evidence, once assembled, may find its way into use. And as for modes of analysis, I again simplify from the menu of strategies available in order to contrast two approaches, one of which sees synthesis as a statistical task, while the other assumes that knowledge cumulates in narrative.

1.2 Numerical meta-analysis



The numerical strategy of EBP, often just referred to as ‘meta-analysis’, is based on a three-step ‘classify’, ‘tally’ and ‘compare’ model. The basic locus of analysis is with a particular ‘family of programmes’ targeted at a specific problem. Attention is thus narrowed to the initiatives developed within a particular policy domain (be it ‘HIV/AIDS-prevention schemes’ or ‘road-safety interventions’ or ‘neighbourhood-watch initiatives’ or ‘mental health programmes’ or whatever). The analysis begins with the identification of sub-types of the family, with the classification based normally on alternative ‘modes of delivery’ of that programme. Since most policy-making is beset with rival assertions on the best means to particular ends, meta-evaluation promises a method of sorting out these contending claims. This is accomplished by compiling a database of existing research on the programmes making up each sub-type, and scrutinising each case for a measure of its impact (net effect). The overall comparison is made by calculating the typical impact (mean effect) achieved by each of the sub-types within the overall family. This strategy thus provides a league table of effectiveness and a straightforward measure of programme ‘best buy’. By following these steps, and appending sensible caveats about future cases not sharing all of the features of the work under inspection, the idea is to give policy-architects some useful pointers to the more promising areas for future development.
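To make the ‘classify’, ‘tally’ and ‘compare’ logic concrete, here is a minimal sketch in Python. The programme labels and effect sizes are entirely hypothetical (they are not Durlak and Wells’ data); the point is only to show the arithmetic that produces a league table of ‘best buys’.

    # Minimal sketch of the 'classify', 'tally', 'compare' logic of numerical
    # meta-analysis. All programme labels and effect sizes are hypothetical.
    from statistics import mean

    # classify: each study's effect size is filed under its programme sub-type
    studies = [
        ("environment-centred", 0.35), ("environment-centred", 0.41),
        ("person-centred", 0.22), ("person-centred", 0.30),
        ("parenting", 0.10), ("parenting", 0.18),
    ]

    # tally: group the effect sizes by sub-type
    by_subtype = {}
    for subtype, effect_size in studies:
        by_subtype.setdefault(subtype, []).append(effect_size)

    # compare: mean effect per sub-type, sorted into a 'league table'
    league_table = sorted(
        ((subtype, mean(sizes)) for subtype, sizes in by_subtype.items()),
        key=lambda row: row[1], reverse=True,
    )
    for subtype, mean_effect in league_table:
        print(f"{subtype:22s} mean effect = {mean_effect:.2f}")

The averaging here is deliberately crude; the critique that follows is precisely that everything of explanatory interest disappears into those per-sub-type means.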


A typical illustration of numerical strategy is provided in Table 1, which comes from Durlak and Wells’ (1997) meta-analysis of 177 Primary Prevention Mental Health (PPMH) programmes for children and adolescents carried out in the US. According to the authors, PPMH programmes ‘may be defined as an intervention intentionally designed to reduce incidence of adjustment problems in currently normal populations as well as efforts directed at the promotion of mental health functioning’. This broad aim of enhancing psychological well-being by promoting the capacity to cope with potential problems can be implemented through a variety of interventions, which were classified by the researchers as in column one of table one. First of all there is the distinction between person-centred programmes (which use counselling, social learning and instructional approaches) and environment-centred schemes (modifying school or home conditions to prepare children for life change). Then there are more specific interventions targeted at ‘milestones’ (such as the transitions involved in parental divorce or teenage pregnancy and so on). A range of further distinctions may also be discerned in terms of programmes with specific target groups, such as those focused on specific ‘developmental levels’ (usually assigned by ‘age ranges’) or aimed at ‘high risk’ groups (assigned, for instance, as children from ‘low-income’ homes or with ‘alcoholic parents’ etc.).


TABLE 1 ABOUT HERE


A standard search procedure (details in Durlak and Wells, p.120) was then carried out to find research on cases falling under these categories, resulting in the tally of initiatives in column two of the table. For each study the various outcomes assessing the change in the child/adolescent subjects were unearthed and ‘the effect sizes (ESs) were computed using the pooled standard deviation of the intervention and control group’. After applying corrections for the small sample sizes in play in some of the initiatives, the mean effect of each class of programmes was calculated. This brings us finally to the objective of meta-analysis – a hierarchy of effectiveness - as in column three.
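The quoted phrase ‘computed using the pooled standard deviation of the intervention and control group’ describes, on the standard reading, a standardised mean difference. As a sketch only (my gloss, assuming the usual Cohen’s d with a Hedges-type small-sample correction, rather than Durlak and Wells’ own notation), the computation for a single outcome is:

    d = \frac{\bar{X}_I - \bar{X}_C}{s_p},
    \qquad
    s_p = \sqrt{\frac{(n_I - 1)s_I^2 + (n_C - 1)s_C^2}{n_I + n_C - 2}},
    \qquad
    g \approx \left(1 - \frac{3}{4(n_I + n_C) - 9}\right) d,

where \bar{X}_I and \bar{X}_C are the intervention and control group means on that outcome, s_I and s_C their standard deviations, and n_I and n_C their sample sizes. It is these per-outcome quantities that are then averaged, and averaged again, in the steps criticised below.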


Let us now move to a critique of this strategy, remembering that I am seeking out weaknesses in the logic of the whole genre and not just this example. The basic problem concerns the nature and number of the simplifications that are necessary to achieve the column of net effects. This (the key output of meta-analysis) is, has to be, and is intended to be a grand summary of summaries. Each individual intervention is boiled down to a single measure of effectiveness, which is then drawn together in an aggregate measure for that sub-class of programmes, which is then compared with the mean effect for other intervention categories. The problem with this way of arriving at a conspectus is that it squeezes out vital explanatory content about the programmes in action, in a way that renders the comparisons much less rigorous than the arithmetic appears on paper. This process of compression occurs at three points:
  1. the melding of programme mechanisms
  2. the oversimplification of programme outcomes
  3. the concealment of programme contexts


i) Melded mechanisms. The first difficulty follows from an absolutely taken-for-granted presumption about the appropriate locus of comparison in meta-analysis. I have referred to this as a ‘programme family’ above. Now on this method, it is simply assumed that the source of family resemblance is the ‘policy domain’. In one sense this is not at all unreasonable. Modern society parcels up social ills by administrative domain and assembles designated institutes, practitioners, funding regimes and programmes to tackle each problem area. Moreover, it is a routine feature of problem solving that within each domain there will be some disagreement about the best way of tackling its trademark concerns. Should interventions be generic or targeted at sub-populations? Should they be holistic or problem specific? Should they be aimed at prevention or cure? Should they have an individual focus, or be institution-centred, or be area-based? Such distinctions feed their way onwards into the various professional specialities and rivalries that feature in any policy sector. This in turn creates the field of play for meta-evaluation, which is asked to adjudicate on whether this or that way of tackling the domain problem works best. In the case at hand, we are dealing with the activities of a professional speciality, namely ‘community psychology’, which probably has a greater institutional coherence in the US than in the UK, but whose sub-divisions into ‘therapists’, ‘community educators’ and so on would be recognised the world over.


Now, this is the policy apparatus that generates the programme alternatives that generate the meta-analysis question. Whether it creates a commensurable set of comparisons and thus a researchable question is, however, a moot point. And this brings me to the nub of my first critique, which is to cast doubt on whether such an assemblage of policy alternatives constitutes a comparison of ‘like-with-like’. Any classification system must face the standard methodological expectations that it be unidimensional, totally inclusive, mutually exclusive and so forth. We have seen how the different modes of programme delivery and the range of targets form the basis of Durlak and Wells’ classification of PPMH programmes. But the question is, does this classification system provide us with programme variations on a common theme that can be judged by the same yardstick, or are they incommensurable interventions that should be judged in their own terms?


One can often gain a pointer to this issue by considering the reactions of authors whose studies have been grist to the meta-analysis mill. Weissberg and Bell (1997) were responsible for three out of the twelve studies reviewed on ‘interpersonal problem solving for 7-11 year-olds’ and their barely statistically-significant efforts are thus to be found down there in the lower reaches of the net-effects league table. They protest that their three inquiries were in fact part of a ‘developmental sequence’, which saw their intervention grow from 17 to 42 modules, and as programme conceptualisation, curriculum, training and implementation progressed, outcome success also improved in the three trials. As well as wanting ‘work-in-progress’ struck out of the review, they also point out that programmes frequently outgrow their meta-analytic classification. So their ‘person-centred’ intervention really begins to bite, they claim, when it influences whole teaching and curriculum regimes and thus becomes, so to speak, ‘environment-centred’. The important point for these authors is that the crux of their programme, its ‘identifier’ so to speak, is the programme theory. In their case, the proposed mechanism for change was the simultaneous transformation of both setting and teaching method so that the children’s thinking became more oriented to problem-solving.


Weissberg and Bell also offer some special pleading for those programmes (not their own) that come bottom of the PPMH meta-evaluation scale. Parenting initiatives, uniquely amongst this programme-set, attempt to secure improvements in children’s mental health through the mechanisms of enhancing child-rearing practices and increasing the child-development knowledge of parents and guardians. This constitutes a qualitatively different programme theory, which may see its pay-off in transformations in daily domestic regimes and in long-term development in the children’s behaviour. This theory might also be particularly sensitive to the experience of the parent-subjects (first-time parents rather than old moms and dads being readier to learn new tricks). It may also suffer from problems of take-up (such initiatives clearly have to be voluntary and the most needful parents might be the hardest to reach).


Such a programme stratagem, therefore, needs rather sophisticated testing. Ideally this would involve long-term monitoring of both family and child; it would involve close scrutiny of parental background; and, if the intervention were introduced early in the children’s lives, it would have to do without before-and-after comparisons of their attitudes and skills. When it comes to meta-analysis, no such flexibility of the evidence-base is possible and Durlak and Wells’ studies all had to face the single, standard test of success via measures monitoring pre-intervention – post-intervention changes in the child’s behaviour. A potential case for the defence of parenting initiatives thus lurks, on the grounds that what is being picked up in meta-analysis in these cases is methodological heavy-handedness rather than programme failure.


Note well that my point here is not about ‘special pleading’ as such; I am not attempting to speak up for ‘combined’ programmes or ‘parenting’ programmes or any specific members of the PPMH family. Like the typical meta-analyst, I am not close enough to the original studies to make judgements on the rights and wrongs of these particular disputes. The point I am underlining is that programme sub-categories just do not speak for themselves. The classifications are not just convenient, neutral labels that automatically allow inspection according to a common criterion. This is especially so if the sub-types follow bureaucratic distinctions, which provide for rival programme cultures, which in turn seed the framework with different programme theories.


If the classification frames do pick up different programme ideas or mechanisms, then it is important that initiatives therein be judged by appropriate standards and this can break the back of any uniform measurement apparatus. The greater the extent to which different programme mechanisms are introduced into meta-analytic comparisons, the more we bring into play differences in intervention duration and development, practitioner experience and training, subject self-selection and staying power, and so forth, which are themselves likely to influence the outcomes. In short, the categories of meta-analysis have the potential to hide, within and between themselves, very many significant capacities for generating personal and social change and we therefore need to be terribly careful about making any causal imputations to any particular ‘category’.


ii) Oversimplified Outcomes. Such a warning looms even more vividly when we move from cause to effect. This brings us to programme outcomes and the second problem with numerical meta-analysis, which is concealed in that rather tight-fisted term: the ‘mean effect’. The crucial point to recall as we cast our eyes down the outputs of meta-analysis, such as column three in table one, is that the figures contained therein are means of means of means of means! It is useful to travel up the chain of aggregation to examine how exactly the effect calculations are performed for each sub-category of programme. Recall that PPMH programmes carry the broad aims of increasing ‘psychological well-being’ and tackling ‘adjustment problems’. These objectives were put to the test within each evaluation by the standard method of performing before-and-after calculations on indicators of the said concepts.


Mental health interventions have a long pedigree and so the original evaluations were able to select indicators of change from a whole battery of ‘attitude measures’, ‘psychological tests’, ‘self-reports’, ‘behavioural observations’, ‘academic performance records’, ‘peer approval ratings’, ‘problem-solving vignettes’, ‘physiological assessments of stress’ and so on. As well as varying in kind, the measurement apparatus for each intervention also had a potentially different time dimension. Thus ‘post-test’ measures will have had different proximity to the actual intervention and in some cases, but not others, were applied in the form of third and subsequent ‘follow-ups’. Further, hidden diversity in outcome measures follows from the possibility that certain indicators (cynics may guess which!) may have been used but have gone unreported in the original studies, given limitations on journal space and pressures to report successful outcomes.


These then are the incredibly assorted raw materials through which meta-analysis traces programme effects. Now, it is normal to observe some variation in programme efficacy across the diverse aspects of personal change sought in an initiative (with it being easier to shift, say, ‘self-reported attitudes’ than ‘anti-social behaviour’ than ‘academic achievement’). It is also normal to see programme effects change over time (with many studies showing that early education gains may soon dissipate but that interpersonal-skills gains are more robust – c.f. McGuire, 1997). It is also normal in multi-goal initiatives to find internal variation in success across the different programme objectives (with the school-based programmes having potentially more leverage on classroom-based measures about ‘discipline referrals’ and ‘personal competence’ rather than on indicators shaped more from home and neighbourhood, such as ‘absenteeism’ and ‘dropout rates’). In short, programmes always generate multiple outcomes and much is to be learnt about how they work by comparing their diverse impacts within and between programmes and over time.


There is little opportunity for such flexibility in meta-analysis, however, because any one study becomes precisely that – namely ‘study x’ of, for instance, the 15 school-based studies. Outcome measurement is a tourniquet of compression. It begins life by observing how each individual changes on a particular variable and then brings these together as the mean effect for the programme subjects as a whole. Initiatives normally have multiple effects and so their various outcomes, variously measured, are also averaged as the ‘pooled effect’ for that particular intervention. This, in turn, is melded together with the mean effect from ‘study y’ from the same subset, even though it may have used different indicators of change. The conflation process continues until we gather in all the studies within the sub-type, even though by then the aggregation process may have fetched in an even wider permutation of outcome measures. Only then do comparisons begin as we eyeball the mean, mean, mean effects from other sub-categories of programmes. Meta-analysis, in short, will always generate its two-decimal-place mean effects, but since it squeezes out much ground-level variation in the outcomes, it remains open to the charge of spurious precision.
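A small numerical sketch of this chain of aggregation (in the same hypothetical Python style as above, with invented figures) shows how quickly the ground-level variation disappears:

    # Sketch of the aggregation chain described above, with invented numbers:
    # per-outcome effects are averaged into a per-study 'pooled effect', and the
    # per-study effects are averaged again into a sub-type mean, so the contrast
    # between shifting attitudes and unmoved behaviour vanishes on the way up.
    from statistics import mean

    study_x_outcomes = {"self-reported attitudes": 0.60, "observed behaviour": -0.05}
    study_y_outcomes = {"peer ratings": 0.25, "academic performance": 0.05}

    pooled_x = mean(study_x_outcomes.values())   # 0.275
    pooled_y = mean(study_y_outcomes.values())   # 0.15
    subtype_mean = mean([pooled_x, pooled_y])    # 0.2125, the 'mean of means'

    print(f"sub-type mean effect = {subtype_mean:.2f}")

The resulting sub-type ‘mean effect’ of 0.21 describes none of the four underlying outcomes, which is the sense in which the two-decimal-place figure invites the charge of spurious precision.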


iii) Concealed Contexts. My third critique puts the ‘like-with-like?’ test to a further element of all social programmes, namely the subjects who, and situations which, are on the receiving end of the initiatives. No individual-level intervention works for everyone. No institution-level intervention works everywhere. The net effect of any particular programme is thus made up of the balance of successes and failures of individual subjects and locations. Thus any ‘programme outcome’ (single, pooled or mean) depends not merely upon ‘the programme’ but also on its subjects and its circumstances. These contextual variations are yet another feature that is squeezed out of the picture in the aggregation process of meta-analysis.


Some PPMH programmes, as we have seen, tackle longstanding issues of the day like educational achievement. Transforming children’s progress in this respect has proved difficult, not for want of programmes, but because the educational system as a whole is locked into a wider range of social and cultural inequalities. The gains made under a specific initiative are thus always limited by such matters as the class and racial composition of the programme subjects and, beyond that, by the presence or absence of further educational and job opportunities. The same programme may thus succeed or fail according to how advantageous is its school and community setting. At the aggregate level, this proposition generates a rival hypothesis about success in the ‘league tables’. Unless we collect information about the prevailing contextual circumstances of the programme, it might be that the accumulation of ‘soft targets’ rather than successful programmes per se could produce a winning formula.


The same point also applies even to the seemingly ‘targeted’ initiatives. For instance, the ‘milestone’ programmes in Durlak and Wells’ analysis make use of a ‘modelling’ mechanism. The idea is that passing on the accounts of children who have survived some trauma can be beneficial to those about to face the ‘same’ transition. The success of that hypothesis depends, of course, on how authentic and how evocative are the ‘models’. And this may be expected to vary according to the background of the child facing the transition. Different children may feel that the predicaments faced in the models are not quite the same as their own, that the backgrounds of the previous ‘victims’ are not quite parallel to their own, that the programme accounts do not square with other advice they are receiving, and so on. It is quite possible (indeed quite routine) to have a programme that works well for one class of subjects but works against another and whose differential effects will be missed in the aggregation of outputs. This points to the importance of the careful targeting of programmes as a key lesson to be grasped in accumulating knowledge about programme efficacy. But this is rather unlikely in meta-analysis, which works to the rival philosophy that the best programmes are the ones that work most widely.
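A hypothetical worked figure makes the arithmetic of this masking plain. Suppose a programme shifts the outcome by +0.4 of a standard deviation for one half of its subjects and by -0.2 for the other half. The net effect entered into the meta-analysis is

    \tfrac{1}{2}(+0.4) + \tfrac{1}{2}(-0.2) = +0.1,

a modest apparent success that conceals the fact that the intervention actively worked against half of those exposed to it.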


Again, I must emphasise that the hypotheses expressed in the above two paragraphs are necessarily speculative. What they point to is the need for a careful look at subject and contextual differences in terms of who succeeds and who fails within any programme. And the point, of course, is that this is denied to meta-analysis, which needs unequivocal and uniform baseline indicators for each case. At the extreme, we can still learn from a negative net effect, since the application of an initiative to the wrong subjects and in the wrong circumstances can leave behind vital clues about what might be the right combination. No such subtlety is available in an approach that simply extracts the net effect from one study and combines it with the net effects of other programmes from the same stable to generate an average effect for that sub-class of programmes. Vital explanatory information is thus once again squeezed out automatically in the process of aggregation.


Let me now try to draw together the lessons of the three critiques of meta-analysis. What I have tried to do is cast doubt on the exactitude of the mean-effect calculations. Arithmetically speaking, the method always 'works', in that it will mechanically produce a spread of net effects of the sort we see in table one. The all-important question, of course, is whether the league table of efficacy produced in this manner should act as an effective guide to future policy-making. Should the PPMH meta-analyst advise, for instance, on cutting parenting programmes and increasing affective education and interpersonal problem-solving schemes, especially for the under-sevens? My answer would be an indubitable 'no'. The net effects we see in column three do not follow passively from the application of that sub-class of programmes but are generated by unmonitored differences in the underlying programme mechanisms, in the contexts in which the programmes are applied and in the measures used to tap the outcome data.


The message, I trust, is clear - what has to be resisted in meta-analysis is the tendency to make policy decisions by casting an eye down a net-effects column such as that in table one. The contexts, mechanisms and outcomes that constitute each set of programmes are so diverse that it is improbable that like gets compared with like. So whilst it might seem objective and prudent to make policy by numbers, the results may be quite arbitrary. This brings me to a final remark on Durlak and Wells and to some good news (and some bad). The PPMH meta-analysis is not in fact used to promote a case for or against any particular subset of programmes. With great sense and caution, the authors avoid presenting their conclusions as a thumbs-up for 'interpersonal problem solving for the very young', a thumbs-down for 'parenting programmes', and so on. Rather, they view the results in table one as an 'across the board' success and take the figures to support the extension of programmes and further research in PPMH as a whole.


Their reasoning is not so clear when it comes to step two in the argument. This research was done against the backdrop of the US Institute of Medicine's 1994 decision to exclude mental health promotion from its official definitions of preventative programmes, with a consequent decline in their professional status. Durlak and Wells protest that their findings (ESs of 0.24 to 0.93) compare favourably to the effect sizes reported routinely in psychological, educational and behavioural treatments (they report that one overview of 156 such meta-analyses came up with a mean of means of means of 0.47). And what is more, 'the majority of mean effects for many successful medical treatments such as by-pass surgery for coronary heart disease, chemotherapy to treat certain cancers …also fall below 0.5.' If one compares for just a second the nature, tractability and seriousness of the assorted problems, and the colossal differences in programme ideas, subjects, circumstances and outcome measures involved in this little lot, one concludes that we are being persuaded, after all, that chalk can be compared to cheese.


Let me make it abundantly clear that my case is not just against this particular example. There are, of course, more sophisticated examples of meta-analysis than the one analysed here, not to mention many far less exacting studies. There are families of interventions for which meta-analysis is more feasible than the case considered here, and domains where it is a complete non-starter. The most prevalent usage of the method lies in synthesising the results of clinical trials. In evidence-based medicine, one is often dealing with a singular treatment (the application of a drug) as well as unidimensional and rather reliable outcome measures (death rates). Here meta-analysis might give important pointers on what to do (give clot-busting agents such as streptokinase after a coronary) and what not to do (subject women to unnecessary gynaecological treatment). Of course, problems of 'heterogeneity' remain in EBM, but they are not of the same order as in EBP, where interventions work through reasoning subjects rather than blinded patients.

Even in EBP, meta-analysis is under continuous development. And since I have had space here for only a single shot at a rapidly moving target, I close this section with some clarification of the scope of the critique. The first thing to point out is that I am not working under the delusion that the specific points made here will wound meta-analysis mortally, or even come as a shock to exponents of its arts. For instance, the 'apples and oranges' problem has bedevilled meta-analysis since the originating study, namely Smith and Glass's (1977) review of the efficacy of psychotherapeutic treatments. Gallo (1978) attacked the very prototype for pooling a 'hodgepodge' of outcome measures, and debate has followed ever since about whether meta-analysis should operate at the level of 'fruit' or really get down to the differences between the 'honeydew' and 'cantaloupe' melons. Like all methodologies, meta-analysis is self-reflecting, and its folk language carries some rueful, ironic expressions about its tendency to meld everything to a mean, with the process of arriving at a summary effect size being known in the trade as 'smelting'. Accordingly, methods of calculating this overall effect have shifted from the early days of 'vote counting' (how many studies show success and how many find failure), to the 'weighted averages' used in Durlak and Wells' study, to controversial attempts to weigh the 'quality' of a study before tallying its recalculated effect size into the overall pile (Hunt, 1997).
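By way of illustration only, and with invented numbers throughout, the three generations of summary calculation just mentioned can be sketched side by side; none of these snippets reproduces any published review's actual procedure.

# Illustration only: three generations of 'smelting' applied to invented studies.
studies = [
    {"es": 0.10, "var": 0.02, "quality": 0.5},   # effect size, variance, judged quality
    {"es": 0.55, "var": 0.05, "quality": 1.0},
    {"es": 0.40, "var": 0.03, "quality": 0.8},
]

# 1. Vote counting: how many studies 'work'?
positive = sum(1 for s in studies if s["es"] > 0)
print(f"vote count: {positive} of {len(studies)} studies positive")

# 2. Weighted average (inverse-variance weights).
w = [1.0 / s["var"] for s in studies]
weighted = sum(wi * s["es"] for wi, s in zip(w, studies)) / sum(w)
print(f"weighted mean effect: {weighted:.2f}")

# 3. Quality weighting: scale each weight by a reviewer's quality score
#    before tallying the study into the overall pile.
wq = [wi * s["quality"] for wi, s in zip(w, studies)]
quality_weighted = sum(wi * s["es"] for wi, s in zip(wq, studies)) / sum(wq)
print(f"quality-weighted mean effect: {quality_weighted:.2f}")

Whatever the refinement, the output remains a single summary number per class of programme, which is precisely the point at issue.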

In short, there are many technical problems to be overcome. But, note well, I am not attempting a technical critique here. What I am taking issue with is the basic 'logic' of meta-analysis. The central objective for evidence-into-policy linkage on this strategy is to go in search of the stubborn empirical generalisation. To put this rather more precisely, the goal is to discover what works through the replication of positive results for a class of programmes across a range of different situations. This is rather like the misidentification of 'universal laws' with 'empirical generalisations' in natural science (Kaplan, 1964). The discovery of the former does not imply or require the existence of the latter. To arrive at a universal law we do not require that 'X-is-always-followed-by-Y'. Rather, the expectation is that 'Y' always follows 'X' in certain prescribed situations, and that we need a programme of theory and research to delimit those situations. Thus gunpowder does not always explode if a match is applied - that consequence depends (among other things) on its being dry; the pressure of a gas does not always increase linearly with temperature - unless there is a fixed mass; falling bodies like cannon balls and feathers don't follow the laws of motion - unless we take into account friction, air resistance and so on. The meta-analytic search for 'heterogeneous replication' (Shadish, 1991) is rather akin to the search for the brute empirical generalisation, in that it seeks programme success without sufficient knowledge of the conditions of success.


Note further that I extend the critique to what is sometimes called 'second-level' meta-analysis (Hunt, 1997). This approach goes beyond the idea of finding mean effects for a particular class of treatment. It assumes, correctly, that the average is made up of a range of different effects, coloured by such factors as: i) the characteristics of the clients, ii) the length and intensity of the treatment, and even iii) the type of design in the original study. Studies attempting to identify such 'mediators' and 'moderators' (Shadish and Sweeney, 1991) do make an undoubted leap forward, with some recent meta-analyses of medical trials identifying significant differences in the efficacy of treatment for different sub-groups of patients. Perhaps not surprisingly, the same 'dosage' turns out to have differential effects according to the strain and advance of the disease, as well as the age, weight, sex and metabolism of the subject. The trouble is that such lists have a habit of going on and on, with one meta-analysis of the efficacy of a vaccine for tuberculosis discovering that its effects improved the further the study site was from the equator (Colditz et al., 1995).
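The moderator question can be caricatured in a few lines of code: regress the study-level effect sizes on whatever moderators happen to have been recorded, weighting by precision. Everything below - the moderators, the data, the weights - is invented for illustration and is not the procedure of any of the studies cited here.

import numpy as np

# Invented second-level data: one row per primary study.
effect = np.array([0.70, 0.24, 0.33, 0.93, 0.36])   # study effect sizes
age    = np.array([5.0, 9.0, 13.0, 5.0, 9.0])       # mean age of subjects
weeks  = np.array([10.0, 8.0, 12.0, 6.0, 8.0])      # programme length
weight = np.array([50.0, 20.0, 33.0, 15.0, 40.0])   # e.g. inverse variances

# Weighted least squares done by hand: scale each row by sqrt(weight).
X = np.column_stack([np.ones_like(age), age, weeks])
sw = np.sqrt(weight)
coef, *_ = np.linalg.lstsq(X * sw[:, None], effect * sw, rcond=None)
print(dict(zip(["intercept", "age", "weeks"], np.round(coef, 3))))

Nothing in the arithmetic prevents the list of columns from growing - line of latitude included - and the fitted coefficients remain descriptions of associations across the assembled studies rather than explanations of them.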


Rather than a league table of effect sizes, second-level meta-analysis produces as its outcome a 'meta-regression' - a causal model identifying associations between study or subject characteristics and the outcome measure. Much else in the analysis remains the same, however. A limited range of mediators and moderators is likely to be selected, according both to the limits of the researchers' imaginations (not too many have gone for line of latitude) and to the lack of consistent information on potential mediators from primary study to primary study. And if, say, subject characteristics are the point of interest, the other dilemmas illustrated above, such as the melding of programme mechanisms and the pooling of outcome measures, may well still be evident. The result is that meta-regression offers a more complex summary of certain aspects of programme efficacy, but one that never attains the privileged status of 'generalisation' or 'law'. The verdict, according to one rather authoritative source, is as follows:


‘In many a meta-analysis we have a reliable, useful, causal description but without any causal explanation. I think that the path models have a heuristic value but often seduce the reader and the scholar into giving them more weight than they deserve.’ (Thomas Cook quoted in Hunt, 1997, p79)


From the point of view of the earlier analysis, this has to be the correct diagnosis. For the realist, programmes do not have causal powers. Interventions offer subjects resources, which they then accept or reject, and whether they do so depends on their characteristics and circumstances. And as I have attempted to show with the PPMH examples, there is an almighty range of different mechanisms and contexts at work, which produce a vast array of outcome patterns. The problem is the classic one of the over-determination of evidence by potential theories. This suggests that rather than being a process of interpolation and estimation, systematic review should attempt the task of extrapolation and explanation. This line of argument is picked up in the second part of the paper.


Any further discussion of meta-analysis must be curtailed at this stage. Readers interested in the latest debate on its second and even third levels should compare the points of view of Sutton et al. and Robertson in a recent issue of the Journal of Evaluation in Clinical Practice (2001). Attention now turns to causal interpretations of a quite different ilk.


1.3. Narrative reviews



We move to the second broad perspective of EBP, which comes in a variety of shapes and sizes and also goes by an assortment of names. In an attempt at a catch-all expression, I shall refer to them as ‘narrative reviews’. Again, I will seize upon a couple of examples, but once again note that my real objective is to capture and criticise the underlying ‘logic’ of the strategy.


The overall aim of the narrative approach is, in fact, not a million miles from the numerical strategy in that a family of programmes is examined in the hope of finding those particular approaches that are most successful. Broadly speaking, one can say that there is a slightly more pronounced explanatory agenda within the strategy, with the reviews often having something to say on why the favoured interventions were in fact successful. In methodological terms, however, the significant difference between the two core approaches lies in respect of the following (related) questions:

  • What information should be extracted from the original studies?
  • How should the comparison between different types of initiatives be achieved?


On the first issue, the contrast is often drawn with the numerical approach, which is said to have eyes only for outcomes, whereas narrative reviews are pledged to preserve a 'ground-level view' of what happened within each programme. The meaning of such a commitment is not always clear, and this accounts for some variations in the way it is put into practice. One detectable strategy at work is the tell-it-like-it-is philosophy of attempting to preserve the 'integrity' or 'wholeness' of the original studies. Another way of interpreting the desideratum is that the review should extract enough of the 'process' of each programme covered so that its 'outcome' is rendered intelligible to the reader of the review.


The interesting thing about these goals is that they are very much the instincts of ‘phenomenological’ or ‘case study’ or even ‘constructivist’ approaches to applied social research. The great difficulty facing narrative review is to preserve these ambitions over and over again as the exercise trawls through scores or indeed hundreds of interventions. The same problem, incidentally, haunts comparative-historical inquiry as it attempts to preserve ‘small n’ historical detail in the face of ‘large N’ international comparisons (Ragin, 1987).


The result is that data extraction in narrative reviews is always something of a compromise. The difficulty can be highlighted by pausing briefly on a 'worst-case scenario' from the genre. Walker's Injury Prevention for Young Children: A Research Guide (1996) provides a particularly problematic example for, despite its sub-title, it is little more than an annotated bibliography. The format is indeed an index of inquiries, taking the shape of what look suspiciously like the paraphrased abstracts of 370 studies. The working assumption, one supposes, is that abstracts provide the 'essence' of the programme and the 'gist' of the research findings, and that these are the vital raw ingredients of review. The summaries are then subdivided into nine sub-sections dealing with different forms of injury (from asphyxiation to vehicle injury). Each subsection begins with a brief, paragraph-length introduction, which goes no further than describing the broad aims of the programmes therein.


Now, whilst these pocket research profiles provide enough material to spark interest in particular entries and indifference with respect to others, such a process hardly begins the job of comparing and contrasting effectiveness. Seemingly, this task is left to the reader, who is in a hopeless position to do so systematically, since the entries merely log the concerns of the original authors and thus contain utterly heterogeneous information. Note that I am not slaying a straw man in the name of a whole strategy here, merely pointing out an acute form of the underlying tension between the goals of 'revealing the essence of each case study' and 'effecting a comparison of all case studies'.


A giant step on from here, within the narrative tradition, is what is sometimes called the 'descriptive-analytical' method. Here studies are combed according to a common analytical framework, with the same template of features being applied to each study scrutinised. A model example, from the same field of childhood accident prevention, is to be found in the work of Towner et al. (1996). Appendix H of that study supplies an example of a 'data extraction form', alas too long to reproduce here, which is completed for each study reviewed and collects information as follows (a sketch of such a record as a simple data structure is given after the list):


1. Author, year of publication, place
2. Target group, age-range, setting
3. Intervention aims and content
4. Whether programme is educational, environmental or legislative
5. Whether alliances of stakeholders were involved in programme implementation
6. Methodology employed
7. Outcome measures employed
8. Summary of important results
9. Rating of the ‘quality of evidence’
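Purely as an illustration of the shape of such a template (the field names paraphrase the list above; nothing here is Towner et al's actual instrument), the record might be rendered as a data structure whose largely free-text fields signal the narrative character of the data.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtractionRecord:
    # One row of a narrative review's data matrix; most fields are free text.
    author_year_place: str
    target_group: str                      # age range, setting
    intervention_aims: str
    intervention_type: str                 # educational / environmental / legislative
    stakeholder_alliance: str
    methodology: str
    outcome_measures: List[str] = field(default_factory=list)
    key_results: str = ""
    quality_rating: str = ""               # reviewer's judgement of the evidence

# An entirely made-up entry, included only to show the shape of the record.
example = ExtractionRecord(
    author_year_place="Hypothetical et al., 1995, UK",
    target_group="5-9 year olds, primary schools",
    intervention_aims="teach street-crossing skills through roadside practice",
    intervention_type="educational",
    stakeholder_alliance="schools, road safety officers",
    methodology="before and after study with comparison group",
    outcome_measures=["observed behaviour"],
    key_results="skills improved; gains maintained at one year",
    quality_rating="reasonable evidence, small sample",
)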


This represents a move from trying to capture the essence of the original studies via an 'abstract/summary' to attempting to locate their key aspects on a 'data matrix'. The significant point, of course, is that, being a narrative approach, the cell entries in these tabulations are composed mainly of text. This text can be as full or as brief as research time and inclination allow. In its raw form, the information on any single study can thus easily run to two or three pages. This may include the extraction of a key quotation from the original authors, plus some simple tick-box information on, say, the age range of the target group. The record of findings can range from effect sizes to the original authors' thoughts on the policy implications of their study. And furthermore, the review may also stretch to include the reactions of the reviewer (e.g. 'very good evaluation - no reservations about the judgements made'). In short, the raw data of narrative review can take the form of quite a mélange of different types of information.


Normally such reviews will also provide appendices with the entire matrix on view in a condensed form, so one can literally ‘read across’ from case to case, comparing them directly on any of the chosen features. A tiny extract from Towner’s (1996) summary of 56 road safety initiatives is reproduced in table two in order to provide a glimpse of the massive information matrix.


TABLE TWO ABOUT HERE


There is little doubt that such a procedure provides an incomparable overview of 'what is going on' in a family of evaluation studies. It has one clear advantage over the numerical approach in that it admits a rather more sophisticated understanding of how programmes work. Meta-analysis is somewhat trapped by its 'successionist' understanding of causality. The working assumption is that it is the programme 'x' that causes the outcome 'y', and the task is to see which subtype of initiative (x1, x2, x3 etc.) or which mediators (m1, m2, m3 etc.) have the most significant impact.


In narrative review, the programme is not seen as a disembodied feature with its own causal powers. Rather, programmes are successful when (imagine reading across the data matrix/extraction form) the right intervention type, with clear and pertinent objectives, comes into contact with an appropriate target group, and is administered and researched by an effective stakeholder alliance, working to common goals, in a conducive setting, and so on. This logic utilises a 'configurational' approach to causality, in which outcomes are considered to follow from the alignment, within a case, of a specific combination of attributes. Such an approach to causation is common amongst narrative inquiry of all kinds and features strongly, once again, in the methodology of comparative historical research (Ragin, 1987, p. 125). There is no need to labour the point here, but one can see a clear parallel, for instance, between the logic used for explaining programme efficacy and that applied in charting the success of a social movement. Explanations for why England experienced an early industrial revolution turn on identifying the combination of crucial conditions such as 'technological innovation', 'weak aristocracy', 'commercialised agriculture', 'displaced peasantry', 'exploitable empire' and so on (Moore, 1966, ch. 1).
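The contrast with the successionist model can be put crudely in a couple of lines: instead of asking whether a single variable raises the outcome on average, the configurational reading asks whether a particular combination of attributes is jointly present within a case. The attributes below are invented for illustration.

# Illustration only: success read as the joint presence of attributes within a case.
def configuration_met(case: dict) -> bool:
    return (case["clear_objectives"]
            and case["appropriate_target_group"]
            and case["effective_alliance"]
            and case["conducive_setting"])

programme_a = {"clear_objectives": True, "appropriate_target_group": True,
               "effective_alliance": True, "conducive_setting": False}
print(configuration_met(programme_a))   # False: one missing ingredient breaks the recipe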


However, the capacity of the narrative approach for teasing out such a 'holistic' understanding of programme success by no means completes the trick of accomplishing a review. The next and crucial question concerns how comparisons are brought to bear upon such configurational data. How does one draw transferable lessons when one set of attributes might have been responsible for success in programme 'A' and a different combination might account for the achievements of programme 'B'? My own impression is that the 'logic of comparison' has not been fully articulated in the narrative tradition, and so there is considerable variation in practice on this matter. Simplifying, I think there are three discernible 'analytic strategies' in use, as discussed in the following paragraphs. All three are often found in combination, though there is a good case for arguing that approach 'b' (below) is the most characteristic:


a) Let the reviews speak for themselves.
No one should forget that systematic review is a long and laborious business, which itself occupies the full research cycle. It all begins with difficult decisions on the scope of the programmes to be brought under review. Then the sample of designated cases is located by all manner of means, from key-word searches to word-of-mouth traces. Then comes the wrestling match to find a common template on which to code the multifarious output, be it learned, grey, or popular. Then comes the matter of data extraction itself, with its mixture of mechanical coding and guesswork, as the researchers struggle to translate the prose styles of the world into a common framework. Then, if the business is conducted properly, comes the reliability check, with its occasionally shocking revelations about how research team members can read the same passage to opposite effects. Then, finally, comes the quart-into-pint-pot task of compressing the mass of data into an intelligible set of summary matrices and tables.


Little wonder, then, that there is a tendency in some systematic reviews to draw breath and regard the completion of this sequence as the 'job done'. On this view, EBP is seen largely as a process of condensation, a reproduction in miniature of the various incarnations of a policy idea. The evidence has been culled painstakingly to enable a feedback loop into fresh policy-making - but the latter is a task to be completed by others.


However sympathetic one might be to the toils of information science, this 'underlabourer' notion of EBP must be given very short shrift. Evidence, new or old, numerical or narrative, diffuse or condensed, never speaks for itself. The analysis and usage of data is a sense-making exercise and not a mechanical one. Social interventions are descriptively inexhaustible. Of the infinite number of events, actions and thoughts that make up a programme, rather few get set down in research reports and rather fewer get recorded in research review. It is highly unlikely, for instance, that the evidence base will contain data on the astrological signs of project participants or the square-footage of the project headquarters. It is relatively unlikely that the political allegiance of the programme architects or the size of the intervention funding will be systematically reviewed. In other words, certain explanations for programme success or failure are favoured or disqualified automatically - according to the particular information selected for extraction.


Since it cannot help but contain the seeds of explanation, systematic review has to acknowledge and foster an interpretative agenda. If, contrariwise, the evidence base is somehow regarded as 'raw data' and the interpretative process is left incomplete and unhinged, this will simply reproduce the existing division of labour in which policy-makers do the thinking and other stakeholders bear the consequences. EBP has to be somewhat bolder than this or it will provide a mere decorative backwash to policy-making, a modern-day version of the old ivory-tower casuistry, 'here are some assorted lessons of world history, prime minister - now go rule the country'.


b) Pick out exemplary programmes for an 'honourable mention'.
However modestly, most narrative reviews do in fact strive to make the extra step into policy recommendations, and we turn next to the main method of drawing out the implications of the evidence base. This is the narrative version of the 'best buy' approach, and it takes the form of identifying 'exemplary programmes'. In practice this often involves using the 'recommendations for action' or 'executive summary' section of the review to re-emphasise the winning qualities of the chosen cases. The basis for such a selection has been covered already - successful programmes follow from the combination and compatibility of a range of attributes captured in the database. A de facto comparison is involved in such manoeuvres, in that less successful programmes obviously do not possess such configurations. These are entirely legitimate ambitions for a narrative review. But the question remains - how is the policy community able to cash in on the merits of exemplary programmes? What are the lessons for future programming and practice? What theory of 'generalisation' underlies this approach?


The answer to the latter question is captured in the phrase 'proximal similarity' (Shadish, 1991). Basically, the idea is to learn from review by following the successful programmes. What makes them successful is the juxtaposition of many attributes, and so the goal for future programme design is to imitate the programme as a whole, or at least try to gather in as many similarities as possible. In the varied world of narrative reviews this, I believe, is the closest we get to a dominant guiding principle. (Note that it is almost the opposite of the idea of heterogeneous replication - it is a sort of 'homogeneous replication'.)


There are several problems with using the proximal-similarity principle as the basis for drawing transferable policy lessons. The first is that it is impossible to put into practice (quite a drawback!). As soon as one shifts a programme to another venue, there are inevitably differences in infrastructure, institutions, practitioners and subjects. Implementation details (the stuff of narrative reviews) are thereby subtly transformed, making it almost impossible to plan for or to create exact similarities. And if one creates only 'partial similarities' in new versions of the programmes, one does not know whether the vital causal configurations on which success depends have been broken. Such a dismal sequence of events, alas, is very well known in the evaluation community. Successful 'demonstration projects' have often proved the devil to replicate, precisely because programme outcomes are the sum of their assorted, wayward, belligerent parts (Tilley, N., 1996).


A second problem with a configurational understanding of programme efficacy stems from its own subtlety. There is no assumption that the making of successful programmes follows a single 'recipe'. Interventions may reach the same objectives by different means. Yet again, the parallel with our example from comparative historical method rears its head. Having identified the configuration of conditions producing the first industrial revolution does not mean, of course, that the second and subsequent industrial nations followed the same pathway (Tilly, C., 1975, ch. 2). What counts in history and public policy is having the right idea for the apposite time and place.


Wise as this counsel is, it is a terribly difficult prospect to identify the potential synergy of an intervention and its circumstances on the basis of the materials identified in a narrative review. As we have seen, programme descriptions in narrative reviews are highly selective. Programme descriptions are also 'ontologically flat' (Sayer, 2000). This is quite literally the case as one follows the various elements of each intervention across the data matrix. What Sayer and other realists actually have in mind with the phrase, however, is that the various 'properties' extracted in review are just that - a standard set of observable features of a programme. Change, by contrast, actually takes place in an 'ontologically deep' social world in which the dormant capacities of individuals are awakened by new ideas, from which new ways of acting emerge.


However painstaking a research review, such 'causal powers' and 'emergent properties' of interventions are not the stuff of the data extraction form. In the standard mode of presenting narrative reviews (recall table two), one gets a picture of the particulars of each intervention without getting a feel for the way that the programmes actually work. For instance, several reviews of road safety education, including the aforementioned work by Towner, have indicated that children can better recall rules for crossing busy roads if these are learned using roadside 'mock-ups' rather than through diagrams and pictures. Why this might be the case is presumably to do with the benefits of learning practical skills 'in situ' rather than 'on paper'. In other words, we are not persuaded that one particular programme configuration works better than another because of the mere juxtaposition of its constituent properties. Rather, what convinces is our ability to draw upon an implicit, much-used and widely useful theory.


Note that my criticism here is not to do with missing evidence or wrong-headed interpretation. Rather I am arguing that the extraction of exemplary cases in narrative overview is not simply a case of ‘reviewing the evidence’ but depends, to a considerable degree, on the tacit testing of submerged theories. Why this state of affairs is so little acknowledged is something of a mystery. Perhaps it is the old empiricist fear of the ‘speculative’ and the ‘suprasensible’. Be that as it may, such an omission sits very oddly with the design of interventions, which is all about the quest for improvement in ‘programme theory’. No doubt the quest for verisimilitude in road safety training came from the bright idea that learning from actual practice (if not through mistakes!) was the way ahead. The moral of the tale, however, is clear. The review process needs to acknowledge this vital characteristic of all programmes and thus appreciate that the evidence base is theory-laden.


c) Meld together some common denominators of programme success.
As indicated above, the broad church of narrative review permits several means to forward policy ends. So as well as the piling high of descriptive data and the extraction of exemplars, there is also a range of hybrid strategies which owe something to the method of meta-analysis. I have space to mention just one variant on this theme here. The more extensive the review, the more likely it is that a range of good practice has cropped up across the various sub-families of initiatives contained therein. Instead of handling these 'one at a time', for flattery by means of imitation, there are also attempts to extract some 'collective lessons' in narrative form.

Whilst the cell entries take the form of text rather than numbers (table two again), the basic database of a narrative review is still in the form of a matrix. This allows one to pose questions about which manifestations of the various programme properties tend to be associated with programme success. Given the narrative form of the data, this cannot be put as a multivariate question about the respective correlates of programme success. But what is possible is to take each programme property in turn and try to discover what the exemplars share in common in respect of that particular 'attribute'. Such an exercise tends to arrive in the 'summary and conclusions' section as a series of recommendations about good practice in programme design and implementation. This is, of course, very much in keeping with the emphasis on intervention processes that pervades the qualitative evaluation traditions.
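In code, the 'common denominator' move might be caricatured as follows: take the exemplary cases and, property by property, ask what recurs. The records and attribute values below are invented, echoing the sort of recommendations quoted in the next paragraph.

from collections import Counter

# Invented exemplary programmes, each described attribute by attribute in text.
exemplars = [
    {"development": "inter-agency collaboration", "targeting": "sensitive to individual differences"},
    {"development": "inter-agency collaboration", "targeting": "whole-class delivery"},
    {"development": "inter-agency collaboration", "targeting": "sensitive to individual differences"},
]

# For each programme property in turn, what do the exemplars share?
for attribute in ("development", "targeting"):
    value, count = Counter(case[attribute] for case in exemplars).most_common(1)[0]
    print(f"{attribute}: '{value}' appears in {count} of {len(exemplars)} exemplars")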

Thus, of programme development, the message of such a review will often be something like, 'inter-agency collaboration is essential to develop different elements of a local campaign' (Towner, 1996). Of the programme targets, the shared quality of successful interventions might be that 'initiatives must be sensitive to individual differences' (ibid.). In other words, the theory of generalisation in this narrative sub-strategy has shifted to what might be called the search for the 'common denominators of implementation success'. This is a more confusing and confused project because it cross-cuts the 'heterogeneous-replication' and 'proximal-similarity' principles. I find it very hard to fault the collected wisdom of recommendations such as the above. But, because they combine the disembodied, variable-based method of meta-analysis with a narrative emphasis on process, they often end up looking rather trite. Often we are presented with bureaucratic truisms, with certain of the recommendations for programme effectiveness applying, no doubt, to organising the bus trip to Southend.

Conclusion


What conclusions should be drawn from this brief review of the main methods of systematic review? Let me make clear what I have attempted to say and what I have tried not to say. I have disavowed what I think is the usual account, a preference for the numerical or the narrative. Neither, though I might be accused of doing so, have I declared a plague on both houses. The key point, I think, is this. There are different ways of explaining why a particular programme has been a success or a failure. Any particular evaluation will thus capture only a partial account of the efficacy of an intervention. When it comes to the collective evaluation of whole families of programmes, the lessons learned become even more selective.

Accordingly, any particular model of EBP will be highly truncated in its explanation of what has worked and what has not. In this respect, I hope to have shown that meta-analysis ends up with decontextualised lessons and that narrative review concludes with overcontextualised recommendations. My main ambition has thus been to demonstrate that there is plenty of room left right there in the middle for an approach that is sensitive to the local conditions of programme efficacy but then renders such observations into transferable lessons2. And what might such an approach look like?
All the basic informational mechanics have to be there, of course, as does the eye for outcome variation of meta-analysis, as does the sensitivity to process variation of narrative review. Putting all of these together is a rather tough ask. The intended synthesis will not follow from some mechanical blending of existing methods (we caught a glimpse of such an uneasy compromise in my final example). What is required, above all, is a clear logic of how research review is to underpin policy prognostication. The attentive reader will have already read my thoughts on the basic ingredients of that logic. I have mentioned how systematic review often squeezes out attention to programme 'mechanisms', 'contexts' and 'outcome patterns'. I have noted, too, how 'middle-range theories' lurk tacitly in the selection of best buys. These concepts are the staples of realist explanation (Pawson and Tilley, 1997), which leads me to suppose there might be promise in a method of 'realist synthesis'.

Acknowledgement


The author would like to thank two anonymous referees for their reminder of the dangers of over-simplification in methodological debate. All remaining over-simplifications are, of course, my own work.


Bibliography


Campbell, D. (1974) 'Evolutionary Epistemology'. In P. Schilpp (ed.) The Philosophy of Karl Popper. La Salle: Open Court.
Colditz, R. et al. (1995) 'Efficacy of BCG vaccine in the prevention of tuberculosis', Journal of the American Medical Association, Vol. 291, pp. 698-702.
Davies, P. (2000) 'The relevance of systematic review to educational policy and practice', Oxford Review of Education, Vol. 26, pp. 365-378.
Dixon-Woods, M., Fitzpatrick, R. and Roberts, K. (2001) 'Including qualitative research in systematic reviews', Journal of Evaluation in Clinical Practice, Vol. 7, pp. 125-133.
Durlak, J. and Wells, A. (1997) 'Primary Prevention Mental Health Programs for Children and Adolescents: A Meta-Analytic Review', American Journal of Community Psychology, Vol. 25, pp. 115-152.
Gallo, P. (1978) 'Meta-analysis - a mixed metaphor', American Psychologist, Vol. 33, pp. 515-517.
Harden, A., Weston, R. and Oakley, A. (1999) 'A review of the effectiveness and appropriateness of peer-delivered health promotion interventions for young people', EPPI-Centre Research Report, Social Science Research Unit, Institute of Education, University of London.
Hunt, M. (1997) How Science Takes Stock. New York: Russell Sage Foundation.
Kaplan, A. (1964) The Conduct of Inquiry. New York: Chandler.
McGuire, J.B., Stein, A. and Rosenberg, W. (1997) 'Evidence-based Medicine and Child Mental Health Services', Children and Society, Vol. 11, pp. 89-96.
Moore, B. (1966) Social Origins of Dictatorship and Democracy. New York: Peregrine.
Pawson, R. and Tilley, N. (1997) Realistic Evaluation. London: Sage.
Popay, J., Rogers, A. and Williams, G. (1998) 'Rationale and standards for the systematic review of qualitative literature in health service research', Qualitative Health Research, Vol. 8, pp. 341-351.
Ragin, C. (1987) The Comparative Method. Berkeley: University of California Press.
Robertson, J. (2001) 'A critique of hypertension trials and of their evaluation', Journal of Evaluation in Clinical Practice, Vol. 7, pp. 149-164.
Sayer, A. (2000) Realism and Social Science. London: Sage.
Shadish, W. and Sweeney, R. (1991) 'Mediators and moderators in meta-analysis', Journal of Consulting and Clinical Psychology, Vol. 59, pp. 883-893.
Shadish, W., Cook, T. and Leviton, L. (1991) Foundations of Program Evaluation. Newbury Park: Sage.
Smith, M. and Glass, G. (1977) 'Meta-analysis of psychotherapeutic outcome studies', American Psychologist, Vol. 32, pp. 752-760.
Sutton, A., Abrams, K. and Jones, D. (2001) 'An illustrated guide to meta-analysis', Journal of Evaluation in Clinical Practice, Vol. 7, pp. 135-148.
Tilley, N. (1996) 'Demonstration, exemplification, duplication and replication in evaluation research', Evaluation, Vol. 2, pp. 35-50.
Tilly, C. (1975) Big Structures, Large Processes, Huge Comparisons. New York: Russell Sage Foundation.
Towner, E., Dowswell, T. and Jarvis, S. (1996) Reducing Childhood Accidents - The Effectiveness of Health Promotion Interventions: A Literature Review. Health Education Authority.
Walker, B. (1996) Injury Prevention for Young Children: A Research Guide. Westport, Connecticut: Greenwood Press.
Weissberg, R. and Bell, D. (1997) 'A Meta-Analytic Review of Primary Prevention Programs for Children and Adolescents: Contributions and Caveats', American Journal of Community Psychology, Vol. 25, pp. 207-214.



Figure one: The research/policy sequence

[Figure: two panels sharing the stages Design, Implementation, Impact. Panel i, 'Evaluation Research (standard mode)': research starts at the design stage and ends at the impact stage. Panel ii, 'Meta-analysis/Synthesis/Review': research starts at the impact stage and ends back at the design stage, with a feedback loop into fresh policy-making.]

Table One: The 'classify', 'tally' and 'compare' model of meta-analysis

Type of Program                          n     Mean effect
Environment-centred
  School-based                          15     0.35
  Parent-training                       10     0.16a
Transition Programs
  Divorce                                7     0.36
  School-entry / change                  8     0.39
  First-time mothers                     5     0.87
  Medical/dental procedure              26     0.46
Person-centred programs
  Affective education
    Children 2-7                         8     0.70
    Children 7-11                       28     0.24
    Children over 11                    10     0.33
  Interpersonal problem solving
    Children 2-7                         6     0.93
    Children 7-11                       12     0.36
    Children over 11                     0     -
  Other person-centred programmes
    Behavioural approach                26     0.49
    Non-behavioural approach            16     0.25

Notes: a. Non-significant result.
Source: Durlak and Wells, 1997: 129





Table Two: The 'summary matrix' in narrative review

Road Safety Education - Experimental Programmes

Yeaton & Bailey, 1978 (USA)
  Injury target group (age in years): 5-9
  Study type: before and after study
  Type of intervention: one-to-one real-life demonstrations to teach 6 street-crossing skills
  Healthy alliance: crossing patrol, schools
  Outcome measure(s): observed behaviour
  Outcome(s): skills improved from 48% to 97% and from 21% to 96%; maintained at one year

Nishioka et al., 1991 (Japan)
  Injury target group (age in years): 4-6
  Study type: before and after study with 2 comparison groups
  Type of intervention: group training aimed at dashing-out behaviour
  Healthy alliance: no details
  Outcome measure(s): reported behaviour
  Outcome(s): improvements in behaviour dependent on level of training; 40% of children shown to be unsafe AFTER training

Plus 7 more experimental programmes

Road Safety Education - Operational Programmes

Schioldborg, 1976 (Norway)
  Injury target group (age in years): pre-school
  Study type: before and after study with control group
  Type of intervention: traffic club
  Healthy alliance: parents
  Outcome measure(s): injury rates, observed behaviour
  Outcome(s): no effect on traffic behaviour; reported 20% reduction in casualty rates and 40% in Oslo

Anataki et al., 1986 (UK)
  Injury target group (age in years): 5
  Study type: before and after study with control group
  Type of intervention: road safety education using Tufty materials
  Healthy alliance: schools, road safety officers
  Outcome measure(s): knowledge
  Outcome(s): all children improved on test score over the 6-month period; however, children exposed to the Tufty measures performed no better than the non-intervention group

Plus 4 more experimental programmes

Road Safety Education - Area-Wide Engineering Programmes: 10 programmes
Road Safety Education - Cycle helmet studies: 10 programmes
Road Safety Education - Enforcement legislation: 3 programmes
Road Safety Education - Child restraint loan schemes: 9 programmes
Road Safety Education - Seat belt use: 9 programmes

Source: Towner et al., 1996


1 For details of, and access to, EBP players such as the Cochrane Collaboration, Campbell Collaboration, EPPI-Centre etc., see the Resources section of the ESRC UK Centre for Evidence Based Policy and Practice website at http://www.evidencenetwork.org

2 So whilst I arrive at it by very different means, my conclusion mirrors much recent thinking about the need for a 'multi-method' strategy of research review and a 'triangulated' approach to the evidence base. It is in the area of evidence-based health care that this strategy has received most attention. There are any number of calls for this amalgamation of approaches (Popay et al., 1998), as well as methodological pieces (Dixon-Woods et al., 2001) on the possible roles for qualitative evidence in the combined approaches. Indeed, the 'first attempt to include and quality assess process evaluations as well as outcome evaluations in a systematic way' was published in 1999 (Harden et al.). Alas, it is beyond the scope of this paper to assess these efforts, which will be the subject of a forthcoming paper. A glance at the conclusion to Harden et al.'s review (p. 129), however, will show that these authors have not pursued the realist agenda.