Fundamentally, scientific inquiry is the same in all disciplines. Whether in anthropology, biology, economics, chemistry, or education, the research process is one of rigorous reasoning supported by a dynamic interaction of methods, theories, findings, and perspectives. The idea is to build models and theories that express understandings in a way that can be tested. The accumulation of scientific knowledge is nonlinear and indirect. It is an accumulation of studies, not one study, that establishes a base of understanding in a field.
The scientific research enterprise depends on a vibrant community of researchers and is guided by fundamental principles.
This section provides information on
The field of educational evaluation is based on educational research methods that have evolved over many years. Evaluation goes a step beyond research to make judgments of the merit, worth, and value of a given entity for a given situation. The common thread among evaluation models is that they all seek to identify the value, merit, and worth of a program, course, class, or other entity. They vary in the methods used, how participants are involved, data collection tools, and ways of reporting.
Daniel Stufflebeam, a long-standing leader in the field of program evaluation, has identified 20 approaches to evaluation in the literature that he sees as legitimate ways of evaluating programs. He has grouped them into three categories, although the approaches do not always fall clearly into one of them: questions/methods-oriented approaches, improvement/accountability approaches, and social agenda/advocacy approaches. The approach we are using in the CLIPs is most closely aligned with utilization-focused evaluation, which has features of each of these categories.
To learn more about evaluation methods and perspectives we recommend the following book:
To learn about a range of evaluation theories and models, see:
For the CLIPs, we are using an approach referred to as Evaluative Inquiry, which fits within the broad category of evaluation methods referred to as utilization-focused evaluation. In her book, “Evaluative Inquiry: Using Evaluation to Promote Student Success”, Parsons (2002) explains why she uses the term “evaluative inquiry” rather than “evaluation.” There are two reasons. First, “evaluation” is often seen as something that someone else does “to” schools. Second, the term “evaluative inquiry” balances attention to the investigation itself with its purpose. That is, the process of conducting the inquiry can be as useful as the ultimate findings. The benefits of evaluative inquiry are in both the process and its informative results.
Oriented toward the future, evaluative inquiry is about finding what you value and then moving toward it. Contrasting “what is” with “what is desired” involves making judgments, but not the kind of blaming and criticism often associated with evaluation. Instead, evaluative inquiry invites self-reflection and offers the perspective of “critical friends” who can identify discrepancies between what you want and what you have. It emphasizes analysis and synthesis of information (rather than data collection) and places the inquiry process in the hands of educators rather than outsiders (though outsiders can serve in a coaching role).
The evaluative inquiry approach builds the inquiry process alongside the teaching and other actions that faculty and others undertake within a program, course, or class. Figure 1 below illustrates this relationship.
As the CLIP work progresses, each CLIP may move to more structured evaluation designs, such as experimental or quasi-experimental research designs that use various types of treatment and control groups.
Evaluative inquiry has similarities to action research. Action research is a particular way of researching your own learning and practice. It is sometimes referred to as practitioner research or practitioner-led or practitioner-based research. The central idea is one of self-reflection. In traditional forms of research, researchers conduct their research on other people. In action research, researchers do research on themselves in cooperation with other people, who are doing the same thing. No distinction is made between who is a researcher and who is a practitioner. Action research involves learning in and through both action and reflection. (See McNiff, 2003.)
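The five steps of conducting an evaluative inquiry are:
1. Positioning the evaluative inquiry
2. Planning the evaluative inquiry
3. Collecting the data
4. Analyzing and synthesizing the data
5. Communicating the inquiry results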
These steps are typical of most evaluations. The steps are illustrated in Figure 2, below.
Student learning, rather than teaching, is emphasized in both program action and the inquiry. This shift from the traditional focus on teaching to a focus on learning is illustrated by the following story:
One day a man was walking his dog down the street when he ran into his neighbor.
He said, “Guess what! I taught my dog how to talk!”
“That’s incredible,” the neighbor exclaimed. “Have him say a few words.”
“Oh,” the man replied. “I just taught him. He didn’t learn.”
[Figure 2 Goes Here]
The key to the planning task is keeping data collection, analysis, synthesis, and communication well focused on the questions being investigated and the interests of users of the inquiry results. In this situation, the CLIP members and those they have identified are the primary users of the inquiry results. They will use the results to refine the initiative and its vision, to communicate with others who share responsibility for the work, and to refine the evaluative inquiry focus for the next phase of investigation.
Once you’ve identified the inquiry users, build the analyses, syntheses, and data collection questions; this will ensure a focus on the relationship between learning experiences and learning outcomes. This is not to say that one should put blinders on and ignore unexpected or unplanned paths, but it is to say that following those paths should be an intentional decision. Before following a new direction, carefully think through its impact on the final results and determine what resources will be required for the work.
Then come the practicalities of developing timelines and tasks. These give you a sense of the magnitude of the work and prepare you for the next important decision: budget. You’ll want to discover how much time and money you can afford at the beginning so that you can carry your plan to completion. It is better to do thorough analyses, syntheses, and communications on a smaller amount of data than to gather extensive data and shortchange the analyses, syntheses, and communications—the steps of meaning making and use.
The data collection task consists of gathering the data and preparing initial summaries. Gathering the data has three parts: (1) determining who will be the source of information; (2) developing data collection instruments, such as interview guides and questionnaires; and (3) collecting the information (e.g., conducting interviews). Preparing initial summaries of data may involve applying criteria of quality, identifying themes in qualitative data, and/or calculating basic statistics for quantitative data. (These topics are addressed in other parts of this website.)
Analyzing and synthesizing information goes beyond the usual data summaries. Tables of average ratings from questionnaires or test scores are meaningless without links to the instruction experienced. The analyses also incorporate research about the issue being investigated. The enhanced insights about the links between the outcomes and the learning experiences being investigated will reward your investment of time and resources.
Once the analyses are complete, you can synthesize the findings by contrasting the actual situation with the vision set out by the CLIP or others they identify. The differences between the two will enable you to derive ideas about next steps for the CLIP, which might decide to refine the implementation process or perhaps adjust the vision.
There is no shame in adjusting the vision. The vision needs to be flexible. When program leaders articulate a vision, they’re setting out a rough idea of where they are headed, not a permanent target. Being comfortable adjusting the vision is as important as being willing to change the implementation process. Remember, we are in a dynamic environment. We cannot expect our vision to stay fixed in such a context.
In the final stage, you communicate to the appropriate parties the findings that are based on your syntheses. The process brings the users back to their initial intentions so they can see what changes to make in their vision and work.
This section contains information on five topics:
Much of the work that the CLIPs will be doing initially is likely to fall in the area of qualitative research and synthesis of existing research. Later the work may involve more quantitative methods.
Qualitative research is grounded in the assumption that individuals construct social reality in the form of meanings and interpretations, and that these constructions tend to be transitory and situational. Frequently meanings and interpretations are determined through studying situations intensively in their natural setting. Case studies are often written as the means of reporting.
Qualitative research typically involves qualitative data, i.e., data that are in narrative rather than numerical form and are obtained through methods such as interviews, on-site observations, and focus groups. Such data are analyzed by looking for themes and patterns, which involves reading, rereading, and exploring the data. How the data are gathered will greatly affect the ease of analysis and the utility of the findings.
Numerous books are available on this topic. Two suggested books are:
These books are available through the BC Professional Growth Center. Check with Sarah Phinney. Many other fine books exist as well.
Analysis of data from qualitative data collection tools can be greatly facilitated by using a word processing application to organize the data in tables that can be sorted by respondent, question, and other characteristics.
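The sorting idea described above can also be sketched in code. The example below is only an illustration, written in Python rather than in a word processor. The respondent labels, roles, and response texts are invented, and the field names are assumptions, not part of the CLIP instructions.

```python
from operator import itemgetter

# Hypothetical interview excerpts; every label and text here is invented.
responses = [
    {"respondent": "R2", "question": 1, "role": "faculty", "text": "Students asked more questions."},
    {"respondent": "R1", "question": 2, "role": "student", "text": "Group work helped me most."},
    {"respondent": "R1", "question": 1, "role": "student", "text": "The pace felt rushed."},
    {"respondent": "R2", "question": 2, "role": "faculty", "text": "Peer review took extra time."},
]

# Sort by question, then respondent, to read all answers to one question together.
by_question = sorted(responses, key=itemgetter("question", "respondent"))

# Sort by respondent, then question, to read one person's answers in order.
by_respondent = sorted(responses, key=itemgetter("respondent", "question"))

for row in by_question:
    print(f"Q{row['question']}  {row['respondent']:<3} {row['role']:<8} {row['text']}")
```

Keeping one response per row, with a separate column for each characteristic, is what makes both sort orders possible from the same table.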
Instructions for Entering Qualitative Data
Systematic reviews, meta-analyses, and research syntheses involve searching the literature, assembling studies for review, coding and combining studies, and interpreting and reporting the results. The rationale behind using this approach is that examining many related, well-conducted studies is more productive than relying on a single study. People engaged in research review try to bring order to a body of material and understand what good evidence there is to justify the claim that an intervention, program, or policy is effective.
Definitions and Value of this Method
In the process of review, bias can occur when a reviewer looks only at materials that fit his/her ideological or theoretical preference or when a reviewer pays attention only to reports published in journals and not other materials. Bias can also be introduced when the studies themselves exhibit statistical bias or it is difficult to tell the level of bias in a study’s results.
The value of doing conscientious systematic reviews, meta-analyses, and research syntheses is that looking at many studies is a better way of understanding the effects of a program than just looking at the results of one study. Findings from a single study cannot be easily generalized to other settings, but a review of many studies might shed some light on the possibilities for replication of program effects in other situations.
A research review also shows where the good evidence is, where it is absent, and where it is ambiguous. A reviewer can weed out untrustworthy studies or may find out that no high-quality evaluation has been done on a particular topic. The reviewer may also expose flaws in conventional literature reviews of a topic.
The Systematic Review Process
The best way to learn about systematic review is to read some review reports, take a methods course, or become involved in a review process that is governed by high-quality standards.
The major steps in conducting a systematic review, meta-analysis, or research synthesis are:
There are international, governmental, and technical resources available to help researchers conduct high-quality reviews. Systematic reviews, meta-analysis, and research synthesis are growing in use and quality and are resulting in added value for the field of evaluation as a whole.
For research-based evaluation, two types of designs can be used to estimate the effect of a program, policy, or other kind of treatment: randomized experiments and quasi-experiments. In randomized experiments, participants are assigned to treatment and control groups at random, while quasi-experiments use non-random groups. Randomized experiments generally produce more credible results but are more difficult to implement than quasi-experiments, because to do them successfully researchers have to intervene in an established setting or create their own setting. In contrast, a quasi-experiment can often be implemented with minimal or even no interruption to a program or situation, and in some cases it is the only feasible approach.
Randomized experiments are the most credible way to measure program impact. They are not easy to design, implement, and maintain, and they are time-consuming and expensive, but they produce credible, unbiased information about the effect of a program. Randomly assigning people to treatment and control groups strengthens the causal connection that can be drawn between the program intervention and its outcomes. With a growing focus on accountability, there is increased demand at the national level for data from evaluations based on randomized experiments.
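As a rough illustration of what random assignment means in practice, the sketch below shuffles a list of participants and splits it in half. The participant labels, the function name, and the fixed seed are assumptions made for the example; a real study would follow its own documented assignment procedure.

```python
import random

def assign_groups(participants, seed=None):
    """Randomly split participants into treatment and control groups of near-equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible and auditable
    shuffled = list(participants)  # copy so the original roster is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical roster of 20 students.
students = [f"student_{i:02d}" for i in range(1, 21)]
groups = assign_groups(students, seed=42)
print(len(groups["treatment"]), len(groups["control"]))
```

Because chance alone decides who lands in each group, differences between the groups at the outset are not systematically related to the program.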
An evaluation that seeks to assess the impact of a program should consider an experimental design. However, researchers must determine whether the time, resources, and political will exist to support an experimental study and whether there is a research platform where the study can be conducted, e.g., a set of projects, schools, or sites.
Some basic tasks when implementing a randomized experiment are:
The most difficult part of experimental research is implementing the experiment. This calls for:
Quasi-experimental designs are divided into two types:
Neither type is better than the other and the choice of which to use depends on the research situation, potential problems with validity, and the kinds of design features that might be added. Adding features can greatly strengthen these basic research designs and in doing so, the distinction blurs between the two types (different times vs. different participants).
Quasi-experimental designs can be enhanced by adding these features:
One or more of these features can be added to each of the designs described below.
In a before-after comparison, the participant or group is measured before (pretest) and after (posttest) a program is introduced and the difference in results measures the effect. This design is easy to use, but highly susceptible to a number of potential biases or alternate explanations for the observed change (i.e., threats to internal validity—history, maturation, seasonality, testing, instrumentation, attrition, and statistical regression). For this reason, it is rarely a useful design, as is. However, the other three quasi-experimental designs are variations of the before-after approach that take into account and adjust for problems with validity, and thus are more useful.
Interrupted Time-Series Designs
In the interrupted time-series design, several observations or measurements are taken at different points before the program and then several afterwards. The pretest measurements are used to project a trend line to predict changes that would occur without the program. Then the actual trend of results after the program is compared to the predicted trend and any differences are attributed to the program effect. In some cases results might show an immediate or abrupt effect (an “interruption” of the trend line).
This design addresses some potential bias, but can be improved by adding treatment interventions, other comparison groups, and/or different outcome variables. Strengths of this design are: several post-treatment observations help researchers see if the effects increase or decline over time; it can be implemented without a comparison group; and it can be used with a small group or individuals. Weaknesses are that it is resource intensive and often requires sophisticated statistical methods to analyze the data.
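The logic of projecting a pre-program trend and comparing it with what actually happened can be sketched as follows. This is a minimal illustration with invented scores and a simple straight-line trend; it is an assumption of the example that a linear trend is adequate, and real time-series analyses usually need more sophisticated methods.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: six measurements before the program, four after.
pre_times = [1, 2, 3, 4, 5, 6]
pre_scores = [60.0, 61.0, 62.0, 63.0, 64.0, 65.0]
post_times = [7, 8, 9, 10]
post_scores = [70.0, 71.5, 72.0, 73.5]

# Project the pre-program trend into the post-program period.
slope, intercept = fit_line(pre_times, pre_scores)
projected = [intercept + slope * t for t in post_times]

# The gap between observed and projected values is the estimated program effect.
differences = [obs - proj for obs, proj in zip(post_scores, projected)]
for t, obs, proj, diff in zip(post_times, post_scores, projected, differences):
    print(f"time {t}: observed {obs:.1f}, projected {proj:.1f}, difference {diff:+.1f}")
```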
Nonequivalent Group Designs
The nonequivalent group design compares participants who receive different treatments and who are in non-random groups (e.g., self-selected groups, preexisting groups, etc.). The posttest-only design is the most basic of this type, i.e., one group receives the program and the other does not and both are assessed afterwards. Assessment results are then compared. The main problem with this design is that the groups are not necessarily comparable in characteristics that might affect the results. Researchers can enhance credibility by using groups that are as similar as possible.
The nonequivalent group design is easy to implement and although data must be collected from two groups, this can often be done without too much disruption. It can be strengthened by adding one or more pretests for both groups and comparing pretest/posttest growth. Other options are to add more comparison groups to the posttest-only design, expose the different groups to different levels of treatment, and/or measure more than one outcome variable.
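To see why adding a pretest helps, consider the small sketch below. The scores are invented and the variable names are assumptions of the example. It compares the two groups on posttest scores alone and then on pretest-to-posttest growth.

```python
from statistics import mean

# Invented scores for two preexisting (non-random) groups of four students each.
program = {"pretest": [52, 48, 60, 55], "posttest": [61, 58, 66, 63]}
comparison = {"pretest": [58, 62, 65, 59], "posttest": [61, 64, 69, 62]}

def mean_gain(group):
    """Average pretest-to-posttest growth for one group."""
    return mean([post - pre for pre, post in zip(group["pretest"], group["posttest"])])

# Posttest-only comparison: ignores where each group started.
posttest_difference = mean(program["posttest"]) - mean(comparison["posttest"])

# Pretest/posttest comparison: contrasts the growth of the two groups.
program_gain = mean_gain(program)
comparison_gain = mean_gain(comparison)
gain_difference = program_gain - comparison_gain

print(f"posttest-only difference: {posttest_difference:+.2f}")
print(f"difference in gains:      {gain_difference:+.2f}")
```

In this invented data the program group finishes slightly lower on the posttest but started well behind, so the comparison of gains tells a different story from the posttest-only comparison.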
In a regression-discontinuity design, participants are ranked for a specific variable (Quantitative Assignment Variable) and then assigned to treatment groups based on their ranking. Common variables used in this type of design are measures of need or merit, e.g., people with high need are in one group, those with low need in another. The analysis is done by graphing the scatter of data points for each person by the assignment variable and the outcome, and then figuring the regression lines for the treatment group and the comparison group. If the regression lines are the same, there is no effect, but if they are a different height and/or slope there is a program effect.
This design can also be enhanced by adding design features. It often produces more credible results than the nonequivalent group design, but it is harder to implement because of the rule for assigning participants to groups, which may not be practical in some settings. Generally, a randomized experiment is more powerful than, and preferable to, the regression-discontinuity design.
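The regression-line comparison at the heart of the regression-discontinuity design can be sketched as follows. The cutoff, the assignment scores, and the outcomes are all invented, and the data are deliberately noise-free so the result is easy to follow. Fitting one straight line per group is an assumption of this example, not a recommendation for real analyses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

CUTOFF = 50  # hypothetical rule: participants scoring below 50 on a need measure get the program

# Invented, noise-free records of (assignment score, outcome).
records = [
    (30, 40.0), (35, 42.5), (40, 45.0), (45, 47.5),              # below cutoff: program group
    (50, 45.0), (55, 47.5), (60, 50.0), (65, 52.5), (70, 55.0),  # at or above cutoff: comparison
]

treated = [(x, y) for x, y in records if x < CUTOFF]
untreated = [(x, y) for x, y in records if x >= CUTOFF]

t_slope, t_intercept = fit_line([x for x, _ in treated], [y for _, y in treated])
c_slope, c_intercept = fit_line([x for x, _ in untreated], [y for _, y in untreated])

# Compare the height of the two regression lines at the cutoff.
effect_at_cutoff = (t_intercept + t_slope * CUTOFF) - (c_intercept + c_slope * CUTOFF)
print(f"estimated program effect at the cutoff: {effect_at_cutoff:+.2f}")
```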
Using Statistics in Evaluation
Statistics are used in a variety of ways to support evaluation. It is important to start planning the statistical analyses at the same time that planning for an evaluation begins. Decisions about analysis techniques to use and statistics to report are affected by levels of measurement of the variables in the study, the questions being addressed, and the type and level of information that clients expect in the report.
An important step in planning is to select the levels of measurement for key variables of interest in the evaluation study. Doing this helps determine the right analytical techniques to use. In 1946, Stevens identified four levels of measurement that have been used to describe empirical data ever since: nominal, ordinal, interval, and ratio. Nominal and ordinal levels of measurement are categorical. Nominal measures use numbers only as labels that assign data to different groups. Ordinal measures assign data to categories that have some kind of ordered relationship, e.g., successful, partially successful, unsuccessful. Ordinal variables play a key role in evaluation studies. Interval and ratio measurements are on a numeric continuum and can be mathematically manipulated. Ratio variables are the same as interval variables, except that ratio variables have a true zero point.
Descriptive and Inferential Statistics
Descriptive statistics are numbers used to describe a group of items. Inferential statistics are computed from a sample drawn from a larger population with the intention of making generalizations from the sample about the whole population. The accuracy of inferences drawn from a sample is critically affected by the sampling procedures used. Four guiding principles for sample selection are:
When evaluators do not have access to the full population and thus cannot use probability sampling, they cannot make statistical generalizations from the sample to the population. However, they can still make inferences if they explain how the sample may differ from the population and what potential sources of bias exist. Statistical tests (such as the chi-square test and the t test) can be used to test the statistical significance or generalizability of relationships between variables.
Statistical Hypothesis Testing
To apply inferential statistics, researchers use a procedure called statistical hypothesis testing. First they identify a statistical hypothesis that states the relationship between two variables of interest. This is stated in the form of a null hypothesis, i.e., a statement that the program/intervention has no effect on the intended outcome. If the data lead to rejection of the null hypothesis, then the conclusion is that the program has had an effect. If the null hypothesis is not rejected, then the data do not provide sufficient evidence that the program had an effect. This is not the same as showing that the program had no effect.
Errors of Type I (false-positive) and Type II (false-negative) can cause a discrepancy between the test results and the true situation, calling the conclusions into question. Evaluators need to look at features of the evaluation design that affect error and take steps to avoid or minimize the more costly type of error. For example, a false-positive conclusion that a program has an effect when it really does not could mean that future funding is wasted on a program that does not work. In this case, evaluators want to protect against false-positive error as much as possible. It is a delicate choice, however, because the more that you protect against one type of error, the more vulnerable the study will be to the opposite type of error. (Statistical textbooks have reference lists of features that are likely to generate false-positives and false-negatives.)
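The reasoning behind testing a null hypothesis can be made concrete with a small sketch. Note that this example uses an exact permutation test, a technique that is not discussed in this summary, because it can be written with the Python standard library alone; the scores and function name are invented. The test asks: if the program had no effect, so that group labels were arbitrary, how often would a difference at least as large as the observed one arise?

```python
from itertools import combinations
from statistics import mean

def permutation_test(treated, control):
    """Exact two-sided permutation test on the difference in group means."""
    observed = mean(treated) - mean(control)
    pooled = treated + control
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way of relabeling which cases count as "treated".
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = mean(group_a) - mean(group_b)
        if abs(diff) >= abs(observed) - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return observed, extreme / total

# Invented outcome scores for five program participants and five non-participants.
treated_scores = [78, 82, 85, 88, 90]
control_scores = [70, 72, 75, 77, 80]

observed_difference, p_value = permutation_test(treated_scores, control_scores)
alpha = 0.05  # tolerated false-positive rate, matching a 95 percent confidence level
print(f"observed difference: {observed_difference:.1f}, p-value: {p_value:.4f}")
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```

Lowering `alpha` guards more strongly against a false-positive conclusion but makes a false-negative conclusion more likely, which is the trade-off described above.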
Selecting a Statistical Confidence Level and Reporting the Confidence Interval
Another step in using inferential statistics is to decide on the statistical confidence level for the study. The confidence level expresses how much evidence evaluators want before concluding that the program produced the observed effect, and therefore how well protected they are against a false-positive error. In social science, a 95 percent confidence level is generally used. This means that if the program truly had no effect, there would be only a 5 percent chance of the test producing a false-positive result. Another way of saying this is that the evaluator can be 95 percent confident that the sample findings were not simply the result of random variation.
In many studies a 90 percent or 80 percent confidence level is adequate and can reduce the size of the sample needed in the study and thus the cost of the study. Where the costs of committing a false-negative error are high, an 80 percent confidence level is called for.
When the null hypothesis is rejected using a confidence level of 95 percent, the evaluator can state that the result is statistically significant at the 95 percent confidence level. This indicates that the relationship between two variables in the sample is unlikely to be due to chance alone and probably reflects a real relationship in the larger population of study.
When reporting program effects, evaluators should clearly state the effects falling within a certain range and also state the confidence interval or margin of error (e.g., plus or minus 2 percent).
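As an illustration of how a margin of error relates to the confidence level, the sketch below computes a confidence interval for a proportion. It uses the standard normal approximation, which is an assumption of the example and works best with reasonably large samples; the survey counts are invented.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 percent
    margin = z * sqrt(p * (1 - p) / n)
    return p, margin, (p - margin, p + margin)

# Invented result: 240 of 400 sampled students met the learning outcome.
p, margin95, interval95 = proportion_confidence_interval(240, 400, confidence=0.95)
_, margin80, interval80 = proportion_confidence_interval(240, 400, confidence=0.80)

print(f"estimate {p:.1%}, plus or minus {margin95:.1%} at 95 percent confidence")
print(f"estimate {p:.1%}, plus or minus {margin80:.1%} at 80 percent confidence")
```

For the same sample, a lower confidence level gives a narrower interval, which is why a lower level can also reduce the sample size needed to reach a given margin of error.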
Testing Statistical Significance for Nominal and Ordinal-Level Variables
The chi-square test is a statistical tool that evaluators can use to test the statistical significance of relationships between variables with any number of nominal-level categories. (Chi-square tests are also frequently used with ordinal scales.) It can test for differences among three or more groups or compare two or more samples. For example, a chi-square test can show whether one or more ethnic groups tend to benefit differently from a program as compared to other groups. Statistical Package for the Social Sciences (SPSS) is one commonly used software program for running this test.
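For readers who want to see what the test computes, here is a minimal sketch of the chi-square statistic for a contingency table. The counts and group labels are invented, and the function name is an assumption of the example. The critical value of 3.841 is the standard tabled value for one degree of freedom at the 95 percent confidence level.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Invented counts. Rows: Group A, Group B. Columns: improved, did not improve.
observed_counts = [
    [30, 20],
    [20, 30],
]

statistic, df = chi_square(observed_counts)
CRITICAL_VALUE_DF1_95 = 3.841  # tabled chi-square value for df = 1 at the 0.05 level
print(f"chi-square = {statistic:.2f} with {df} degree(s) of freedom")
print("significant at the 95 percent level" if statistic > CRITICAL_VALUE_DF1_95 else "not significant")
```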
Some things for evaluators to know about the chi-square test are:
This summary has covered the concept of statistical significance, that is, whether a relationship between variables in a sample can be generalized to the population of study. In statistics, a separate judgment is made about the magnitude of the program effect. It is a statement of the practical importance of the measured effect, e.g., is a statistically significant 1% increase in student achievement scores of any real consequence? There are no standards for interpreting the size of a program effect, and it is up to evaluators to find comparable figures. Interpreting the magnitude of an effect is a judgment call.
To help in this regard, there are a number of measures of association that determine the strength of the relationship between two variables and whether one of the variables is dependent on (affected by) the other. Measures of association for nominal data are: Phi squared, Cramer’s V, Pearson’s contingency coefficient, Goodman/Kruskal’s tau, and Lambda. If the data are rankings, the statistic used is Spearman’s r. Measures of association for ordinal data are: Goodman/Kruskal’s gamma, Kendall’s tau-b, Stuart’s tau-c, and Somers’ D. Interval data uses Pearson’s r.
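Two of these measures can be sketched briefly. The example below computes Pearson’s r for interval data and Spearman’s rank correlation, calculated here as Pearson’s r applied to ranks, with tied values sharing an average rank. The data and function names are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rank_correlation(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Invented data: tutoring hours and a score that rises steadily but not linearly.
hours = [1, 2, 3, 4, 5]
scores = [10, 20, 25, 60, 200]

r = pearson_r(hours, scores)
rank_r = spearman_rank_correlation(hours, scores)
print(f"Pearson's r = {r:.3f}, Spearman's rank correlation = {rank_r:.3f}")
```

The rank-based measure only asks whether higher values of one variable go with higher values of the other, so it reaches its maximum here even though the relationship is far from a straight line.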
Selecting Appropriate Statistics
There are three categories of criteria for evaluators to consider when selecting the most appropriate statistical technique for their study. These categories focus on the evaluation questions, the measurement methods, and the type of audience for the study.
Key questions to ask in each area are:
When evaluators address impact questions and want to select a technique to estimate or predict program impact, they need to look at the level of measurement (nominal, ordinal, or interval) to make this decision. The best and most common technique used for nominal data is a contingency table that displays frequency counts. Contingency tables and frequency distributions are also the best option for analysis of ordinal data. Evaluators have the widest range of choices for techniques to analyze interval data. When they want to explain an effect (dependent variable) by other variables, they often use regression analysis.
When multiple indicators are used to measure a program effect, evaluators can use two basic strategies to sort the measures or units and reduce the data to a smaller number of factors. The strategy to use when measures are pre-set is to aggregate the different measures, weight them, and sum them up. The other strategy is to use analytical techniques to identify patterns in the measures, e.g., factor analysis that finds groupings among variables to reduce the number of factors. Discriminant function analysis is a way to sort the units of study by high and low performance and then to look at what other characteristics predict levels of performance. Cluster analysis can also be used to identify similar groupings among participants or units in a study.
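The first strategy, aggregating pre-set measures with weights, can be sketched in a few lines. The measure names, values, and weights below are invented; choosing the weights is a judgment the evaluation team would need to make and justify.

```python
# Invented measures, each already expressed on a 0-to-1 scale.
measures = {"attendance_rate": 0.90, "course_completion": 0.75, "satisfaction": 0.80}

# Hypothetical weights reflecting how much each measure should count.
weights = {"attendance_rate": 0.2, "course_completion": 0.5, "satisfaction": 0.3}

def composite_score(measures, weights):
    """Weighted average of the measures; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(measures[name] * weights[name] for name in weights) / total_weight

score = composite_score(measures, weights)
print(f"composite score: {score:.3f}")
```

Measures on different scales would need to be put on a common scale before being combined this way.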
Evaluators must also consider other criteria when selecting a statistical technique: sample size (is it too small to demonstrate an effect?); number of observations recorded (e.g., with two or more observations, evaluators can analyze change over time); distribution of the units along each variable (is the sample range wide enough for study?); and level of precision of the data (could respondents make the fine distinctions asked for in the data or not?). Evaluators must also decide how to handle outliers in the data and make sure the statistics will be accessible to their audience.
When reporting statistical results, clarity is essential. Some tips for presenting data analysis are:
Statistics never speak for themselves, so evaluators must take great care to ensure that they speak with statistics accurately and clearly.
Using Regression Analysis
Correlation and regression are powerful tools that are frequently used in evaluation and applied research. Regression analysis is used to describe relationships, test theories, and make predictions with data from experimental or observational studies, linear or nonlinear relationships, and continuous or categorical predictors. The user must select specific regression models that are appropriate to the data and research questions.
Many practical questions involve the relationship between a dependent variable (Y) and a set of independent variables (X1, X2, X3, etc.) where scores are measured for N cases. For example, a study might be designed to predict performance (Y) using information on years of experience (X1), an aptitude test (X2), and participation in a training program (X3). The multiple regression equation calculates the predicted Y value for individual cases. The correlation between observed Y value and predicted Y value is called the multiple correlation coefficient, R.
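The performance example can be sketched in code. The data below are invented, and the approach shown, solving the least-squares normal equations directly, is a teaching simplification chosen so the example needs only the Python standard library. In practice a statistics package would be used.

```python
from math import sqrt

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, row):
    """Predicted Y for one case."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))

def correlation(xs, ys):
    """Pearson correlation between two lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Invented cases: (years of experience X1, aptitude score X2, attended training X3 as 0/1).
rows = [(1, 50, 0), (3, 55, 1), (5, 60, 0), (7, 52, 1),
        (9, 65, 0), (11, 58, 1), (2, 70, 1), (6, 45, 0)]
performance = [38, 47, 51, 56, 59, 67, 53, 45]  # invented Y values

coefficients = fit_multiple_regression(rows, performance)
predicted = [predict(coefficients, row) for row in rows]
multiple_r = correlation(performance, predicted)  # multiple correlation coefficient R

print("intercept and coefficients:", [round(c, 3) for c in coefficients])
print(f"multiple R = {multiple_r:.3f}")
```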
Using this example, regression models can be designed to address these types of questions and more:
Comparing Two Groups
Regression analysis can be used to compare two groups. For example, it can be used to graphically test the difference between the mean post-test scores of two different groups of students who received different training. The two groups are nominal points on the X axis and the post-test score is measured on the Y-axis, with scatter points of data positioned for both groups and a regression line that intersects the two group means and represents the predicted value of post-test scores. The regression coefficient is the difference between the two means. A more powerful research design would be to include pre-test measures. In this case, regression analysis uses the pretest scores to predict the posttest scores. Posttest scores located above the regression line are better than predicted and those below show lower than predicted performance on the posttest. Regression analysis provides measures of statistical significance for the variables.
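The statement that the regression coefficient equals the difference between the two group means can be checked with a short sketch. The scores are invented, and group membership is coded 0 or 1, a common convention assumed here for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented post-test scores; group 0 received training A, group 1 received training B.
group = [0, 0, 0, 0, 1, 1, 1, 1]
posttest = [70, 72, 68, 74, 78, 80, 76, 82]

slope, intercept = fit_line(group, posttest)

mean_group0 = mean(y for g, y in zip(group, posttest) if g == 0)
mean_group1 = mean(y for g, y in zip(group, posttest) if g == 1)

print(f"regression coefficient: {slope:.1f}")
print(f"difference in means:    {mean_group1 - mean_group0:.1f}")
print(f"intercept (group 0 mean): {intercept:.1f}")
```

With a 0/1 group code, the intercept is the mean of the group coded 0 and the line passes through both group means.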
Mediation Analysis with Regression
Regression can be used to describe and test conceptual models of how a program works, providing a useful framework for examining and improving key components of a program. A simple test of this type would focus on the causal relationship between the level of program implementation (X) and a certain outcome (Y). A more complicated model and one that provides more information about how the program produces its effect is a study where the program (X) has an effect on an intervening mediator variable (M), which in turn has an impact on the outcome (Y). For example, the goal of a school drug prevention program is to reduce the intention of adolescents to use marijuana. The program is presumed to increase knowledge about the effects of marijuana, which in turn is presumed to decrease intention to use marijuana. Evaluators can use regression to test the validity of this model. Results of a regression analysis can be presented in visual form or as a table with key numbers from the analysis listed.
Some special concepts that are likely to be useful in evaluation applications of regression analysis are:
In conclusion, be thoughtful rather than mechanical with data analysis. A big advantage people have over the computer is that they can ask, “Does this make sense?” Do not lose this advantage. Get close to your data.