**Contents**

- Program/Course/Class Evaluation
  - Books
- Conducting an Evaluative Inquiry
  - Evaluative Inquiry Definition
  - Action and Inquiry in Tandem: Similarities to Action Research
  - The Five Steps of Evaluative Inquiry
  - Positioning the Evaluative Inquiry
  - Planning the Evaluative Inquiry
  - Collecting the Data
  - Analyzing and Synthesizing the Data
  - Communicating the Inquiry Results
  - Resources
- Research Designs/Methods
  - Qualitative Research
  - Resources on Qualitative Research
  - Organizing Qualitative Data for Thematic Analysis
  - Systematic Reviews, Meta-Analysis, and Research Synthesis
  - Randomized Experiments and Quasi-Experiments
  - Quantitative Research Statistics
  - Resources for Quantitative Statistics

Fundamentally, scientific inquiry is the same in all disciplines. Whether in anthropology, biology, economics, chemistry, or education, the research process is one of rigorous reasoning supported by a dynamic interaction of methods, theories, findings, and perspectives. The idea is to build models and theories that express understandings in a way that can be tested. The accumulation of scientific knowledge is nonlinear and indirect. It is an accumulation of studies, not one study, that establishes a base of understanding in a field.

The scientific research enterprise depends on a vibrant community of researchers and is guided by fundamental principles.

This section provides information on

- program/course/class evaluation designs,
- conducting an evaluative inquiry (the type of evaluation design the CLIPs will use to begin their work), and
- general research designs and methods.

The field of educational evaluation is based on educational research methods that have evolved over many years. Evaluation goes a step beyond research to make judgments of the merit, worth, and value of a given entity for a given situation. The common thread among evaluation models is that they all seek to identify that value; they vary in the methods used, how participants are involved, data collection tools, and ways of reporting.

Daniel Stufflebeam, a long-standing leader in the field of program evaluation, has identified 20 approaches to evaluation in the literature that he sees as legitimate ways of evaluating programs. He has grouped them into three categories, although the approaches do not always fall cleanly into one: questions/methods-oriented approaches, improvement/accountability approaches, and social agenda/advocacy approaches. The approach we are using in the CLIPs is most closely aligned with utilization-focused evaluation, which has features of each of these categories.

To learn more about evaluation methods and perspectives, we recommend the following book:

- Patton, M. (1997). *Utilization-focused evaluation.* Thousand Oaks, CA: Sage.

To learn about a range of evaluation theories and models, see:

- Stufflebeam, D. (2001). *Evaluation models.* New Directions for Evaluation, No. 89. San Francisco, CA: Jossey-Bass.
- Worthen, B., Sanders, J., & Fitzpatrick, J. (1997). *Program evaluation.* New York: Longman.

For the CLIPs, we are using an approach referred to as evaluative inquiry, which fits within the broad category of evaluation methods known as utilization-focused evaluation. In her book *Evaluative Inquiry: Using Evaluation to Promote Student Success*, Parsons (2002) explains why she uses the term “evaluative inquiry” rather than “evaluation.” There are two reasons: first, because “evaluation” is often seen as something that someone else does “to” schools; second, because the term “evaluative inquiry” balances attention to the investigation itself with its purpose. That is, the process of conducting the inquiry can be as useful as the ultimate findings. The benefits of evaluative inquiry are in both the process and its informative results.

Oriented toward the future, evaluative inquiry is about finding what you value and then moving toward it. Contrasting “what is” with “what is desired” involves making judgments, but not the kind of blaming and criticism often associated with evaluation. Instead, evaluative inquiry invites self-reflection and offers the perspective of “critical friends” who can identify discrepancies between what you want and what you have. It emphasizes analysis and synthesis of information (rather than data collection) and places the inquiry process in the hands of educators rather than outsiders (though outsiders can serve in a coaching role).

The evaluative inquiry approach builds the inquiry process alongside the teaching and other actions that faculty and others undertake within a program, course, or class. Figure 1, below, illustrates this relationship.

As the CLIP work progresses, each CLIP may move to more structured evaluation designs that use experimental or quasi-experimental research designs with various types of treatment and control groups.

Evaluative inquiry has similarities to action research. Action research is a particular way of researching your own learning and practice. It is sometimes referred to as practitioner research or practitioner-led or practitioner-based research. The central idea is one of self-reflection. In traditional forms of research, researchers conduct their research on other people. In action research, researchers do research on themselves, in cooperation with other people who are doing the same thing. No distinction is made between who is a researcher and who is a practitioner. Action research involves learning in and through both action and reflection. (See McNiff, 2003.)

The five steps of conducting an evaluative inquiry are:

- Positioning the inquiry
- Planning the inquiry
- Collecting the data
- Analyzing and synthesizing the data
- Communicating the inquiry findings

These steps are typical of most evaluations. The steps are illustrated in Figure 2, below.

Student learning, rather than teaching, is emphasized in both program action and the inquiry. This shift from the traditional focus on teaching to a focus on learning is illustrated by the following story:

One day a man was walking his dog down the street when he ran into his neighbor.

He said, “Guess what! I taught my dog how to talk!”

“That’s incredible,” the neighbor exclaimed. “Have him say a few words.”

“Oh,” the man replied. “I just taught him. He didn’t learn.”

[Figure 2 Goes Here]

The key to the planning task is keeping data collection, analysis, synthesis, and communication well focused on the questions being investigated and the interests of users of the inquiry results. In this situation, the CLIP members and those they have identified are the primary users of the inquiry results. They will use the results to refine the initiative and its vision, to communicate with others who share responsibility for the work, and to refine the evaluative inquiry focus for the next phase of investigation.

Once you’ve identified the inquiry users, build the analyses, syntheses, and data collection around the inquiry questions; this will keep the focus on the relationship between learning experiences and learning outcomes. This is not to say that one should put blinders on and ignore unexpected or unplanned paths, but following those paths should be an intentional decision. Before following a new direction, carefully think through its impact on the final results and determine what resources will be required for the work.

Then come the practicalities of developing timelines and tasks. These give you a sense of the magnitude of the work and prepare you for the next important decision: budget. You’ll want to discover how much time and money you can afford at the beginning so that you can carry your plan to completion. It is better to do thorough analyses, syntheses, and communications on a smaller amount of data than to gather extensive data and shortchange the analyses, syntheses, and communications—the steps of meaning making and use.

The data collection task consists of gathering the data and preparing initial summaries. Gathering the data has three parts: (1) determining who will be the source of information; (2) developing data collection instruments, such as interview guides and questionnaires; and (3) collecting the information (e.g., conducting interviews). Preparing initial summaries of data may involve applying criteria of quality, identifying themes in qualitative data, and/or calculating basic statistics for quantitative data. (These topics are addressed in other parts of this website.)

Analyzing and synthesizing information goes beyond the usual data summaries. Tables of average ratings from questionnaires or test scores are meaningless without links to the instruction experienced. The analyses also incorporate research about the issue being investigated. The enhanced insights about the links between the outcomes and the learning experiences being investigated will reward your investment of time and resources.

Once the analyses are complete, you can synthesize the findings by contrasting the actual situation with the vision set out by the CLIP or others they identify. The differences between the two will enable you to derive ideas about next steps for the CLIP, which might decide to refine the implementation process or perhaps adjust the vision.

There is no shame in adjusting the vision. The vision needs to be flexible. When program leaders articulate a vision, they’re setting out a rough idea of where they are headed, not a permanent target. Being comfortable adjusting the vision is as important as being willing to change the implementation process. Remember, we are in a dynamic environment. We cannot expect our vision to stay fixed in such a context.

In the final stage, you communicate to the appropriate parties the findings that are based on your syntheses. The process brings the users back to their initial intentions so they can see what changes to make in their vision and work.

**Books**

- Parsons, B. (2002). *Evaluative inquiry: Using evaluation to promote student success.* Thousand Oaks, CA: Corwin Press. (Note: this book is designed for K-12 settings; the examples need quite a bit of “translation” to fit a community college setting.)
- McNiff, J. (2003). *Action research: Principles and practice* (2nd ed.). New York: Routledge Falmer.
- Preskill, H., & Torres, R. (1999). *Evaluative inquiry for learning in organizations.* Thousand Oaks, CA: Sage.

This section contains information on five topics:

- qualitative research
- organizing qualitative data for thematic analysis
- systematic reviews, meta-analysis, and research synthesis
- quantitative research designs—randomized experiments and quasi-experiments
- quantitative research statistics

Much of the work that the CLIPs will be doing initially is likely to fall in the area of qualitative research and synthesis of existing research. Later the work may involve more quantitative methods.

Qualitative research is grounded in the assumption that individuals construct social reality in the form of meanings and interpretations, and that these constructions tend to be transitory and situational. Frequently, meanings and interpretations are determined by studying situations intensively in their natural setting. Case studies are often the means of reporting.

Qualitative research typically involves qualitative data, i.e., data obtained through methods such as interviews, on-site observations, and focus groups that are in narrative rather than numerical form. Such data are analyzed by looking for themes and patterns, which involves reading, rereading, and exploring the data. How the data are gathered will greatly affect the ease of analysis and the utility of findings.

**Books**

Numerous books are available on this topic. Two suggested books are:

- Patton, M. (2002). *Qualitative research & evaluation methods* (3rd ed.). Thousand Oaks, CA: Sage.
- Wholey, J., Hatry, H., & Newcomer, K. (Eds.). (2004). *Handbook of practical program evaluation.* San Francisco, CA: Jossey-Bass.

These books are available through the BC Professional Growth Center; check with Sarah Phinney. Many other fine books exist as well.

Analysis of data from qualitative data collection tools can be greatly facilitated by using a word processing application to organize the data in tables that can be sorted by respondent, question, and other characteristics.

*Instructions for Entering Qualitative Data*

- Create a three-column list of the respondents (the people to whom questionnaires have been sent), sort the list alphabetically, and assign a *participant number* to each person (starting with “1” through the total number of participants). Leave the third column blank for check marks.

  | # | Respondent Name | ✓ |
  |---|-----------------|---|
  | 1 | Allison, David | |
  | 2 | Baker, Susan | |
  | 3 | Cairns, Mary | |
  | 4 | Callahan, Jennifer | |
  | 5 | Cooper, Richard | |
  | 6 | … | |

- Create a three-column table in a word-processing application, making the first two columns narrow and the third column wide (see example below).

- Label the columns as follows:
  - Column 1 is for the *participant number*. Use a label such as **P** for participant, **T** for teacher, or **S** for student, depending on the population you are surveying and what works for you.
  - Column 2 is for the *question number*. Use a label such as **Q** or **#**, again something easy to remember.
  - Column 3 is for the *response*. Use a label such as **Response**.

  | P | Q | Response |
  |---|---|----------|

- As each questionnaire is received, write that person’s *participant number* at the top of the first page of the questionnaire and check that person’s name off the list (the third column of the respondent list is handy for these check marks).

  | # | Respondent Name | ✓ |
  |---|-----------------|---|
  | 1 | Allison, David | |
  | 2 | Baker, Susan | ✓ |
  | 3 | Cairns, Mary | |
  | 4 | Callahan, Jennifer | ✓ |
  | 5 | Cooper, Richard | |
  | 6 | … | |

- To begin entering the responses, place your cursor in the first cell of Column 1 and enter the *participant number*. For example, let’s say that Jennifer Callahan is the first person to return her questionnaire. Put a 4 in the first cell of Column 1. Then, in the first cell of Column 2, put the first *question number*, in this case “1”. In the first cell of Column 3, put her *response* to this question.

  | P | Q | Response |
  |---|---|----------|
  | 4 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |

- Continue entering all of Jennifer’s responses. For each response, first enter her *participant number*, then the *question number*, and then the *response*. If she has left a question unanswered, enter “No response” or some other placeholder in the *response* column to remind you later that her answer was not overlooked: she did not, in fact, respond to this question.

  | P | Q | Response |
  |---|---|----------|
  | 4 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
  | 4 | 2 | No response |
  | 4 | 3 | xxxxxxxxxxxxx |
  | 4 | 4 | xxxxxxxxxxxxxxxxxxxxxx |
  | 4 | 5 | xxxx |

- Continue in this fashion until all responses for all returned questionnaires have been entered. You can enter them as they come in, or wait until you have a large number of them and enter them all at once. The order in which you enter the individual questionnaires is not important.

  | P | Q | Response |
  |---|---|----------|
  | 4 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
  | 4 | 2 | No response |
  | 4 | 3 | xxxxxxxxxxxxx |
  | 4 | 4 | xxxxxxxxxxxxxxxxxxxxxx |
  | 4 | 5 | xxxx |
  | 2 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
  | 2 | 2 | xxxxxxxxxxxxxxxx |
  | 2 | 3 | xxxxxxxxx |
  | 2 | 4 | xxxxxxxxxxxxxxxxx |
  | 2 | 5 | xx |
  | 5 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
  | 5 | 2 | xxxxxxxxxxxxxxxxxxxxx |
  | 5 | 3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
  | 5 | 4 | xxxxxxxxxxxxxxxx |
  | 5 | 5 | xxx |

- When all questionnaire responses have been entered, sort them so that the information will be most useful to you. For example:
  - First sort on Column 1 (*participant number*) to group each respondent’s answers together. Because the numbers were assigned from the alphabetized list, this also puts the respondents in alphabetical order.

    | P | Q | Response |
    |---|---|----------|
    | 2 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 2 | 2 | xxxxxxxxxxxxxxxx |
    | 2 | 3 | xxxxxxxxx |
    | 2 | 4 | xxxxxxxxxxxxxxxxx |
    | 2 | 5 | xx |
    | 4 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 4 | 2 | No response |
    | 4 | 3 | xxxxxxxxxxxxx |
    | 4 | 4 | xxxxxxxxxxxxxxxxxxxxxx |
    | 4 | 5 | xxxx |
    | 5 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 5 | 2 | xxxxxxxxxxxxxxxxxxxxx |
    | 5 | 3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 5 | 4 | xxxxxxxxxxxxxxxx |
    | 5 | 5 | xxx |

  - Then sort on Column 2 (*question number*) so that you have all the responses to each question together.

    | P | Q | Response |
    |---|---|----------|
    | 2 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 4 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 5 | 1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 2 | 2 | xxxxxxxxxxxxxxxx |
    | 4 | 2 | No response |
    | 5 | 2 | xxxxxxxxxxxxxxxxxxxxx |
    | 2 | 3 | xxxxxxxxx |
    | 4 | 3 | xxxxxxxxxxxxx |
    | 5 | 3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
    | 2 | 4 | xxxxxxxxxxxxxxxxx |
    | 4 | 4 | xxxxxxxxxxxxxxxxxxxxxx |
    | 5 | 4 | xxxxxxxxxxxxxxxx |
    | 2 | 5 | xx |
    | 4 | 5 | xxxx |
    | 5 | 5 | xxx |

- Now you are ready to work with the data.
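If you prefer a spreadsheet or script to a word processor, the same long-format table can be built and sorted with Python’s pandas library. This is a minimal sketch, assuming pandas is installed; the column labels (P, Q, Response) mirror the example above, and the response text is placeholder data.

```python
import pandas as pd

# Each row is one (participant, question, response) triple, entered in any
# order, mirroring the three-column word-processing table described above.
rows = [
    (4, 1, "Response text..."), (4, 2, "No response"), (4, 3, "..."),
    (2, 1, "Response text..."), (2, 2, "..."),
    (5, 1, "Response text..."), (5, 2, "..."),
]
data = pd.DataFrame(rows, columns=["P", "Q", "Response"])

# Sort by participant to read one questionnaire at a time...
by_participant = data.sort_values(["P", "Q"])

# ...or by question to see every response to each question together.
by_question = data.sort_values(["Q", "P"])
print(by_question.to_string(index=False))
```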

Systematic reviews, meta-analyses, and research syntheses involve searching the literature, assembling studies for review, coding and combining studies, and interpreting and reporting the results. The rationale behind this approach is that examining many related, well-conducted studies is more productive than relying on a single study. People engaged in research review try to bring order to a body of material and understand what good evidence there is to justify the claim that an intervention, program, or policy is effective.

**Definitions and Value of this Method**

Definitions are:

- Systematic review is the use of strategies that limit bias in the assembly, critical appraisal, and synthesis of all relevant studies on a specific topic. (Meta-analysis may be, but is not necessarily, part of this process.)
- Meta-analysis is the statistical synthesis of the data from separate but comparable studies that leads to a quantitative summary of the pooled results.
- Research synthesis is an attempt to integrate empirical research for the purpose of creating generalizations in a way that is non-judgmental and covers the complete research base.

In the process of review, bias can occur when a reviewer looks only at materials that fit his/her ideological or theoretical preference or when a reviewer pays attention only to reports published in journals and not other materials. Bias can also be introduced when the studies themselves exhibit statistical bias or it is difficult to tell the level of bias in a study’s results.

The value of doing conscientious systematic reviews, meta-analyses, and research syntheses is that looking at many studies is a better way of understanding the effects of a program than just looking at the results of one study. Findings from a single study cannot be easily generalized to other settings, but a review of many studies might shed some light on the possibilities for replication of program effects in other situations.

A research review also shows where the good evidence is, where it is absent, and where it is ambiguous. A reviewer can weed out untrustworthy studies or may find out that no high-quality evaluation has been done on a particular topic. The reviewer may also expose flaws in conventional literature reviews of a topic.

**The Systematic Review Process**

The best way to learn about systematic review is to read some review reports, take a methods course, or become involved in a review process that is governed by high-quality standards.

The major steps in conducting a systematic review, meta-analysis, or research synthesis are:

- Specify the topic area. Identify the rationale for addressing the problem, the questions to be addressed, and relevant outcome variables, target populations, and interventions that address the problem.
- Specify the search strategy. Identify the literature to be searched (journals, reports from organizations that screen for quality studies, unpublished studies, etc.). Decide how the literature will be searched (hand search being more reliable than a machine-based search), and with what resources.
- Develop criteria for deciding which studies to include in and exclude from the review. Several rounds of selection are useful, using increasingly detailed criteria. One key test of the value of a study is to estimate its effect size (a statistical or graphical computation).
- Develop a scheme for coding the studies and their properties (e.g., details about the intervention, characteristics of the samples, definitions of specific outcomes, etc.). Code the studies and based on the results, screen out studies that do not meet the standards.
- Develop a management strategy and procedures for who will do what, when, and with what resources, and under what ground rules.
- Develop an analysis strategy which includes describing and comparing effect sizes of the different studies.
- Interpret and report the results. Produce two types of reports—1) a detailed report that contains all information so that another reviewer could conduct an identical review and 2) a summary report (hard copy and electronic) geared for users who are not researchers.
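To make the effect-size language concrete, here is a minimal sketch of the pooling step of a fixed-effect meta-analysis in Python; the effect sizes and variances are invented for illustration, not drawn from any real review. Inverse-variance weighting simply gives more precise studies more influence on the pooled estimate.

```python
import math

# Hypothetical standardized mean differences (d) and their variances,
# one pair per study included in the review.
effects = [(0.30, 0.02), (0.45, 0.05), (0.10, 0.01), (0.25, 0.03)]

# Fixed-effect (inverse-variance) pooling: weight each study by 1/variance.
weights = [1.0 / var for _, var in effects]
pooled = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

# 95% confidence interval for the pooled effect.
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled d = {pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```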

There are international, governmental, and technical resources available to help researchers conduct high-quality reviews. Systematic reviews, meta-analysis, and research synthesis are growing in use and quality and are resulting in added value for the field of evaluation as a whole.

For research-based evaluation, two types of designs can be used to estimate the effect of a program, policy, or other kind of treatment: randomized experiments and quasi-experiments. In randomized experiments, participants are assigned to a testing group at random, while quasi-experiments use non-random groups. Randomized experiments generally produce more credible results, but are more difficult to implement than quasi-experiments, because to do them successfully researchers have to intervene in an established setting or create their own setting. In contrast, a quasi-experiment can often be implemented with minimal, or even no interruption to a program or situation, and in some cases this is the only feasible approach to take.

**Randomized Experiments**

Randomized experiments are the most credible way to measure program impact. They are not easy to design, implement, and maintain, and they are time-consuming and expensive, but they produce credible, unbiased information about the effect of a program. The random assignment of people to treatment and control groups strengthens the causal connection between the program intervention and outcomes. With a growing focus on accountability, there is increased demand at the national level for data from evaluations based on randomized experiments.

An evaluation that seeks to assess the impact of a program should consider experimental design. However, researchers must determine whether the time, resources, and political will exist to support an experimental study, and whether there is a research platform where the study can be conducted, e.g., a set of projects, schools, or sites.

Some basic tasks when implementing a randomized experiment are:

- Define the experimental contrasts. Researchers first pose questions that specify comparisons they might want to study. They determine which question is the most important to pursue and what the best and most realistic condition would be for the control group so that there is a useful comparison.
- Specify the unit of random assignment. Researchers choose the unit that they will use to randomly assign, e.g., students, families, classrooms and teachers, schools, districts. Using students or families is the most efficient in terms of statistical power, but where using these units would interfere with normal operations or be seen as favoring individuals, then using classes or schools as the unit works well.
- Set a desired level of statistical power. Researchers need to understand how their design decisions affect the potential for yielding statistically significant results (statistical power); a minimal power calculation is sketched after this list. The power of an evaluation design depends on many factors, e.g., the size of differences in the data that evaluators want to be able to detect, the sample size of the comparison groups, the number of interventions being tested and comparisons being made, and whether evaluators want to analyze the effect on subgroups.
- Deal with nonparticipation, crossovers, and attrition. Results are affected when participants in a study don’t take full advantage of the program, change groups, or drop out of the study. There are ways to deal with these problems statistically and still get good measures of the effect of the program.
- Preserve randomization. Researchers need to decide when to make the random assignments, e.g., upon recruitment of potential participants, after the program evaluation is explained to them, or after a try-out period. Incentives can help keep people in the control groups invested in the study, e.g., letting them participate in a competing program that serves as the control situation, or giving them the chance to participate in the program after the study is completed.
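As promised above, here is a minimal power calculation, assuming Python with the statsmodels package; the effect size, alpha, and power values are illustrative choices, not recommendations.

```python
# How many participants per group are needed to detect a given effect size?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.3,  # standardized mean difference the study should detect
    alpha=0.05,       # significance level (Type I error rate)
    power=0.8,        # desired probability of detecting a true effect
)
print(f"About {n_per_group:.0f} participants per group are needed.")
```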

The most difficult part of experimental research is implementing the experiment. This calls for:

- Explaining the purposes, advantages, and disadvantages of experiments to the staff who are implementing the program being studied, and dealing with and countering objections to randomization, e.g., withholding treatment from the control group
- Searching for situations in which randomization can most easily be implemented and providing incentives for program staff to go along with the experimental research
- Preparing and entering into a written evaluation agreement with program staff
- Ensuring the integrity of the randomization process by having researchers, not staff, manage this part of the process
- Ensuring that the program is implemented properly and with consistency across groups or sites and collecting data on the implementation process to check for variations.

**Quasi-Experimentation**

Quasi-experimental designs are divided into two types:

- those that draw comparisons across different
**times**—the before-after design and the interrupted time-series design, and - those that draw comparisons across different
**participants**—the nonequivalent group design and the regression-discontinuity design.

Neither type is better than the other and the choice of which to use depends on the research situation, potential problems with validity, and the kinds of design features that might be added. Adding features can greatly strengthen these basic research designs and in doing so, the distinction blurs between the two types (different times vs. different participants).

Quasi-experimental designs can be enhanced by adding these features:

- treatment interventions - more than one intervention point over time, e.g., adding/taking away/adding an intervention
- comparison groups - adding one or more comparison groups, e.g., one that gets the intervention (treatment group) and one that does not (control group)
- measurement occasions - one or more pretest measures
- outcome variables - different variables measured from the same participants

One or more of these features can be added to each of the designs described below.

**Before-After Comparisons**

In a before-after comparison, the participant or group is measured before (pretest) and after (posttest) a program is introduced, and the difference in results measures the effect. This design is easy to use but highly susceptible to a number of potential biases or alternate explanations for the observed change (i.e., threats to internal validity: history, maturation, seasonality, testing, instrumentation, attrition, and statistical regression). For this reason, it is rarely a useful design as it stands. However, the other three quasi-experimental designs are variations of the before-after approach that take into account and adjust for problems with validity, and thus are more useful.

**Interrupted Time-Series Designs**

In the interrupted time-series design, several observations or measurements are taken at different points before the program and then several afterwards. The pretest measurements are used to project a trend line to predict changes that would occur without the program. Then the actual trend of results after the program is compared to the predicted trend and any differences are attributed to the program effect. In some cases results might show an immediate or abrupt effect (an “interruption” of the trend line).

This design addresses some potential bias, but can be improved by adding treatment interventions, other comparison groups, and/or different outcome variables. Strengths of this design are: several post-treatment observations help researchers see if the effects increase or decline over time; it can be implemented without a comparison group; and it can be used with a small group or individuals. Weaknesses are that it is resource intensive and often requires sophisticated statistical methods to analyze the data.
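A minimal sketch of the underlying analysis, a segmented regression, assuming Python with statsmodels; the twenty scores and the intervention point are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Invented outcome measured at 20 time points; the program starts at t = 10.
y = np.array([50, 51, 50, 52, 53, 52, 54, 53, 55, 54,
              58, 59, 60, 61, 61, 63, 64, 64, 66, 67], dtype=float)
t = np.arange(len(y))
after = (t >= 10).astype(float)            # 1 once the program is in place
time_since = np.where(t >= 10, t - 10, 0)  # time elapsed since program start

# Segmented regression: baseline trend + level change + slope change.
X = sm.add_constant(np.column_stack([t, after, time_since]))
model = sm.OLS(y, X).fit()
print(model.params)  # [intercept, pre-trend, level change, post-slope change]
```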

**Nonequivalent Group Designs**

The nonequivalent group design compares participants who receive different treatments and who are in non-random groups (e.g., self-selected groups, preexisting groups, etc.). The posttest-only design is the most basic of this type: one group receives the program and the other does not, and both are assessed afterwards. Assessment results are then compared. The main problem with this design is that the groups are not necessarily comparable on characteristics that might affect the results. Researchers can enhance credibility by using groups that are as similar as possible.

The nonequivalent group design is easy to implement and although data must be collected from two groups, this can often be done without too much disruption. It can be strengthened by adding one or more pretests for both groups and comparing pretest/posttest growth. Other options are to add more comparison groups to the posttest-only design, expose the different groups to different levels of treatment, and/or measure more than one outcome variable.

**Regression-Discontinuity Designs**

In a regression-discontinuity design, participants are ranked on a specific variable (the quantitative assignment variable) and then assigned to treatment groups based on their ranking. Common assignment variables are measures of need or merit, e.g., people with high need are in one group, those with low need in another. The analysis is done by graphing the scatter of data points for each person by the assignment variable and the outcome, and then fitting regression lines for the treatment group and the comparison group. If the regression lines are the same, there is no effect; if they differ in height and/or slope, there is a program effect.

This design can also be enhanced by adding design features. It often produces more credible results than the nonequivalent group design, but it is harder to implement because of the rule for assigning participants to groups, which may not be practical in some settings. Generally, a randomized experiment is more powerful and preferable to the regression-discontinuity design.
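A minimal sketch of the estimation logic, assuming Python with statsmodels; the scores, cutoff, and program effect are simulated, and the model uses a single common slope for simplicity.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Invented data: a "need" score assigns participants to the program
# (scores below the cutoff of 50 receive the treatment).
score = rng.uniform(0, 100, 200)
treated = (score < 50).astype(float)
outcome = 20 + 0.3 * score + 8 * treated + rng.normal(0, 3, 200)

# One regression with a treatment indicator estimates the discontinuity
# (the jump in the regression line at the cutoff).
X = sm.add_constant(np.column_stack([score, treated]))
fit = sm.OLS(outcome, X).fit()
print(f"Estimated program effect at the cutoff: {fit.params[2]:.2f}")
```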

**Using Statistics in Evaluation**

Statistics are used in a variety of ways to support evaluation. It is important to start planning the statistical analyses at the same time that planning for an evaluation begins. Decisions about analysis techniques to use and statistics to report are affected by levels of measurement of the variables in the study, the questions being addressed, and the type and level of information that clients expect in the report.

An important step in planning is to select the levels of measurement for key variables of interest in the evaluation study. Doing this helps determine the right analytical techniques to use. In 1946, Stevens identified four levels of measurement that have been used to describe empirical data ever since: nominal, ordinal, interval, and ratio. Nominal and ordinal levels of measurement are categorical. Nominal measures use numbers to assign data to different groups. Ordinal measures assign data to categories that have some kind of ordered relationship, e.g., successful, partially successful, unsuccessful. Ordinal variables play a key role in evaluation studies. Interval and ratio measurements are on a numeric continuum and can be mathematically manipulated. Ratio variables are the same as interval variables, except that ratio variables include a true zero point.

**Descriptive and Inferential Statistics**

Descriptive statistics are numbers used to describe a group of items. Inferential statistics are computed from a sample drawn from a larger population with the intention of making generalizations from the sample about the whole population. The accuracy of inferences drawn from a sample is critically affected by the sampling procedures used. Four guiding principles for sample selection are:

- The population of interest must be reasonably known and identifiable. In cases where records about a population are incomplete, evaluators must be sure this is not due to bias.
- Use a sampling technique where the probability of selecting any unit in the population can be calculated (probability sampling), e.g., using random numbers to select units (random sampling), selecting every nth unit (systematic sampling), or dividing the population into subgroups and sampling within the subgroups (stratified sampling). These techniques are sketched after this list.
- The size of the sample should be appropriate relative to the size of the population to which the findings will be generalized. There are formulas for deciding sample size based on the confidence level and amount of error that evaluators want for the study.
- Examine a probability sample to be sure it is truly representative of the larger population on critical variables, e.g., gender, race, etc., about which evaluators will generalize.
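A minimal sketch of the three probability sampling techniques named above, assuming Python with pandas; the roster of 100 units and the “day/evening” stratification variable are invented.

```python
import pandas as pd

# Invented population roster; "group" is a stratification variable.
population = pd.DataFrame({
    "id": range(1, 101),
    "group": ["day"] * 60 + ["evening"] * 40,
})

# Simple random sample of 20 units.
random_sample = population.sample(n=20, random_state=1)

# Systematic sample: every 5th unit from a random starting point.
systematic_sample = population.iloc[2::5]

# Stratified sample: 20% from each subgroup, preserving group proportions.
stratified_sample = population.groupby("group", group_keys=False).sample(
    frac=0.2, random_state=1
)
print(len(random_sample), len(systematic_sample), len(stratified_sample))
```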

When evaluators do not have access to the full population and thus cannot use probability sampling, they cannot make generalizations from the sample to the population. However, they can still make inferences if they explain how the sample may vary from the population and what potential sources of bias exist. Statistical tests (chi-square and the t-test) can be used to test the statistical significance, or generalizability, of relationships between variables.

**Statistical Hypothesis Testing**

To apply inferential statistics, researchers use a procedure called statistical hypothesis testing. First they identify a statistical hypothesis that states the relationship between two variables of interest. This is stated in the form of a null hypothesis, i.e., a statement that the program/intervention has no effect on the intended outcome. If the data lead to rejecting the null hypothesis, the conclusion is that the program has had an effect. If the null hypothesis is not rejected, the study provides no evidence that the program had an effect.

Errors of Type I (false-positive) and Type II (false-negative) can cause a discrepancy between the test results and the true situation, calling the conclusions into question. Evaluators need to look at features of the evaluation design that affect error and take steps to avoid or minimize the more costly type of error. For example, a false-positive conclusion that a program has an effect when it really does not could mean that future funding is wasted on a program that does not work. In this case, evaluators want to protect against false-positive error as much as possible. It is a delicate choice, however, because the more you protect against one type of error, the more vulnerable the study will be to the opposite type. (Statistical textbooks have reference lists of features that are likely to generate false-positives and false-negatives.)

**Selecting a Statistical Confidence Level and Reporting the Confidence Interval**

Another step in using inferential statistics is to decide on the statistical confidence level for the study. The confidence level is the amount of evidence evaluators want to have to be able to say that the conclusions of the study are correct, i.e., that the program produced the observed effect. The confidence level also shows their confidence that a false-positive error will not occur. In social science, a 95 percent confidence level is generally used. This means that the program effects found in the sample can be generalized to the entire population with only a 5 percent chance that the test has a false-positive error. Another way of saying this is the evaluator can be 95 percent confident that the sample findings were not simply the result of random variation.

In many studies a 90 percent or 80 percent confidence level is adequate and can reduce the size of the sample needed in the study and thus the cost of the study. Where the costs of committing a false-negative error are high, an 80 percent confidence level is called for.

When the null hypothesis is rejected using a confidence level of 95 percent, the evaluator can state that the result is statistically significant at the 95 percent confidence level. This tells us that a relationship between two variables in the sample reflects a real relationship in the larger population of study.

When reporting program effects, evaluators should state the effects as falling within a certain range and report the confidence interval or margin of error (e.g., plus or minus 2 percent).
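A minimal sketch of computing such a margin of error for a sample proportion, in plain Python; the survey numbers are invented for illustration.

```python
import math

# Suppose 120 of 200 sampled students (60%) report satisfaction with a program.
n, successes = 200, 120
p = successes / n

# 95% confidence interval for the proportion (normal approximation).
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{p:.1%} satisfied, margin of error ±{margin:.1%}")
print(f"95% CI: [{p - margin:.1%}, {p + margin:.1%}]")
```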

**Testing Statistical Significance for Nominal and Ordinal-Level Variables**

The chi-square test is a statistical tool that evaluators can use to test the statistical significance of relationships between variables with any number of nominal-level categories. (Chi-square tests are also frequently used with ordinal scales.) It can test for differences among three or more groups or compare two or more samples. For example, a chi-square test can show whether one or more ethnic groups tend to benefit differently from a program as compared to other groups. The Statistical Package for the Social Sciences (SPSS) is a commonly used software program for these tests.
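SPSS is one option; as a free alternative, here is a minimal sketch using Python’s scipy, with an invented contingency table of program completion by group.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented contingency table: rows are three groups,
# columns are outcome categories (completed vs. did not complete).
observed = np.array([
    [45, 15],
    [30, 30],
    [38, 22],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
# A p-value below 0.05 suggests completion rates differ across groups.
```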

Some things for evaluators to know about the chi-square test are:

- It is quite sensitive to sample size. For precision and usefulness the sample should be large enough; the general rule is to have an expected frequency of no less than 5 in each cell. If there are cells with fewer than 5, combining cells (and thus reducing the number of categories) helps.
- Chi-square can be used no matter how the variables were measured.
- Chi-square results do not tell how strongly two variables are related. In addition to the chi-square test, evaluators need to use measures of strength to find out the magnitude of the relationship.

This summary has covered the concept of statistical significance, i.e., whether a relationship between variables in a sample can be generalized to the population of study. In statistics, a separate judgment is made about the magnitude of the program effect. It is a statement of the practical importance of the measured effect, e.g., is a statistically significant 1% increase in student achievement scores of any real consequence? There are no standards for interpreting the magnitude of a program effect, and it is up to evaluators to find comparable figures. Interpretation of the magnitude of an effect is a judgment call.

To help in this regard, there are a number of measures of association that determine the strength of the relationship between two variables and whether one of the variables is dependent on (affected by) the other. Measures of association for nominal data are phi squared, Cramér’s V, Pearson’s contingency coefficient, Goodman and Kruskal’s tau, and lambda. If the data are rankings, the statistic used is Spearman’s r. Measures of association for ordinal data are Goodman and Kruskal’s gamma, Kendall’s tau-b, Stuart’s tau-c, and Somers’ d. For interval data, use Pearson’s r.
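As one concrete example, here is a minimal sketch computing Cramér’s V from a chi-square statistic, assuming Python with scipy; the contingency table is the same invented one used above.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[45, 15], [30, 30], [38, 22]])
chi2, p, dof, expected = chi2_contingency(observed)

# Cramér's V: chi-square rescaled to a 0-1 measure of association strength.
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"Cramér's V = {cramers_v:.3f}")  # ~0.1 weak, ~0.3 moderate, ~0.5 strong
```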

**Selecting Appropriate Statistics**

There are three categories of criteria for evaluators to consider when selecting the most appropriate statistical technique for their study. These categories focus on the evaluation questions, the measurement methods, and the type of audience for the study.

Key questions to ask in each area are:

**Question-Related Criteria**

- Is generalization from the sample to the population desired?
- Is the causal relationship between an alleged cause and alleged effect of interest? Is it an impact question?
- Does the question (or statutory or regulatory document) contain quantitative criteria to which results can be compared?

**Measurement-Related Criteria**

- At what level of measurement were the variables measured: nominal (e.g., gender), ordinal (e.g., attitudes measured with Likert-type scales), or interval (e.g., income)?
- Were multiple indicators used to measure key variables?
- What are the sample sizes in pertinent subgroups?
- How many observations were recorded for the respondents: one, two, or more (time series)?
- Are the samples independent or related? That is, was the sample measured at two or more points in time (related)?
- What is the distribution of each of the variables of interest, such as bimodal or normal?
- How much precision was incorporated in the measures?
- Are there outliers affecting calculation of statistics, that is, extremely high or low values that skew the mean and other statistics?

**Audience-Related Criteria**

- Will the audience understand sophisticated analytical techniques such as multiple regression?
- Will graphic presentations of data (such as bar charts) be more appropriate than tables filled with numbers?
- How much precision does the audience want in numerical estimates?
- Will the audience be satisfied with graphs depicting trends or desire more sophisticated analyses such as regressions?
- Will the audience understand the difference between statistical significance and the practical importance of numerical findings?

When evaluators address impact questions and want to select a technique to estimate or predict program impact, they need to look at the level of measurement (nominal, ordinal, or interval) to make this decision. The best and most common technique used for nominal data is a contingency table that displays frequency counts. Contingency tables and frequency distributions are also the best option for analysis of ordinal data. Evaluators have the widest range of choices for techniques to analyze interval data. When they want to explain an effect (dependent variable) by other variables, they often use regression analysis.

When multiple indicators are used to measure a program effect, evaluators can use two basic strategies to sort the measures or units and reduce the data to a smaller number of factors. The strategy to use when measures are pre-set is to aggregate the different measures, weight them, and sum them up. The other strategy is to use analytical techniques to identify patterns in the measures, e.g., factor analysis that finds groupings among variables to reduce the number of factors. Discriminant function analysis is a way to sort the units of study by high and low performance and then to look at what other characteristics predict levels of performance. Cluster analysis can also be used to identify similar groupings among participants or units in a study.
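A brief sketch of the pattern-finding strategy, assuming Python with scikit-learn; the participant scores are randomly generated stand-ins for real indicator data, so the groupings themselves are not meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Invented scores for 100 participants on six outcome indicators.
scores = rng.normal(size=(100, 6))

# Factor analysis: reduce six indicators to two underlying factors.
factors = FactorAnalysis(n_components=2).fit_transform(scores)

# Cluster analysis: group participants with similar factor profiles.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(factors)
print(factors.shape, np.bincount(labels))
```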

Evaluators must consider other criteria when selecting a statistical technique: sample size (is it too small to demonstrate an effect?), number of observations recorded (e.g., with two or more observations, evaluators can analyze change over time), distribution of the units along each variable (is the sample range wide enough for study?), and level of precision of the data (could respondents make the fine distinctions asked for in the data or not?). Evaluators must also decide how to handle outliers in the data and make sure the statistics will be accessible to their audience.

**Reporting Statistics**

When reporting statistical results, clarity is essential. Some tips for presenting data analysis are:

- identify the contents of all tables and figures clearly
- indicate the decision rules that were made in the analysis
- consolidate the analysis whenever possible
- do not abbreviate
- provide basic information about how variables were measured
- present appropriate percentages, not raw data
- present information on statistical significance and the magnitude of relationships clearly
- report any threats to the statistical information
- use user-friendly graphics to present analytical findings clearly

Statistics never speak for themselves; evaluators must take great care to ensure that they speak with statistics accurately and clearly.

**Using Regression Analysis**

Correlation and regression are powerful tools that are frequently used in evaluation and applied research. Regression analysis is used to describe relationships, test theories, and make predictions with data from experimental or observational studies, linear or nonlinear relationships, and continuous or categorical predictors. The user must select specific regression models that are appropriate to the data and research questions.

Many practical questions involve the relationship between a dependent variable (Y) and a set of independent variables (X1, X2, X3, etc.) where scores are measured for N cases. For example, a study might be designed to predict performance (Y) using information on years of experience (X1), an aptitude test (X2), and participation in a training program (X3). The multiple regression equation calculates the predicted Y value for individual cases. The correlation between observed Y value and predicted Y value is called the multiple correlation coefficient, R.

Using this example, regression models can be designed to address these types of questions and more (a sketch addressing the second question follows the list):

- Can performance be predicted better than chance using the regression equation?
- Does the training program improve our ability to predict performance, or can we do as well with only the first two predictors?
- Could we improve prediction by including an additional variable?
- Is the relationship between performance and years of experience linear, or is the relationship curvilinear?
- Is the relationship between aptitude and performance stronger or weaker for people who participated in the training program?
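A minimal sketch of the second question, comparing a full model against a reduced model without the training variable, assuming Python with statsmodels; the data for experience (x1), aptitude (x2), training (x3), and performance (y) are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 120

# Invented predictors: years of experience (x1), aptitude test (x2),
# training participation (x3), and a performance outcome (y).
x1 = rng.uniform(0, 20, n)
x2 = rng.normal(50, 10, n)
x3 = rng.integers(0, 2, n).astype(float)
y = 2.0 * x1 + 0.5 * x2 + 5.0 * x3 + rng.normal(0, 5, n)

# Full model with all three predictors.
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()

# Reduced model without the training variable.
reduced = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Does adding the training program improve prediction?
print(f"R² full = {full.rsquared:.3f}, R² reduced = {reduced.rsquared:.3f}")
print(full.compare_f_test(reduced))  # F statistic, p-value, df difference
```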

**Comparing Two Groups**

Regression analysis can be used to compare two groups. For example, it can test the difference between the mean posttest scores of two groups of students who received different training. The two groups are nominal points on the X axis, and the posttest score is measured on the Y axis, with the data for both groups plotted as scatter points and a regression line that intersects the two group means and represents the predicted posttest score. The regression coefficient is the difference between the two means. A more powerful research design includes pretest measures; in this case, regression analysis uses the pretest scores to predict the posttest scores. Posttest scores above the regression line are better than predicted, and those below show lower-than-predicted performance on the posttest. Regression analysis provides measures of statistical significance for the variables.
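A minimal sketch of this pretest-adjusted comparison, assuming Python with statsmodels; the scores and group assignments are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 80

# Invented pretest scores and a 0/1 indicator for which training group
# each student was in; group 1 gets a boost on the posttest.
pretest = rng.normal(70, 8, n)
group = rng.integers(0, 2, n).astype(float)
posttest = 10 + 0.8 * pretest + 4.0 * group + rng.normal(0, 4, n)

# Regressing posttest on group alone tests the difference in group means;
# adding the pretest adjusts that comparison for where students started.
X = sm.add_constant(np.column_stack([group, pretest]))
fit = sm.OLS(posttest, X).fit()
print(f"Adjusted group difference: {fit.params[1]:.2f} "
      f"(p = {fit.pvalues[1]:.4f})")
```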

**Mediation Analysis with Regression**

Regression can be used to describe and test conceptual models of how a program works, providing a useful framework for examining and improving key components of a program. A simple test of this type would focus on the causal relationship between the level of program implementation (X) and a certain outcome (Y). A more complicated model and one that provides more information about how the program produces its effect is a study where the program (X) has an effect on an intervening mediator variable (M), which in turn has an impact on the outcome (Y). For example, the goal of a school drug prevention program is to reduce the intention of adolescents to use marijuana. The program is presumed to increase knowledge about the effects of marijuana, which in turn is presumed to decrease intention to use marijuana. Evaluators can use regression to test the validity of this model. Results of a regression analysis can be presented in visual form or as a table with key numbers from the analysis listed.
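A minimal sketch of this X → M → Y test, following the structure of the prevention example above, assuming Python with statsmodels; the data are simulated so the mediation pattern is built in.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200

# Simulated data for the model described above: program participation (X)
# raises knowledge (M), which in turn lowers intention to use (Y).
x = rng.integers(0, 2, n).astype(float)   # program participation
m = 2.0 * x + rng.normal(0, 1, n)         # knowledge (mediator)
y = -1.5 * m + rng.normal(0, 1, n)        # intention to use

# Step 1: total effect of the program on the outcome.
total = sm.OLS(y, sm.add_constant(x)).fit()
# Step 2: effect of the program on the mediator.
a_path = sm.OLS(m, sm.add_constant(x)).fit()
# Step 3: outcome on both program and mediator; if the program's direct
# effect shrinks toward zero here, the effect runs through the mediator.
b_path = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()

print(f"total effect = {total.params[1]:.2f}")
print(f"direct effect controlling for M = {b_path.params[1]:.2f}")
print(f"indirect effect (a*b) = {a_path.params[1] * b_path.params[2]:.2f}")
```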

**Other Issues**

Some special concepts that are likely to be useful in evaluation applications of regression analysis are:

- Categorical variables. These types of variables, e.g., religion or ethnicity, need to be handled in a special way (typically recoded as dummy variables) in order to be used in regression analysis.
- Correlation and causation. Inferring causality in correlational studies requires strong knowledge or assumptions about relationships.
- Multicollinearity. If a certain predictor can itself be predicted very well by the other predictors in the model, the problem is called multicollinearity. If two predictors are highly correlated, it may be desirable to eliminate one or make a composite of the two.
- Interactions. If two predictor variables interact, then the relationship between one predictor and the dependent variable is conditional on the level of the other predictor.
- Centering continuous predictor variables. The interpretation of regression coefficients can often be improved by centering predictor variables, i.e., subtracting the mean from the variable for each case (see the sketch after this list).
- Nonlinear relationships. If the relationship between variables is nonlinear, a straight-line model will misrepresent it; transformations or added polynomial terms may be needed.
- Outliers. Extreme scores can distort results of a model, especially with a small sample size. Outliers may be errors that can be corrected or cases that can be omitted for separate analysis.
- Missing data. Missing data can cause problems with regression analysis. There are some ways to deal with this, but the best strategy is to minimize missing data.
- Power analysis and sample size. Low power of a statistical test is a problem in much of the research on program effects. Power can be increased by increasing sample size, reducing error (better measures or better statistical control), increasing effect size (more powerful treatments), or relaxing the alpha error rate.
- Stepwise versus hierarchical selection of variables. Some statistical programs have a stepwise regression option that selects out the best predictors from a larger set of potential predictors. It is preferable not to use this option, but to have the researcher determine the order of entering the variables and be the one to choose the best predictors.
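A minimal sketch of centering and testing an interaction, assuming Python with statsmodels; the predictors and outcome are simulated with an interaction effect built in.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 150

# Invented predictors with an interaction effect on the outcome.
x1 = rng.normal(10, 2, n)
x2 = rng.normal(5, 1, n)
y = 1.0 * x1 + 2.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1, n)

# Center the predictors so the main-effect coefficients are interpreted
# at the mean of the other predictor rather than at zero.
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()

X = sm.add_constant(np.column_stack([x1c, x2c, x1c * x2c]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # [intercept, x1 effect, x2 effect, interaction]
```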

In conclusion, be thoughtful rather than mechanical with data analysis. A big advantage people have over the computer is that they can ask, “Does this make sense?” Do not lose this advantage. Get close to your data.

**Books**

- Wholey, J., Hatry, H., & Newcomer, K. (Eds.). (2004). *Handbook of practical program evaluation.* San Francisco, CA: Jossey-Bass.

Note: This book is available in the Professional Growth Center from Sarah Phinney. Many other statistics books exist as well.