
Definition of Reliability

The term reliability in psychological research refers to the consistency of a research study or measuring test. For example, if a person weighs themselves during the course of a day, they would expect to see a similar reading each time. Scales that measured weight differently each time would be of little use.

Reliability in Research


Reliability, like validity, is a way of assessing the quality of the measurement procedure used to collect data in a dissertation. In order for the results from a study to be considered valid, the measurement procedure must first be reliable.

When we examine a construct in a study, we choose one of a number of possible ways to measure that construct [see the section on Constructs in quantitative research if you are unsure what constructs are, or the difference between constructs and variables].

For example, we may choose to use questionnaire items, interview questions, and so forth. These questionnaire items or interview questions are part of the measurement procedure.

This measurement procedure should provide an accurate representation of the construct it is measuring if it is to be considered valid. For example, if we want to measure the construct intelligence, we need a measurement procedure that accurately measures a person's intelligence. Since there are many ways of thinking about intelligence (e.g., as IQ or as emotional intelligence), we must be clear which conception of intelligence our measurement procedure reflects.

In quantitative research, the measurement procedure consists of variables, whether a single variable or a number of variables that together make up a construct [see the section on Constructs in quantitative research].

When we think about the reliability of these variables, we want to know how stable or constant they are. This assumption, that the variable you are measuring is stable or constant, is central to the concept of reliability. In principle, a measurement procedure that is stable or constant should produce the same or nearly the same results if the same individuals and conditions are used.

So what do we mean when we say that a measurement procedure is constant or stable?

Inter-Rater or Inter-Observer Reliability

If your measurement consists of categories -- the raters are checking off which category each observation falls into -- you can calculate the percent of agreement between the raters.

For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In that case, the percent of agreement would be 86%.

OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation. The other major way to estimate inter-rater reliability is appropriate when the measure is continuous. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale.
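
To make this concrete, here is a minimal Python sketch of both estimates. The rating data are invented for illustration, and the Pearson correlation comes from the standard library's statistics.correlation (Python 3.10+).

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Categorical case: percent agreement between two raters.
# Hypothetical category labels for the same ten observations.
rater_a = ["calm", "active", "calm", "rowdy", "active",
           "calm", "rowdy", "active", "calm", "active"]
rater_b = ["calm", "active", "rowdy", "rowdy", "active",
           "calm", "rowdy", "calm", "calm", "active"]

matches = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Percent agreement: {100 * matches / len(rater_a):.0f}%")  # 80%

# Continuous case: two observers rate classroom activity on a
# 1-to-7 scale at the same time points; the correlation between
# their ratings is the inter-rater reliability estimate.
observer_1 = [3, 4, 5, 2, 6, 4, 3, 5]
observer_2 = [3, 5, 5, 2, 6, 3, 4, 5]
print(f"Inter-rater correlation: {correlation(observer_1, observer_2):.2f}")
```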

In practice, you could have them give their ratings at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability, or consistency, between the raters. You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where, every morning, a nurse had to do a ten-item rating of each patient on the unit.

Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would review all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item.

Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

We estimate test-retest reliability when we administer the same test to the same sample on two different occasions.

This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions.

The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time -- the closer in time we get, the more similar the factors that contribute to error.
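
As a minimal sketch (with made-up scores for five people tested a few weeks apart), the estimate itself is simply the correlation between the two administrations:

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Hypothetical scores for the same five people, tested twice.
time_1 = [24, 31, 18, 27, 22]
time_2 = [26, 30, 17, 29, 21]

# The correlation between the two occasions is the test-retest
# reliability estimate; it will tend to shrink as the interval
# between the two administrations grows.
print(f"Test-retest reliability: {correlation(time_1, time_2):.2f}")
```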

Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

In parallel-forms reliability, you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets, as in the sketch below.
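
A minimal sketch of that random split, using a hypothetical 20-item pool:

```python
import random

# Hypothetical pool of questionnaire items that all address the
# same construct (e.g., self-esteem).
item_pool = [f"item_{i:02d}" for i in range(1, 21)]

# Shuffle the pool, then divide it into two halves that serve as
# the two parallel forms.
random.shuffle(item_pool)
half = len(item_pool) // 2
form_a, form_b = item_pool[:half], item_pool[half:]

print("Form A:", sorted(form_a))
print("Form B:", sorted(form_b))
```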

You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. This is often no easy feat. Furthermore, this approach assumes that the randomly divided halves are parallel or equivalent; even by chance, this will sometimes not be the case.

Internal Reliability and Personality Tests

To be reliable, an inventory measuring self-esteem should give the same result if given twice to the same person within a short period of time.

IQ tests should not give different results over time, as intelligence is assumed to be a stable characteristic.

Validity refers to the credibility or believability of the research. Are the findings genuine? Is hand strength a valid measure of intelligence? Almost certainly the answer is "No, it is not." The answer depends on the amount of research support for such a relationship.

Internal validity - the instruments or procedures used in the research measured what they were supposed to measure.

As part of a stress experiment, people are shown photos of war atrocities. After the study, they are asked how the pictures made them feel, and they respond that the pictures were very upsetting.

In this study, the photos have good internal validity as stress producers.

External validity - the results can be generalized beyond the immediate study. Research conducted in naturalistic settings, for example, tends to have high external validity; however, the presence of so many uncontrolled variables may lead to low internal validity, in that we can't be sure which variables are affecting the observed behaviors.
