| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing (Total) | Pcnt Endorsing (Low High) | Point Biser. | Key |
|---|---|---|---|---|---|---|---|
| 86 | .00 | .14 | A | 86 | 100 100 | .14 | * |
| | | | B | 14 | 0 0 | -.21 | |
| | | | C | | 0 0 | | |
| | | | D | | 0 0 | | |
- Item 1. This item was too easy (proportion correct=.86). It did not discriminate well (point biserial correlation=.14 and discrimination index=.00, both unacceptably low). The key and distractors did not form a very good set: most examinees identified the key correctly, distractors C and D attracted no examinees, and distractor B attracted only 14% of candidates.
Item 2
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing (Total) | Pcnt Endorsing (Low High) | Point Biser. | Key |
|---|---|---|---|---|---|---|---|
| 43 | .67 | .69 | A | | 0 0 | | |
| | | | B | 43 | 0 67 | .69 | * |
| | | | C | 43 | 50 33 | -.60 | |
| | | | D | 14 | 50 0 | -.26 | |
- Item 2. This was a relatively difficult item (proportion correct=.43). It discriminated well (point biserial correlation=.69 and discrimination index=.67, above the acceptable range). Between them, distractors C and D attracted more examinees than did the key, B. One of the distractors (A) did not attract any examinees, which suggests it was an implausible distractor.
Item 3
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing (Total) | Pcnt Endorsing (Low High) | Point Biser. | Key |
|---|---|---|---|---|---|---|---|
| 14 | -.50 | -.80 | A | 14 | 0 33 | .39 | |
| | | | B | 14 | 50 0 | -.80 | * |
| | | | C | 43 | 0 33 | .24 | |
| | | | D | 29 | 50 33 | -.09 | |
- Item 3. This was a very difficult item (proportion correct=.14), which discriminated poorly (point biserial correlation=-.80 and discrimination index=-.50). More examinees chose distractor C (43%, point biserial=.24) than chose the key, B (14%, point biserial=-.80), with some also choosing A and D. However, it seems to have been the stronger candidates who were misled, so the item does not appear to work well.
Item 4
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing (Total) | Pcnt Endorsing (Low High) | Point Biser. | Key |
|---|---|---|---|---|---|---|---|
| 43 | .67 | .50 | A | 14 | 50 0 | -.26 | |
| | | | B | 29 | 0 33 | .14 | |
| | | | C | 14 | 50 0 | -.80 | |
| | | | D | 43 | 0 67 | .50 | * |
- Item 4. This was a relatively difficult item (proportion correct=.43), which discriminated quite well (point biserial correlation=.50 and discrimination index=.67, above the acceptable range). Each distractor attracted some candidates.
Item 5
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing (Total) | Pcnt Endorsing (Low High) | Point Biser. | Key |
|---|---|---|---|---|---|---|---|
| 14 | .00 | -.21 | A | 14 | 0 0 | -.21 | |
| | | | B | 43 | 0 100 | .76 | * |
| | | | C | 29 | 100 0 | -.79 | |
| | | | D | 14 | 0 0 | .04 | |
- Item 5. This was a very difficult item (proportion correct=.14). It did not discriminate well (point biserial correlation=-.21 and discrimination index=.00, below the acceptable range). More examinees chose option B (43%, point biserial=.76) than chose the keyed option, A (14%, point biserial=-.21). A was specified as the correct answer, but B seems to work better. Option C attracted a few candidates from the lower scoring group (point biserial=-.79), while D was a relatively weak distractor (point biserial=.04).
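As a quick check on the five tables above, the discrimination index of each item can be recovered from the two figures in the "Endorsing" column of the keyed alternative. This is only a sketch: it assumes that those two figures are the percentages of the low-scoring and high-scoring groups, in that order, and the names `key_endorsing` and `discrimination_index` are illustrative, not part of any item analysis package.

```python
# (low-group %, high-group %) endorsing the keyed alternative,
# copied from the tables for Items 1 to 5 (assumed order: low, high).
key_endorsing = {
    1: (100, 100),  # key A
    2: (0, 67),     # key B
    3: (50, 0),     # key B
    4: (0, 67),     # key D
    5: (0, 0),      # key A as specified
}


def discrimination_index(low_pct, high_pct):
    """High-group minus low-group proportion choosing the key."""
    return (high_pct - low_pct) / 100


for item, (low, high) in key_endorsing.items():
    print(f"Item {item}: D = {discrimination_index(low, high):+.2f}")
# Item 1: D = +0.00
# Item 2: D = +0.67
# Item 3: D = -0.50
# Item 4: D = +0.67
# Item 5: D = +0.00
```

The results agree with the discrimination values reported in the tables, which supports reading the two figures as low-group and high-group percentages.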
Interpreting p-values, discrimination values or PBCC values in relation to the response alternatives and the total exam depends on one's understanding of, and experience with, how these statistics operate together in the context of the entire exam. For example, p-values or PBCC values for different test items on the same exam may be identical but for entirely different reasons. This tends to be the case when the same values are located at opposite ends of a lengthy exam.
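For readers who want to see how the three item statistics relate to raw responses, here is a minimal sketch. It rests on several assumptions that are not taken from the article: the point biserial is computed against the total score including the item itself, the low and high groups are the bottom and top 27% of examinees, and ties in total score are broken by the order of the input. The response data and all names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no spread."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_statistics(answers, answer_key, item, group_share=0.27):
    """Proportion correct, discrimination index and point biserial for one item."""
    scored = [[int(a == k) for a, k in zip(row, answer_key)] for row in answers]
    totals = [sum(row) for row in scored]
    flags = [row[item] for row in scored]  # 1 = correct, 0 = incorrect
    order = sorted(range(len(answers)), key=lambda i: totals[i])
    n = max(1, round(group_share * len(answers)))  # size of each extreme group
    low, high = order[:n], order[-n:]
    return {
        "p": mean(flags),
        "disc": mean(flags[i] for i in high) - mean(flags[i] for i in low),
        "pbis": pearson(flags, totals),
    }


# Hypothetical data: six examinees, three items, one letter per item.
answers = ["ABD", "ABD", "ACD", "BBC", "BCA", "CDA"]
answer_key = "ABD"
for item in range(len(answer_key)):
    print(item + 1, item_statistics(answers, answer_key, item))
```

With so few examinees the statistics are unstable, which is one reason the same p-value or PBCC value can arise for quite different reasons.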
Skewness
- Given no "pre-testing" or piloting procedure (i.e., same level or final exam apprentice scores are used as pilot data after the fact), the probability of obtaining a symmetrical distribution of scores is uncertain.
- Skewness (lopsidedness) of scores is likely. Marked skewness to the left (a tail toward the low end of the scale) tends to go with rather easy items. Marked skewness to the right (a tail toward the high end of the scale) tends to go with rather difficult items.
- The PBCC doesn't mean much if test items are highly skewed.
- A more symmetrical distribution is required; else one group of test takers may receive an easy exam whereas another receives a more difficult one.
- Skewness tends to decrease as the number of test items increases, regardless of the p-values (i.e., difficulty values) obtained.
- The Pearsonian coefficient of skewness deals with the distribution of scores. A coefficient of zero or near zero says that the test scores are symmetrically distributed and that the test is testing for the intended attribute. The formula is given by SK=3(mean-median)/SD.
- Using 78.143 as the average, 80 as the median and 9.478 as the standard deviation, we get: SK=3(78.143-80)/9.478; SK=-.588, or about -.6. Hence, the test is negatively skewed: an easy test.
- Given satisfactory reliability, a near symmetrical distribution of a test brings an item analysis to closure.
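The skewness calculation above can be reproduced in a few lines. This is a small sketch of the Pearsonian formula SK=3(mean-median)/SD using the figures quoted in the list; the function name is illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / SD."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / sd


sk = pearson_skewness(mean=78.143, median=80, sd=9.478)
print(round(sk, 3))  # -0.588

# A negative value means the tail points toward the low scores,
# i.e. most examinees scored high (an easy test).
print("negatively skewed" if sk < 0 else "not negatively skewed")
```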
Establishing the Reliability of Test
- Reliability is the extent to which a trade test provides consistent, stable and dependable results. It’s an overall indicator of the quality of a test.
- The goal is to achieve high reliability and a symmetrical distribution of items from test to test.
- The primary reliability coefficient (such as Cronbach's Alpha, the Spearman-Brown "split-half" correlation, or Iteman’s Kuder-Richardson 20 alpha coefficient) is based on the concept of "rational equivalence", which stresses the consistency of subjects' responses to all items on the test and thus provides a measure of the internal consistency of the test.
- The coefficient of internal consistency shows consistency of performance on different parts or items of the test taken at a single sitting, usually computed by item analysis (e.g., Alpha coefficient). The coefficient can range from 0.0 to 1.0.
- The higher the coefficient the better the internal consistency.
- High reliability is not so much a concern for groups but is vital for individual assessment and certification purposes. For making decisions about individuals, a reliability coefficient of .85 or better is required. Generally, coefficients between .65 and .80 are fine for educational research or groups, and less than .50 is too low. A test with such low coefficients may not be testing the attribute for which it was designed, say sheet metal work. Assuming test items are valid, low reliability means that you have to increase the number of items on the exam because the range of items is limited. In addition, you can try adjusting item difficulties to get a larger spread.
- Stability coefficients (test-retest reliability) establish the consistency of performance on a test over a period of time. The Pearson Product Moment (PPM) coefficient can be used for this purpose. In these coefficients, the instability is traceable to the test itself, or parts of it. PPM requires two administrations: the initial test and the retest.
- PPM correlation classifications for interpreting the size or strength of correlations are as follows: .01-.09 is negligible; .10-.29 is considered low; .30-.49 is moderate; .50-.69 is substantial and .70 or more is very strong. These are useful for examining construct validity but more relevant for the purposes of test-retest reliability/stability.
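To make the internal consistency coefficient and the correlation bands above more concrete, here is a hedged sketch. The `kr20` function uses the standard Kuder-Richardson 20 formula with population variances, which may differ slightly from what a given item analysis program reports. The `correlation_strength` function applies the bands listed above to the absolute value of the coefficient, treating .10, .30, .50 and .70 as the cut points; both of those choices are assumptions. The scored data are hypothetical.

```python
from statistics import mean, pvariance


def kr20(scored):
    """KR-20 for a matrix of 0/1 item scores (rows = examinees)."""
    k = len(scored[0])
    totals = [sum(row) for row in scored]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        raise ValueError("need at least two items and some score variance")
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in scored)  # proportion correct on item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def correlation_strength(r):
    """Label a correlation using the bands quoted in the list above."""
    size = abs(r)
    if size >= 0.70:
        return "very strong"
    if size >= 0.50:
        return "substantial"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "low"
    return "negligible"


# Hypothetical scored responses: six examinees, three items.
scored = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(round(kr20(scored), 3))
print(correlation_strength(0.55))  # substantial
```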
Establishing the Length of Trade Test
- Note that discrimination index/PBCC values increase as the number of test questions increases, regardless of the ability of an individual test question to discriminate between good and poor subjects. In other words, reliability is directly related to the number of items on the exam.
- The reliability coefficient for a 25-item test cannot be compared directly with the same coefficient for a 100-item test. Short form tests can expect drops in reliability in spite of retaining the best items. However, this can be a modest sacrifice in exchange for substantial reduction in the examinees’ response burden. Validity of an exam may become a concern as reliability decreases. The objective is to maintain very high reliability (.85 or better) while maintaining content validity. Knowing something about the population sample and content validity for each test under consideration is important.
- A coefficient of .5 is satisfactory for tests that have 10-15 items, and .8 for 50-item tests. For lengthier exams, .85 or better is still important for making decisions about individuals and certification. Note that reliability increases as the number of test items on an exam increases.
- As length of test increases, so does the standard error of measurement (SEM). So, for a 90 item test you may have a SEM of 4.2 and for a 25 item test you may get a SEM of 2.
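The link between test length and reliability described above is usually expressed with the Spearman-Brown prophecy formula. The article does not spell that formula out in this list, so the sketch below is offered as an assumption about the relationship being described. It also assumes that added items are comparable in quality to the existing ones, and the starting reliability of .50 is purely illustrative.

```python
def predicted_reliability(current, length_factor):
    """Spearman-Brown prophecy: reliability after changing test length.

    length_factor is new length / old length (2 = twice as long,
    0.5 = half as long).
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * current) / (1 + (length_factor - 1) * current)


# Hypothetical short test with reliability .50, made four times longer.
print(round(predicted_reliability(0.50, 4), 2))  # 0.8

# Shortening a test lowers the predicted reliability.
print(round(predicted_reliability(0.85, 0.25), 2))
```

This is why a coefficient from a 25-item test should not be compared directly with one from a 100-item test.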
Standard Error of Measurement (SEM)
- Another measure of test reliability is the Standard Error of Measurement (SEM). The SEM also reflects to some extent the reliability of this exam. It attempts to estimate error involved in measuring a specific test taker’s observed (raw) grade with a specific exam. Theoretically, the raw score should lie within 1 SEM of the test taker’s “true” score more than 2/3 of the time.
- The sources for SEM are many, and can include differences in test administration conditions, factors related to the examinee such as illness, fatigue, and misreading, and/or psychometric properties of the test itself.
- What helps to correct these sources for error? Quality test items, easy to understand test instructions, and following closely the prescribed procedures for administering trades examinations help reduce measurement error.
- Again, its most practical use arises out of the need for interpreting the individual test taker's examination results. After all, confronted with decisions about a level placement into a trade program or T.Q., Provincial, and Interprovincial examination challenges on the basis of past in-school success and/or experience, Apprenticeship Staff need to know how much confidence can be placed in the test taker's score.
- As is often the case, retesting someone a hundred times is impracticable. In such cases, decision-makers attempt to estimate the probable limits between which the individual's true test score will fall after only one administration. Given high reliability to begin with, SEM is used to set a range of scores within probable limits of the area under the normal curve (e.g., about 2 times out of 3, or roughly 68% of the time, scores will lie within so many raw-score points given 1 SEM; about 95% of the time they will lie within so many raw-score points given 2 SEM; and up to 99.73% of the time given 3 SEM). As certainty increases, so does the width of the score band. That’s why 1 SEM is recommended in our calculations.
- High reliability is especially important when making decisions about individuals and certification. For example, using a weighing scale to assess a person's weight may be a valid and appropriate measure, but if one reading puts you at 95 pounds whereas a second puts you at 112 pounds, then one would quickly question its accuracy and consistency. In other words, someone retaking the same Interprovincial exam, say two months later, could have a huge shift in his or her scores even if their ability level does not change substantially. The 17-point margin of error, representing more than 12 percent of the exam (assuming it to be 0 to 135), means there is both a high false-pass rate and a high false-failure rate. For example, a person who received a score of 70 on the exam may have scored an 87 or a 53 simply because of the unreliability of the test.
- The SEM formula is given by SEM=SD x the square root of (1-r), where r is the reliability coefficient of the test.
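A short sketch of the SEM calculation follows, reading the formula as SD times the square root of one minus the reliability coefficient. The standard deviation of 9.478 is borrowed from the skewness example earlier in the article, while the reliability of .85 and the observed score of 70 are illustrative assumptions. The function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sem, n_sem=1):
    """Range of observed score plus or minus n_sem standard errors."""
    return observed - n_sem * sem, observed + n_sem * sem


sem = standard_error_of_measurement(sd=9.478, reliability=0.85)
print(round(sem, 2))  # 3.67

low, high = score_band(observed=70, sem=sem, n_sem=1)
print(round(low, 1), round(high, 1))  # 66.3 73.7
```

With 1 SEM the band covers roughly two thirds of cases; passing `n_sem=2` widens it to cover about 95%.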
Improving test reliability
- Increasing test length increases reliability but it does not guarantee efficiency. Improving test reliability requires much time and effort, especially if coefficients fall below .60. For example, if the coefficient of a 140-item test is found to be .50, you will need to lengthen this test 5.7 times (140 x 5.7 or 798 items) to get a reliability coefficient of .85. Such a lengthy test is impracticable to administer. Anything below .40 confirms the test items have little in common with the attribute being measured. In this case, the test constructor's measurement problem should be reconsidered, and a new, more explicit examination (test) plan is in order. Note that a reliability coefficient is satisfactory only within the limits of the time and resources provided to improve the test.
- The formula is given by:
Number of times longer = [(the reliability you want) x (1 - the reliability you got)] / [(the reliability you got) x (1 - the reliability you want)]
- Substituting your own data into the formula, you might get:
[.90 x (1-.80)] / [.80 x (1-.90)] = .18/.08 = 2.25 times longer
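The lengthening formula above translates directly into code. This is a minimal sketch using the article's own example of moving from a reliability of .80 to .90; the function name is illustrative.

```python
def lengthening_factor(current, target):
    """How many times longer a test must be to reach the target reliability.

    Implements (target * (1 - current)) / (current * (1 - target)).
    """
    if not (0 < current < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target * (1 - current)) / (current * (1 - target))


factor = lengthening_factor(current=0.80, target=0.90)
print(round(factor, 2))  # 2.25
```

A factor below 1 would mean the test could be shortened and still meet the target.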
Recommendations
As you can see, interpretation of all topics and point estimates is limited to one's understanding and experience with these topics. Here, I presented some of the basic concepts perceived to be important or useful to one's "toolbox" in dealing with item analysis. You are encouraged to read on the topic, take formal courses, and attend workshops or seminars. With experience and time, all of these guidelines can add to your proper understanding, interpretation and communication of the meaning and uses of item analysis.
Author: Ihor Cap, Ph.D.
You can reprint materials, published in articlesandblogs.ezreklama.com, only if you cite the author of the work and if you provide a direct link to our site. The http://articlesandblogs.ezreklama.com website and services are provided by EZREKLAMA (Manitoba, Canada). The views expressed in the articles, blogs and press releases appearing on this site are those of the writer(s) and do not necessarily reflect the views of EZREKLAMAs' editors and network members. The Editorial staff is entitled to edit the materials.

