Return to main CEDA-L Archive Page

judge variation



>"Z-score" is indeed a statistical procedure that attempts to 
>normalize for variations in sample mean and sample standard 
>deviation.  In my software, it is encoded using the following formula.

Let's see... having just discussed z-scores the last few weeks in my 
stats class :), I want to ask a few questions:

1) Given the susceptibility of the std. dev. to extreme values, 
"penalty" points (e.g., the 20 I awarded once at UNI) would seem to have 
an inordinate effect on scores... wouldn't they?
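
To put rough numbers on point 1, here is a minimal sketch in Python. The 
ballots are made up, and the z_scores helper is my own illustration 
(standardizing against the judge's own mean and sample std. dev.), not 
the formula from the software quoted above.

```python
from statistics import mean, stdev

def z_scores(points):
    """Standardize one judge's points against that judge's own mean and sample std. dev."""
    m = mean(points)
    s = stdev(points)  # sample standard deviation (n - 1 denominator)
    return [(p - m) / s for p in points]

# Hypothetical ballots from one judge: six scores over three rounds.
typical = [27, 28, 28, 29, 27, 28]
# Same judge, but the last score is replaced by a 20-point penalty.
with_penalty = [27, 28, 28, 29, 27, 20]

print("std. dev. without penalty:", round(stdev(typical), 2))
print("std. dev. with penalty:   ", round(stdev(with_penalty), 2))
print("z of the 29 without penalty:", round(z_scores(typical)[3], 2))
print("z of the 29 with penalty:   ", round(z_scores(with_penalty)[3], 2))
```

With these invented numbers the std. dev. goes from about 0.75 to about 
3.3, and the debater who earned the 29 drops from roughly 1.5 to under 
0.8 std. devs. above the judge's mean, purely because of someone else's 
penalty.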

2) The sample size for some judges would be 6 scores (3 rounds) or maybe 
even fewer. Such a small sample seems unlikely to be particularly 
reliable. I realize that Gary is also using the pop. mean & std. dev. to 
mitigate extreme effects, but IMO this only camouflages the problem.
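
A quick simulation gives a feel for how shaky a 6-score std. dev. is. 
This is only a sketch under assumptions of my own: scores are drawn from 
a normal distribution with mean 27.5 and std. dev. 1.0 (which ignores 
the skew question in point 3), and stdev_spread is a hypothetical helper 
name, not anything from the software under discussion.

```python
import random
from statistics import stdev

def stdev_spread(n, trials=2000, seed=0):
    """Return the 5th and 95th percentile of the sample std. dev. over many
    simulated judges who each give n scores. The true std. dev. is 1.0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    estimates = sorted(
        stdev([rng.gauss(27.5, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return estimates[int(0.05 * trials)], estimates[int(0.95 * trials)]

small = stdev_spread(6)    # a judge seen in 3 rounds
large = stdev_spread(60)   # a judge tracked over many tournaments

print("n = 6:  5%-95% range of estimated std. dev.:", small)
print("n = 60: 5%-95% range of estimated std. dev.:", large)
```

The n = 6 range should come out several times wider than the n = 60 
range, which means two judges with identical habits can easily look 
quite different on three rounds of ballots.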

3) And, wouldn't speaker points be negatively skewed (since most people's 
mean will be at least 26 & the max is 30)? Z-scores don't have the same 
meaning if the distribution is not normal...
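
For what it's worth, the skew is easy to check on any set of ballots. 
The sketch below uses the ordinary moment-based skewness coefficient 
(third central moment over the 1.5 power of the second); both data sets 
are invented for illustration, not real tournament results.

```python
from statistics import mean

def skewness(points):
    """Moment-based sample skewness: m3 / m2**1.5 (negative = long left tail)."""
    m = mean(points)
    n = len(points)
    m2 = sum((p - m) ** 2 for p in points) / n  # second central moment
    m3 = sum((p - m) ** 3 for p in points) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical speaker points bunched near the 30-point ceiling,
# with a couple of low scores trailing off to the left.
capped = [30, 29, 29, 29, 28, 28, 28, 28, 27, 27, 26, 24]
# A symmetric set for comparison.
symmetric = [26, 27, 28, 29, 30]

print("skewness, capped points:   ", skewness(capped))
print("skewness, symmetric points:", skewness(symmetric))
```

The capped set comes out negative and the symmetric set comes out at 
zero, which is the pattern I would expect from real points if the 
ceiling at 30 matters.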

I dunno... I think controlling for judge variation is a nice idea, but I 
don't think z-scores or any other statistical measure will do it, given 
the available sample sizes. Perhaps someone could keep files on judges 
across tournaments, which could then be consulted electronically to 
provide better samples... but I'm not volunteering.

My answer: ditch the 30-point scale. Go to a 100-point, letter-grade 
scale. There would still be a good deal of variation, but at least the 
range would be less compressed than the current 27-30, giving us more 
power to discern differences between good debaters. IMO, this would 
allow high-low 
points to be a better measure of achievement than they currently are.

-- Glenn


Archive created by Jonathan Stanton (jonathan@cs.jhu.edu)