Statistical Analysis on Scoring Bias | by Alexander Barriga | Oct, 2024

October 1, 2024

Reading the judges’ names in the column headers from left to right, from Jimena Hoffner to Noelia Barsel, you’ll see that:

The 1st-5th and 11th-15th judges belong to what we will denote as panel 1.
The 6th-10th and 16th-20th judges belong to what we will denote as panel 2.

Notice anything? Dancers that were judged by panel 2 show up in a much larger proportion than dancers that were judged by panel 1. If you scroll through the PDF of this data table, you’ll see that this difference in proportions holds throughout the competitors that scored well enough to advance to the semi-final round.

Note: The dancers shaded in GREEN advanced to the semi-final round, while dancers NOT shaded in green didn’t advance.

So this raises the question: is this difference in proportions real, or is it due to chance in the random assignment of dancers to one panel over the other? There’s a statistical test we can use to answer this question.

Two-Tailed Test for Equality between Two Population Proportions

We are going to use the two-tailed z-test to test if there is a significant difference between the two proportions in either direction. We are interested in whether one proportion is significantly different from the other, regardless of whether it is larger or smaller.
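
For reference, the pooled two-proportion z-statistic behind this test can be written as below. The symbols x_1, x_2 (couples advancing from each panel) and n_1, n_2 (couples judged by each panel) are notation introduced here for illustration. They are chosen to match the pooled standard error used in the Python code later in this article, and Φ denotes the standard normal CDF.

```latex
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
```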

Statistical Test Assumptions

Random Sampling: The samples must be independently and randomly drawn from their respective populations.
Large Sample Size: The sample sizes must be large enough for the sampling distribution of the difference in sample proportions to be approximately normal. This approximation comes from the Central Limit Theorem.
Expected Number of Successes and Failures: To ensure the normal approximation holds, the number of expected successes and failures in each group should be at least 5.

Our dataset meets all these assumptions.
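
The expected-count assumption is easy to check by hand. Below is a minimal sketch of one way to do it, assuming the pooled proportion is used for the expected counts. The function name and the counts are hypothetical and are not taken from the competition data.

```python
def meets_normal_approximation(successes1, total1, successes2, total2, minimum=5):
    """Return True if every expected success/failure count is at least `minimum`."""
    # Pooled proportion under the null hypothesis of equal proportions
    p_pooled = (successes1 + successes2) / (total1 + total2)
    expected_counts = [
        total1 * p_pooled,        # expected successes, group 1
        total1 * (1 - p_pooled),  # expected failures, group 1
        total2 * p_pooled,        # expected successes, group 2
        total2 * (1 - p_pooled),  # expected failures, group 2
    ]
    return all(count >= minimum for count in expected_counts)


# Hypothetical counts for illustration only (NOT the competition data)
print(meets_normal_approximation(30, 200, 80, 200))  # large groups
print(meets_normal_approximation(1, 8, 2, 9))        # groups too small
```

With the large hypothetical groups, every expected count clears the threshold. With the tiny groups, the expected number of successes falls below 5, so the normal approximation would not be trustworthy.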

Conduct the Test

1. Define our Hypotheses

Null Hypothesis: The proportion of couples advancing to the semi-finals is the same for both panels.

Alt. Hypothesis: The proportion of couples advancing to the semi-finals is NOT the same for both panels.

2. Pick a Statistical Significance level

The default value for alpha is 0.05 (5%). We don’t have a reason to relax this value (e.g. to 10%) or to make it more stringent (e.g. 1%), so we’ll use the default. Alpha represents our tolerance for falsely rejecting the null hypothesis in favor of the alternative hypothesis due to random sampling (i.e. a Type I error).
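
To make the choice of alpha concrete, here is a small sketch that converts a significance level into the two-tailed cutoff for the z-statistic. It uses `statistics.NormalDist` from the Python standard library rather than SciPy so that it runs without extra dependencies. The helper name is hypothetical.

```python
from statistics import NormalDist


def two_tailed_critical_value(alpha):
    """Return the positive z cutoff for a two-tailed test at level alpha."""
    # Half of alpha goes in each tail, so we look up the 1 - alpha/2 quantile
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    cutoff = two_tailed_critical_value(alpha)
    print(f"alpha={alpha:.2f} -> reject the null if |z| > {cutoff:.3f}")
```

A smaller alpha pushes the cutoff further into the tails, which is what "more stringent" means in practice.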

Next, we carry out the test using the Python code provided below.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def plot_two_tailed_test(z_value):
    """Plot the standard normal curve, the critical values, and the observed z-value."""
    # Generate a range of x values
    x = np.linspace(-4, 4, 1000)
    # Get the standard normal distribution values for these x values
    y = stats.norm.pdf(x)
    # Use the magnitude so both tails are shaded correctly when z_value is negative
    z_abs = abs(z_value)
    # Create the plot
    plt.figure(figsize=(10, 6))
    plt.plot(x, y, label='Standard Normal Distribution', color='black')
    # Shade the areas in both tails with red
    plt.fill_between(x, y, where=(x >= z_abs), color='red', alpha=0.5, label='Right Tail Area')
    plt.fill_between(x, y, where=(x <= -z_abs), color='red', alpha=0.5, label='Left Tail Area')
    # Define critical values for alpha = 0.05
    alpha = 0.05
    critical_value = stats.norm.ppf(1 - alpha / 2)
    # Add vertical dashed blue lines for critical values
    plt.axvline(critical_value, color='blue', linestyle='dashed', linewidth=1, label=f'Critical Value: {critical_value:.2f}')
    plt.axvline(-critical_value, color='blue', linestyle='dashed', linewidth=1, label=f'Critical Value: {-critical_value:.2f}')
    # Mark the z-value
    plt.axvline(z_value, color='red', linestyle='dashed', linewidth=1, label=f'Z-Value: {z_value:.2f}')
    # Add labels and title
    plt.title('Two-Tailed Z-Test Visualization')
    plt.xlabel('Z-Score')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True)
    # Save the figure to disk, then show it
    plt.savefig('../images/p-value_location_in_z_dist_z_test_proportionality.png')
    plt.show()

def two_proportion_z_test(successes1, total1, successes2, total2):
    """
    Perform a two-proportion z-test to check if two population proportions are significantly different.

    Parameters:
    - successes1: Number of successes in the first sample
    - total1: Total number of observations in the first sample
    - successes2: Number of successes in the second sample
    - total2: Total number of observations in the second sample

    Returns:
    - z_value: The z-statistic
    - p_value: The p-value of the test
    """
    # Calculate sample proportions
    p1 = successes1 / total1
    p2 = successes2 / total2
    # Combined (pooled) proportion under the null hypothesis
    p_combined = (successes1 + successes2) / (total1 + total2)
    # Standard error of the difference, using the pooled proportion
    se = np.sqrt(p_combined * (1 - p_combined) * (1 / total1 + 1 / total2))
    # Z-value
    z_value = (p1 - p2) / se
    # P-value for two-tailed test; sf is 1 - cdf and stays accurate far out in the tails
    p_value = 2 * stats.norm.sf(np.abs(z_value))
    return z_value, p_value

# NOTE: df (the score table) and panel_1 / panel_2 (lists of judge column names)
# are assumed to have been defined earlier in the analysis.
min_score_for_semi_finals = 7.040
is_semi_finalist = df.PROMEDIO >= min_score_for_semi_finals
# Number of couples scored by panel 1 advancing to semi-finals
successes_1 = df[is_semi_finalist][panel_1].dropna(axis=0).shape[0]
# Number of couples scored by panel 2 advancing to semi-finals
successes_2 = df[is_semi_finalist][panel_2].dropna(axis=0).shape[0]
# Total number of couples that were scored by panel 1
n1 = df[panel_1].dropna(axis=0).shape[0]
# Total number of couples that were scored by panel 2
n2 = df[panel_2].dropna(axis=0).shape[0]
# Perform the test
z_value, p_value = two_proportion_z_test(successes_1, n1, successes_2, n2)
# Print the results
print(f"Z-Value: {z_value:.4f}")
print(f"P-Value: {p_value:.4f}")  # Reported output: P-Value: 0.0000
# Check significance at alpha = 0.05
alpha = 0.05
if p_value < alpha:
    print("The difference between the two proportions is statistically significant.")
else:
    print("The difference between the two proportions is not statistically significant.")
# Generate the plot
plot_two_tailed_test(z_value)

The Z-value is the test statistic we calculated. Notice that it lies far out in the tail of the standard normal distribution.

The plot shows that the calculated Z-value lies far outside the range of z-values we’d expect to see if the null hypothesis were true. The resulting p-value rounds to 0.0000, so we reject the null hypothesis in favor of the alternative.

This means that the difference in proportions is statistically significant and very unlikely to be due to random sampling alone.
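
A reported p-value of 0.0 is a rounded figure, not a literal zero. The sketch below shows how a very small but positive p-value prints as 0.0000 at four decimal places. It uses the standard library's `statistics.NormalDist` instead of SciPy, and the z-value is a hypothetical stand-in, not the one computed from the competition data.

```python
from statistics import NormalDist


def two_tailed_p_value(z_value):
    """Two-tailed p-value for a z-statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z_value)))


z_hypothetical = -6.0  # hypothetical z-value, for illustration only
p_value = two_tailed_p_value(z_hypothetical)

print(f"Four decimals:       {p_value:.4f}")  # rounds down to 0.0000
print(f"Scientific notation: {p_value:.2e}")  # shows the value is tiny but nonzero
```

Printing in scientific notation is a simple way to see how small the p-value actually is.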

17% of dance couples judged by panel 1 advanced to the semi-finals
42% of dance couples judged by panel 2 advanced to the semi-finals

Our first statistical test for bias has provided evidence that there is a positive bias in scores for dancers judged by panel 2: their advancement rate is roughly 2.5x that of dancers judged by panel 1 (42% vs. 17%).
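
To put a number on the size of this gap, the ratio and the difference can be computed from the two rounded percentages reported above. This is a back-of-the-envelope sketch. The variable names are illustrative, and the inputs are the rounded figures, not the raw counts.

```python
# Rounded advancement rates reported in the article
rate_panel_1 = 0.17
rate_panel_2 = 0.42

# Relative and absolute size of the gap
ratio = rate_panel_2 / rate_panel_1
difference = rate_panel_2 - rate_panel_1

print(f"Ratio: {ratio:.2f}x")
print(f"Difference: {difference * 100:.0f} percentage points")
```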

Next we dive into the scoring distributions of each individual judge and see how their individual biases affect their panel’s overall bias.