Tuesday, February 05, 2019

English soccer is not normal

Are wins in football normally distributed? If they are not, that could affect how we should calculate the probability of a team winning.

Baseball wins seem not to follow a normal distribution

There is a great R package of football results by James Curley, engsoccerdata. It includes a function, maketable_all, that generates league tables for many countries over a long time period.

    library(engsoccerdata)
    league <- maketable_all(df = england)

                   team  GP   W   D   L   gf  ga   gd  Pts Pos
    1 Manchester United 962 604 209 149 1856 847 1009 2021   1

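Then create a new column for the share of games won. Despite its name, PercentW is a proportion between 0 and 1 rather than a true percentage; for Manchester United it is 604 / 962 ≈ 0.63.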

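    library(dplyr)
    library(ggplot2)

    league <- league %>%
      mutate(PercentW = W / GP)

    # Histogram of win share per team; bins = 30 is the ggplot2 default, made explicit
    p <- ggplot(data = league, aes(x = PercentW)) +
      geom_histogram(bins = 30) +
      ggtitle("Percentage wins\n in English football league") +
      xlab("Percentage Wins") + ylab("Number of Teams") +
      theme_bw() +
      # centre the title after theme_bw(), which would otherwise reset it
      theme(plot.title = element_text(hjust = 0.5))
    print(p)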

    library(fitdistrplus)
    library(logspline)

    # Fit a normal distribution to the win shares and plot the diagnostics
    x <- league$PercentW
    fit.norm <- fitdist(x, "norm")
    plot(fit.norm)

    shapiro.test(x)

    Shapiro-Wilk normality test

    data: x
    W = 0.96276, p-value = 0.0006663

The p-value is far below 0.05, so the test rejects the hypothesis that the data are normal. English football wins really do not have a normal distribution.
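
As a rough illustration of how to read the test, the sketch below runs shapiro.test on two idealised samples built from theoretical quantiles. These are made-up inputs of my own, not football data: one is bell-shaped and one is strongly skewed. The 0.05 cut-off is the conventional choice and is an assumption here.

```r
# Idealised samples built from theoretical quantiles, so no random seed is needed.
# These are illustrative inputs only, not football data.
probs <- ppoints(100)
bell_shaped <- qnorm(probs)   # quantiles of a standard normal
skewed      <- qexp(probs)    # quantiles of an exponential (long right tail)

alpha <- 0.05                 # conventional significance level (an assumption)

# TRUE when Shapiro-Wilk rejects normality at the chosen level
rejects_normality <- function(values, level = alpha) {
  shapiro.test(values)$p.value < level
}

bell_result   <- rejects_normality(bell_shaped)
skewed_result <- rejects_normality(skewed)
print(c(bell_shaped = bell_result, skewed = skewed_result))
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it.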

Goals per game are also not normally distributed, but I don't think anyone expects them to be.

       
    league <- league %>%
      mutate(GoalsPgame = gf / GP)
    shapiro.test(league$GoalsPgame)

    Shapiro-Wilk normality test

    data: x
    W = 0.92134, p-value = 4.818e-07

And for France:

    Shapiro-Wilk normality test

    data: leagueF$PercentW
    W = 0.98522, p-value = 0.4699

So French football wins might have a normal distribution (thanks to Paul for the correction in the comments). A p-value this large means the test fails to reject normality, which is not the same as proving it. I must check the other leagues in the dataset, as behaviour this different is odd.
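
One way to make that check less repetitive is to wrap the steps in a small helper. This is only a sketch: win_share_normality is a hypothetical name of my own, it assumes a league table with W and GP columns like the one maketable_all returns, and the toy table is made-up data purely to show the call.

```r
# Hypothetical helper: run Shapiro-Wilk on the share of games won.
# Assumes `tbl` has columns W (wins) and GP (games played).
win_share_normality <- function(tbl) {
  share <- tbl$W / tbl$GP
  test <- shapiro.test(share)
  data.frame(
    teams   = length(share),
    W_stat  = unname(test$statistic),
    p_value = test$p.value
  )
}

# Made-up toy table, only to demonstrate the call (not real results)
toy <- data.frame(
  team = c("A", "B", "C", "D", "E", "F"),
  W    = c(20, 15, 12, 10, 8, 5),
  GP   = c(38, 38, 38, 38, 38, 38)
)
toy_result <- win_share_normality(toy)
print(toy_result)

# With engsoccerdata loaded, the same helper could be applied per league, e.g.
# win_share_normality(maketable_all(df = england))
```

Returning a one-row data frame makes it easy to stack the results for several leagues with rbind and compare the p-values side by side.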